The Sampling Distribution of Means

Calculation of Areas Under the Normal Curve

Figure 1. The Normal Curve.

We want to calculate the probability that a random event occurs, given that the probability distribution function is Normal with mean $\mu$ and standard deviation $\sigma$. This probability may be calculated numerically.

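We can do this as follows. The probability of exceeding a certain value x in a normal curve is denoted by Q(x). First, we normalize the values of x by putting:

$$z = \frac{x - \mu}{\sigma}$$

Then, the area under the normal curve to the right of x is approximately:

$$Q(z) = Z(z)\,\left(b_1 t + b_2 t^2 + b_3 t^3 + b_4 t^4 + b_5 t^5\right), \qquad z \ge 0$$

where

$$Z(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}$$

and

$$t = \frac{1}{1 + p\,z}$$

with the constants: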

p =+0.2316419
b1=+0.319381530
b2=-0.356563782
b3=+1.781477937
b4=-1.821255978
b5=+1.330274429

The probability that a given x lies between two values a and b is given simply by $P(a < x < b) = Q(a) - Q(b)$, where a < b and both values are normalized as above.

I. The Calculation of the Probabilities for a Normal Distribution

0) The following are the constants for calculating the area under the Normal curve:

p =+0.2316419

b1=+0.319381530

b2=-0.356563782

b3=+1.781477937

b4=-1.821255978

b5=+1.330274429

1) Set the normalization parameters:

a) offset = μ

b) scale = σ

2) Normalize the data

Nb=(b-μ)/σ

Na=(a-μ)/σ

Nalpha=(alpha-μ)/σ

3) Calculate the probability of exceeding "b"

fQ=Q(Nb)

4) Calculate the probability of not reaching "b"

P=1-fQ

5) Calculate the probability of exceeding "a" and "b"

Qa=Q(Na)

Qb=Q(Nb)

6) Calculate the probability of being between "a" and "b"

a_b=Qa-Qb

7) To approximate the probability of a "point" alpha, define a very narrow band centered on it (0.1 standard deviations wide)

alpha1=Nalpha-.05

alpha2=Nalpha+.05

Q1=Q(alpha1)

Q2=Q(alpha2)

P_alpha=Q1-Q2
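Steps 1 to 7 above can be sketched in Python as follows. This is only an illustration: the function name `normal_probabilities`, the dictionary keys and the example values are assumptions, and `q_exact` uses the standard library's complementary error function as a stand-in for Q rather than the polynomial approximation described in this section.

```python
import math

def q_exact(z):
    """Upper-tail area of the standard normal curve (stand-in for Q)."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def normal_probabilities(mu, sigma, a, b, alpha, q=q_exact):
    # Steps 1-2: normalize the data with offset mu and scale sigma.
    na, nb, nalpha = ((v - mu) / sigma for v in (a, b, alpha))
    # Steps 3-4: probability of exceeding b, and of not reaching b.
    exceed_b = q(nb)
    below_b = 1.0 - exceed_b
    # Steps 5-6: probability of falling between a and b (requires a < b).
    between = q(na) - q(nb)
    # Step 7: "point" probability as a band 0.1 standard deviations wide.
    point = q(nalpha - 0.05) - q(nalpha + 0.05)
    return {"exceed_b": exceed_b, "below_b": below_b,
            "between": between, "point": point}

# Hypothetical example: mean 100, standard deviation 15.
result = normal_probabilities(mu=100.0, sigma=15.0, a=85.0, b=115.0, alpha=100.0)
print(result)
```

Passing a different `q` (for instance the polynomial approximation) leaves the rest of the procedure unchanged.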

Function Q calculates the area under the normal curve to the right of its argument. It first determines whether the argument lies to the right or to the left of the mean value.

a) If to the right, the area is calculated from a direct numerical approximation

b) If exactly at the mean, the area is 1/2

c) If to the left, the area is calculated from the symmetry of the curve

It uses two other functions:

i) Z()

ii) N()

function Q(arg)

if arg>0

return Z(arg)*N(arg)

endif

if arg=0

return 0.5

endif

if arg<0

return 1-Z(-arg)*N(-arg)

endif

function Z(arg)

k=1/sqrt(2*pi)

return k*exp(-arg*arg/2)

function N(arg)

t=1/(1+p*arg)

return t*(b1+t*(b2+t*(b3+t*(b4+t*b5))))
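A runnable Python version of Q, Z and N might look like the sketch below. It uses the constants listed above; the names `z_density` and `n_poly` are illustrative choices, and the comparison against `math.erfc` is added here only as a sanity check.

```python
import math

# Constants of the polynomial approximation, as listed in the text.
P = 0.2316419
B = (0.319381530, -0.356563782, 1.781477937, -1.821255978, 1.330274429)

def z_density(x):
    """Standard normal density, Z(x) = exp(-x^2/2) / sqrt(2*pi)."""
    return math.exp(-x * x / 2.0) / math.sqrt(2.0 * math.pi)

def n_poly(x):
    """Polynomial t*(b1 + t*(b2 + ... + t*b5)) evaluated by Horner's rule."""
    t = 1.0 / (1.0 + P * x)
    acc = 0.0
    for coeff in reversed(B):
        acc = t * (coeff + acc)
    return acc

def q(x):
    """Area under the standard normal curve to the right of x."""
    if x > 0:
        return z_density(x) * n_poly(x)
    if x == 0:
        return 0.5
    # Left of the mean: use the symmetry of the curve.
    return 1.0 - z_density(-x) * n_poly(-x)

# Sanity check against the error-function form of the same tail area.
for x in (-2.0, -1.0, 0.0, 0.5, 1.96, 3.0):
    reference = 0.5 * math.erfc(x / math.sqrt(2.0))
    print(f"{x:5.2f}  q={q(x):.7f}  erfc-based={reference:.7f}")
```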

II. The Generation of Normally Distributed Numbers

An approximation to normally distributed random numbers y may be found from a sequence of uniform random numbers [System/360 Scientific Subroutine Package] using the formula:

$$y = \frac{\sum_{i=1}^{k} x_i - k/2}{\sqrt{k/12}}$$

where

a) $x_i$ is a uniformly distributed random number, $0 < x_i < 1$

b) k is the number of values $x_i$ used

y approaches true normality asymptotically as k approaches infinity.

To reduce execution time, make k=12.

Then:

$$y = \sum_{i=1}^{12} x_i - 6$$

For a given $\mu$ and $\sigma$:

$$y' = \sigma\, y + \mu$$
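A minimal Python sketch of this generator with k = 12 follows. The function name `gauss12`, the seed and the example parameters are assumptions made for illustration; note that the generated values can never fall more than six standard deviations from the mean, because the sum of 12 uniform numbers lies between 0 and 12.

```python
import random
import statistics

def gauss12(mu=0.0, sigma=1.0, rng=random):
    """Approximately normal deviate: sum of 12 uniforms minus 6, then rescaled."""
    y = sum(rng.random() for _ in range(12)) - 6.0
    return mu + sigma * y

# Hypothetical usage: 20000 deviates with mean 10 and standard deviation 2.
rng = random.Random(12345)
values = [gauss12(10.0, 2.0, rng) for _ in range(20000)]
print(statistics.fmean(values), statistics.pstdev(values))
```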

III. The Goodness_of_Fit Test

The theoretical $\chi^2$ distribution may be approximated using Chebyshev polynomials of degree 4. Then:

For 0.005 confidence level, the coefficients (c0, ..., c4) are:

For 0.01 confidence level, the coefficients are:

For 0.05 confidence level, the coefficients are:

If the $\chi^2$ statistic calculated from the observations exceeds the theoretical $\chi^2$ calculated from the polynomials, then the probability that the distribution is NOT normal equals the stipulated confidence level. In other words:

where P(Z) is the probability that the distribution is Gaussian, r is the number of intervals and l is the confidence level.

Sampling Distribution of Means

Statistical Methodology

We want to answer the following question. Q: For any given algorithm A(i), what is the probability that we find a certain minimum value for any given function, given that A(i) is iterated G times?

Since one of our premises is that the functions be selected randomly from the suite, we do not know, a priori, anything about the probability distribution function of the minimum values. To answer Q we rely on the following known theorems from statistical theory.

T1) Any sampling distribution of means (sdom) is distributed normally for a large enough sample size n.

Remark: This is true, theoretically, as $n \to \infty$. However, any n > 20 is generally considered satisfactory. We have chosen n = 36.

T2) In a normal distribution (with mean $\mu$ and standard deviation $\sigma$) approximately 1/10 of the observations lie in each of the following intervals, expressed in standard deviations from the mean: -5 to -1.29; -1.29 to -0.85; -0.85 to -0.53; -0.53 to -0.26; -0.26 to 0; and the symmetrical intervals 0 to 0.26, etc.

Remark: These deciles divide, therefore, the area under the normal curve in 10 unequally spaced intervals. The expected number of observed events in each interval will, however, be equal.
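The claim in T2 can be checked numerically. The sketch below computes the area of each of the ten intervals from the boundaries quoted above, using the error-function form of the normal tail area; the names `q`, `EDGES` and `areas` are illustrative.

```python
import math

def q(z):
    """Area under the standard normal curve to the right of z."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

# Interval boundaries quoted in T2, in standard deviations from the mean.
EDGES = [-5.0, -1.29, -0.85, -0.53, -0.26, 0.0, 0.26, 0.53, 0.85, 1.29, 5.0]

# Area of each interval = tail area at its left edge minus tail area at its right edge.
areas = [q(lo) - q(hi) for lo, hi in zip(EDGES, EDGES[1:])]
for (lo, hi), area in zip(zip(EDGES, EDGES[1:]), areas):
    print(f"{lo:6.2f} to {hi:6.2f}: {area:.4f}")
```

Each area comes out close to, though not exactly, 0.1, because the boundaries are rounded to two decimals.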

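T3) The relation between the population distribution's parameters [which we denote with $\mu$ (the mean) and $\sigma$ (the standard deviation)] and the sdom's parameters (which we denote with $\mu_{\bar X}$ and $\sigma_{\bar X}$) is given by $\mu = \mu_{\bar X}$ and $\sigma = \sqrt{n}\,\sigma_{\bar X}$.

Remark: In our case n = 36, so $\sigma = 6\,\sigma_{\bar X}$.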

T4) The proportion of any distribution found within k standard deviations of the mean is at least $1 - 1/k^2$.

Remark: The generality of Chebyshev's bound makes it quite loose. Tighter bounds are achievable, but they may depend on the characteristics of the distribution under study. We selected k = 4, which guarantees that our observations will lie within 4 standard deviations of the mean with probability of at least 0.9375.
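To illustrate how loose the bound in T4 is, the following sketch compares it with the proportion that a normal distribution actually places within k standard deviations. Both function names are illustrative, and the normal case is shown only as an example of a distribution-specific tighter value.

```python
import math

def chebyshev_lower_bound(k):
    """Minimum proportion within k standard deviations, for any distribution."""
    return 1.0 - 1.0 / (k * k)

def normal_within(k):
    """Proportion within k standard deviations for a normal distribution."""
    return math.erf(k / math.sqrt(2.0))

for k in (2, 3, 4):
    print(k, chebyshev_lower_bound(k), round(normal_within(k), 6))
```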

T5) For a set of r intervals, a number $O_i$ of observed events in the i-th interval, a number $E_i$ of expected events in the i-th interval, p distribution parameters and $\nu = r - p - 1$ degrees of freedom, the following equation holds.

/ (1)

where probability that the distribution is normal.

Remarks: The summation on the left of (1), $\sum_{i=1}^{r} (O_i - E_i)^2 / E_i$, is the $\chi^2$ statistic; the polynomial to the right of the inequality sign (call it $\chi^2_T$) is a least squares Chebyshev polynomial approximation to the theoretical $\chi^2$ for a 95% confidence level. In our case $\nu = 7$ (r = 10 intervals and p = 2 parameters), for which $\chi^2_T = 14.0671$. Furthermore, if we choose the deciles as above, we know that the expected number of observations is the same in every interval: one tenth of the number of samples of size n. A further condition normally imposed on this goodness-of-fit test is that a minimum number of observations (usually between 3 and 5) be required in each interval. Thus, (1) is replaced by

/ (2)

Requiring at least 5 observations in each interval and using the parameter values described above, equation (2) finally takes the following form.

/ (3)
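The test in T5 can be sketched as follows for ten equally probable intervals. The critical value 14.0671 is the one quoted above for 7 degrees of freedom at the 95% level; the function names, the default minimum count of 5 and the example counts are assumptions for illustration and are not data from this document.

```python
# Critical chi-square value quoted in the text (7 degrees of freedom, 95% level).
CHI2_CRITICAL_95_DF7 = 14.0671

def chi_square_statistic(observed):
    """Sum of (O_i - E_i)^2 / E_i with equal expected counts in every interval."""
    expected = sum(observed) / len(observed)
    return sum((o - expected) ** 2 / expected for o in observed)

def passes_normality_test(observed, min_count=5):
    """Both conditions: enough observations per interval and chi-square below the critical value."""
    enough = min(observed) >= min_count
    return enough and chi_square_statistic(observed) <= CHI2_CRITICAL_95_DF7

# Hypothetical counts of sample means in the 10 decile intervals (100 samples).
counts = [12, 9, 8, 11, 10, 10, 9, 12, 8, 11]
stat = chi_square_statistic(counts)
print(stat, passes_normality_test(counts))
```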

We assume that we are exploring a set of algorithms. Each is characterized by a set of parameters (indexed from 1 to n) and a method, which is actually a meta-heuristic (indexed from 1 to m).

Algorithm for the Determination of the Distribution’s Parameters (of a meta-heuristic)

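1. Set the parameter-set index to 1 (determine the parameter set)

2. Set the method index to 1 (determine the meta-heuristic)

3. i ← 1 (count the number of samples)

4. j ← 1 (count the elements of a sample)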

5. A function is selected randomly from the suite.

6. The experiment is performed with this function, and a) the best value and b) the number of satisfied constraints are stored.

7. j ← j + 1

8. If j ≤ 36, go to step 5 (a sample size of 36 is taken to be large enough for the sample means to be approximately normal; see T1).

9. The average of the sample's best fitness values is calculated.

10. i ← i + 1

11. If i ≤ 50, go to step 4

12. According to the central limit theorem, the sample means are normally distributed. We therefore define 10 intervals, each of which is expected to hold 1/10 of the samples assuming a normal distribution; i.e., the intervals are standardized. If the samples are indeed normally distributed, the following 2 conditions should hold.

a) At least 5 observations should be found in each of the 10 intervals (which explains why we test for 50 in step 11).

b) The $\chi^2$ goodness-of-fit test should be satisfied (which we demand at the 95% confidence level).

We therefore check for conditions (a) and (b) above. If they have not both been met, go to step 4.

13. Once we are assured (with probability = 0.95) that the sample means are normally distributed, we calculate the mean $\mu_{\bar X}$ and standard deviation $\sigma_{\bar X}$ of the sampling distribution of the measured mean values of the best fitnesses for this experiment. Moreover, we may calculate the mean and the standard deviation of the distribution of the best values (rather than the means) from $\mu = \mu_{\bar X}$ and $\sigma = \sqrt{n}\,\sigma_{\bar X}$ (see T3). We thereby characterize the statistical behavior of the experiment quantitatively.

14. Increment the method index. If it is ≤ m, go to step 3.

15. Increment the parameter-set index. If it is ≤ n, go to step 2.
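For a single parameter set and meta-heuristic, steps 3 to 13 can be sketched in Python as shown below. This is a sketch under stated assumptions, not the original program: `run_experiment` is a hypothetical stand-in for steps 5 and 6 (it simply draws from an exponential distribution), the cap `max_samples` is added so the loop always ends, and the sample means are standardized with their own mean and standard deviation before being counted in the intervals of T2.

```python
import math
import random
import statistics

# Interior decile boundaries from T2 and the critical value from T5.
EDGES = [-1.29, -0.85, -0.53, -0.26, 0.0, 0.26, 0.53, 0.85, 1.29]
CHI2_CRITICAL = 14.0671
SAMPLE_SIZE = 36    # elements per sample (step 8)
MIN_SAMPLES = 50    # samples taken before the first check (step 11)

def run_experiment(rng):
    """Hypothetical stand-in for steps 5-6: returns one 'best value'."""
    return rng.expovariate(1.0)

def decile_counts(means):
    """Count the standardized sample means falling in each of the 10 intervals."""
    mu = statistics.fmean(means)
    sd = statistics.pstdev(means)
    counts = [0] * 10
    for m in means:
        z = (m - mu) / sd
        counts[sum(z > edge for edge in EDGES)] += 1
    return counts

def characterize(rng, max_samples=5000):
    """Return (mu, sigma, number_of_samples) for the distribution of best values."""
    means = []
    while len(means) < max_samples:
        sample = [run_experiment(rng) for _ in range(SAMPLE_SIZE)]  # steps 4-8
        means.append(statistics.fmean(sample))                      # steps 9-10
        if len(means) < MIN_SAMPLES:                                # step 11
            continue
        counts = decile_counts(means)                               # step 12
        expected = len(means) / 10
        chi2 = sum((c - expected) ** 2 / expected for c in counts)
        if min(counts) >= 5 and chi2 <= CHI2_CRITICAL:
            mu_means = statistics.fmean(means)                      # step 13
            sd_means = statistics.pstdev(means)
            return mu_means, sd_means * math.sqrt(SAMPLE_SIZE), len(means)
    raise RuntimeError("normality conditions not reached")

mu_hat, sigma_hat, n_samples = characterize(random.Random(0))
print(mu_hat, sigma_hat, n_samples)
```

With the exponential stand-in, whose true mean and standard deviation are both 1, the estimates should land near those values.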

Program for the Sampling Distribution of Means

We start by assuming the following data:

Next, we define samples of size 3. For example: (3,3,3); (3,6,3); (4,6,10), etc. There are 1000 possible samples of this size.

1. Directly from the table we calculate the mean and standard deviation, which turn out to be 4.500 and 0.793 respectively. These are the parameters ($\mu$) and ($\sigma$) of the population.

2. Then we calculate the sampling distribution of means by generating all possible samples [(3,3,3);(3,3,4);(3,3,5);...;(10,9,10);(10,10,10)], the average (or mean) for every sample (3, 3.333, 3.667,...,9.667, 10) and the resulting mean and standard deviation of the averages (means). This yields: $\mu_{\bar X}$ = 4.500 and $\sigma_{\bar X}$ = 0.458. From the CLT we know that $\mu = \mu_{\bar X}$ and that $\sigma = \sqrt{n}\,\sigma_{\bar X}$; indeed, $\sqrt{3} \times 0.458 = 0.793$.

3. Now we sample the population as per the algorithm above, to get , and .

In the figure above we show the intervals under the Normal curve, the number of observations in each interval and the percentage of the total number (81), for which $\chi^2$ = 3.814.

4. Finally, we calculate the estimators and , getting and .
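The exhaustive enumeration of step 2 can be sketched as below. The population used here is hypothetical (the data table of this section is not reproduced), so the printed values will differ from the 4.500, 0.793 and 0.458 quoted above; the relations $\mu = \mu_{\bar X}$ and $\sigma = \sqrt{n}\,\sigma_{\bar X}$ nevertheless hold exactly when all ordered samples with replacement are enumerated.

```python
import itertools
import math
import statistics

# Hypothetical population of 10 equally likely elements (NOT the document's table).
population = [3, 3, 4, 4, 4, 5, 5, 6, 7, 10]
n = 3  # sample size

# Population parameters.
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

# All 10**3 ordered samples with replacement, and the mean of each one.
means = [statistics.fmean(s) for s in itertools.product(population, repeat=n)]
mu_means = statistics.fmean(means)
sigma_means = statistics.pstdev(means)

print(len(means), mu, sigma)
print(mu_means, sigma_means, sigma_means * math.sqrt(n))
```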