Key Facts for S2 OCR (MEI)

Poisson Distribution

This distribution applies to events that happen randomly, are independent of each other and happen at a finite and constant average rate. It is usually applied to rare events, but rarity is not essential unless you are using it as an approximation to the binomial distribution.

The parameter is λ, the long term mean. The distribution is

P(X = r) = e^(-λ) λ^r / r!,   r = 0, 1, 2, ...

E(X) = λ

Var(X) = λ

It sometimes makes the sums easier to use the recurrence relation

P(X = r) = P(X = r-1) × λ/r
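As a quick sketch (not part of the specification), the recurrence can be used to build up the whole distribution starting from P(X = 0) = e^(-λ). The value λ = 2 here is just for illustration:

```python
import math

def poisson_pmf(lam, r_max):
    """Build P(X = r) for r = 0..r_max using the recurrence
    P(X = r) = P(X = r-1) * lam / r, starting from P(X = 0) = e^(-lam)."""
    probs = [math.exp(-lam)]
    for r in range(1, r_max + 1):
        probs.append(probs[-1] * lam / r)
    return probs

probs = poisson_pmf(2.0, 5)

# Cross-check one term against the direct formula e^(-2) * 2^3 / 3!
direct = math.exp(-2.0) * 2.0**3 / math.factorial(3)
```

The recurrence avoids recomputing the exponential, the power and the factorial at every step, which is exactly why it "makes the sums easier" by hand too.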

The sum of two or more Poisson distributions is also Poisson. The resulting mean (and therefore also variance) is the sum of the individual means.

If you are asked to test whether data is Poisson distributed:

Check that the variance is approximately equal to the mean.

Estimate λ by the sample mean and work out the expected frequencies from the Poisson model.

Check they agree with the observed frequencies.
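The checks above can be sketched as follows; the observed frequencies are made up for illustration (index r = number of events per interval, value = how often that count occurred):

```python
import math

observed = [13, 27, 27, 18, 9, 6]   # hypothetical data
n = sum(observed)
mean = sum(r * f for r, f in enumerate(observed)) / n
var = sum(f * (r - mean) ** 2 for r, f in enumerate(observed)) / n

# Check 1: for Poisson data, mean and variance should be roughly equal.
# Check 2: expected frequencies from the Poisson model with lambda = mean,
# built with the recurrence P(X = r) = P(X = r-1) * lam / r.
lam = mean
expected = []
p = math.exp(-lam)
for r in range(len(observed)):
    expected.append(n * p)
    p *= lam / (r + 1)
# Check 3: compare expected with observed, cell by cell.
```

Here mean ≈ 2.01 and variance ≈ 1.87, which are close enough to be consistent with a Poisson model.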

The Poisson distribution is an excellent approximation to the binomial distribution B(n,p) provided that the event is rare, i.e. p is small, and the number of trials n is large.

λ = np must also not be too large (typically less than 10).
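A quick illustration with made-up numbers: comparing B(1000, 0.003) against its Poisson approximation with λ = np = 3, the probabilities agree to within a fraction of a percent:

```python
import math

def binomial_pmf(n, p, r):
    return math.comb(n, r) * p**r * (1 - p)**(n - r)

def poisson_pmf(lam, r):
    return math.exp(-lam) * lam**r / math.factorial(r)

# B(1000, 0.003): n large, p small, lam = np = 3 (well under 10).
n, p = 1000, 0.003
lam = n * p
gap = max(abs(binomial_pmf(n, p, r) - poisson_pmf(lam, r)) for r in range(11))
```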

Examples might be the number of vehicles going down a lonely country lane and the number of defects per km in electric cable.

Normal Distribution

Many naturally occurring things, such as the height of a person, number of dandelions per square mile etc can be accurately modelled using the Normal Distribution. This also applies to other things when the outcome is the result of many independently varying components.

The distribution is symmetrical about the mean and the width depends on the variance. So we would need hundreds of different curves to model all the things that vary with a normal distribution. Fortunately there is an easy way to normalise the problem so that just one curve can be used.

The standardised variable is called Z, and its cumulative probability is written Φ(z). It has a mean of zero and a variance of 1.

We are always interested in the area under part of the curve. The symmetry above leads to possible exam questions. Tables always give the cumulative probability

P(Z < n), i.e. the area to the left of z = n.

Remember that P(Z < n) = 1 - P(Z > n).

Because of the symmetry only values of n > 0 are given. You have to remember that

P(Z < n) for n < 0 is the same as P(Z > -n) = 1 - P(Z < -n).

In practice you would sketch the curve and this would be obvious. Practise it; it will come up.

In order to use the tables with real problems the data is normalised so that

Z = (X – μ) / σ

Examiners love this. They can give two sets of data that lead to two simultaneous equations in μ and σ. You have to solve these.
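A sketch of the standard trick, with invented conditions P(X < 50) = 0.95 and P(X < 30) = 0.25, and the corresponding inverse-Φ values read from tables:

```python
# Standardising each condition gives a linear equation in mu and sigma:
#   (50 - mu) / sigma = z1,   (30 - mu) / sigma = z2
z1 = 1.645    # Phi^-1(0.95) from tables
z2 = -0.674   # Phi^-1(0.25) from tables

# Subtract the equations: 50 - 30 = sigma * (z1 - z2)
sigma = (50 - 30) / (z1 - z2)
mu = 50 - z1 * sigma
```

The key step is subtracting the two simultaneous equations to eliminate μ first.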

For the Poisson distribution as λ increases then the curve becomes more symmetrical and bell shaped. Therefore the Normal Distribution N(λ,λ) can be used as an approximation to the Poisson distribution for large λ. In practice this is done for values larger than 10.

Similarly, for a binomial distribution B(n,p) (from S1), if p is not too close to 0 or 1 and n is large, then the curve is again bell shaped and symmetrical and the Normal Distribution N(np, npq), where q = 1 - p, can be used.

The Normal Distribution is a continuous distribution. However, it can be used with discrete data provided that the steps in possible values are small compared with the standard deviation and provided a continuity correction is applied.
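A sketch of both ideas together, with an invented calculation: approximating P(X ≤ 25) for X ~ Poisson(20) by N(20, 20), using 25.5 as the boundary for the continuity correction:

```python
import math

def phi(z):
    """Standard normal cumulative probability via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

lam = 20
# Continuity correction: the discrete event X <= 25 becomes X < 25.5.
approx = phi((25.5 - lam) / math.sqrt(lam))

# Exact Poisson cumulative probability for comparison.
exact = sum(math.exp(-lam) * lam**r / math.factorial(r) for r in range(26))
```

With λ = 20 (comfortably above 10) the approximation agrees with the exact value to within a few parts in a thousand.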

Sampling and Hypothesis Testing

For samples of size n drawn from a Normal distribution with mean μ and variance σ², the distribution of sample means is normal with mean μ and variance σ²/n.

The standard deviation of the means is σ/√n

The sampled data can be used to test the null hypothesis that the population has some particular mean μ0. The test statistic is z = (x̄ - μ0) / (σ/√n).
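A sketch with invented numbers (μ0 = 100, σ = 15, n = 36, sample mean 104.5, two-tail test at the 5% level):

```python
import math

def phi(z):
    """Standard normal cumulative probability via the error function."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))

mu0, sigma, n, xbar = 100, 15, 36, 104.5

# Test statistic: how many standard errors the sample mean is from mu0.
z = (xbar - mu0) / (sigma / math.sqrt(n))

p_value = 2 * (1 - phi(abs(z)))   # two-tail p-value
reject = abs(z) > 1.96            # 5% two-tail critical value from tables
```

Here z = 1.8, which is inside the critical value 1.96, so the null hypothesis is not rejected at the 5% level.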

Contingency Tables

If data is collected for two variables that can each take two values and it is presented in a two by two table, then it is said to form a 2 × 2 contingency table. More categories per variable give bigger tables, e.g. 3 × 4.

The question arises whether the data for each of the variables is independent or associated. This is tested using the χ2 distribution curve as follows:

Present the data as an m x n contingency table with the first variable having m possible categories of results and the second n categories. Each cell is then the observed frequency of the particular m and n combination of categories.

Calculate the marginal (i.e. row and column) totals for the table.

Calculate the expected frequency in each cell using

(row total x column total) / sample size

Calculate the X² statistic: X² = Σ (fo - fe)² / fe

where fo is the observed frequency and fe is the expected frequency in each cell.

Calculate the degrees of freedom for the test using (m-1)(n-1) for an m x n table

Look up the critical value for χ2 for the required significance level in the table in your formula book

If the value of X² is less than this critical value then the null hypothesis H0 is accepted, i.e. the variables are independent.
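The whole procedure for a made-up 2 × 2 table, following the steps above:

```python
# Hypothetical 2 x 2 contingency table: rows = variable A, cols = variable B.
observed = [[30, 20],
            [20, 30]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
total = sum(row_totals)

# Expected frequency in each cell: row total * column total / sample size,
# then accumulate X^2 = sum of (fo - fe)^2 / fe over all cells.
x2 = 0.0
for i, row in enumerate(observed):
    for j, fo in enumerate(row):
        fe = row_totals[i] * col_totals[j] / total
        x2 += (fo - fe) ** 2 / fe

dof = (len(observed) - 1) * (len(observed[0]) - 1)   # (m-1)(n-1)
# Critical value for chi-squared, 1 dof, 5% level (from tables): 3.841
independent = x2 < 3.841
```

For this table X² = 4.0 with 1 degree of freedom, which exceeds 3.841, so the null hypothesis of independence would be rejected at the 5% level.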

The scope for errors in the arithmetic is large – good luck

Correlation

The independent or controlled variable, if there is one, should be plotted along the x axis and dependent variable along the y axis.

For S1 you need to be able to quantify this amount of correlation. There are many such measures. The one you need for S1 is called Pearson’s Product Moment Correlation Coefficient. Fortunately the rather complicated formula is in the book and can be broken down into components called product moments - Sxx, Syy and Sxy.

Sxx = Σ(xi - xm)², where the data have values xi and mean xm

Syy = Σ(yi - ym)²

and Sxy = Σ(xi - xm)(yi - ym)

These can also be written in terms of Σxi² etc., as in the formula book. The data will usually be given so that these formulae can be applied easily.
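A sketch with invented data, computing the S-sums and the PMCC directly from the definitions above:

```python
import math

# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

n = len(xs)
xm = sum(xs) / n
ym = sum(ys) / n

# The three product-moment sums.
sxx = sum((x - xm) ** 2 for x in xs)
syy = sum((y - ym) ** 2 for y in ys)
sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))

r = sxy / math.sqrt(sxx * syy)   # Pearson's PMCC
```

For this data r ≈ 0.85, i.e. fairly strong positive correlation.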

The value will probably be near ±1, corresponding to strong positive or negative correlation.

Sometimes the data will have to be plotted or be given in graphical form. The correlation coefficient only has meaning if the underlying relationship is linear. There may be very strong correlation corresponding to some other relationship (e.g. square law) in which case the coefficient has no meaning.

It could be that there is no correlation between the variables and that by chance the actual points on the graph just happen to show correlation. The PMCC can be used as a test statistic to see if this could be true at a particular significance level.

For S2 the null hypothesis is always that there is no correlation and the result has happened by chance.

The test is simply to look up the critical value for the PMCC in tables for the given number of points, n, and the required significance level. If the critical value is further from 0 than the actual value then at the given significance level the alternative hypothesis is not accepted. Care must be taken with regard to one or two tail tests. If the test is simply whether there is any correlation then the two tail column should be used. If it is specifically testing whether there is, say, positive correlation then the one tail column should be used.

Sometimes data is presented in terms of ranks, e.g. the results of two judges scoring a skating competition. In this case the correlation coefficient is usually expressed as Spearman's coefficient of rank correlation and is given by (in the data book)

rs = 1 - 6Σdi² / (n(n² - 1)), where di is the difference in the ranks given to the ith item.
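A sketch with made-up ranks for six skaters from two judges:

```python
# Hypothetical ranks awarded by two judges to six skaters.
rank_a = [1, 2, 3, 4, 5, 6]
rank_b = [2, 1, 4, 3, 6, 5]

n = len(rank_a)
d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))

# Spearman's coefficient of rank correlation.
rs = 1 - 6 * d2 / (n * (n * n - 1))
```

Here every pair of ranks differs by one, giving Σdi² = 6 and rs = 1 - 36/210 ≈ 0.83.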

In fact, provided there are no tied ranks, the numerical answer is the same as would be obtained by applying the product moment coefficient to the ranks, but the working is much simpler.

In general, however, the product moment coefficient should only be used where the data is related in a linear manner, whereas the rank coefficient is appropriate for any relationship in which one variable generally increases (or decreases) as the other increases.

Once again the correlation coefficient can be used as a test statistic just as for the PMCC case. There are separate tables with different critical values for the Rank correlation coefficient. Be careful to use the right table.

Regression

For GCSE the line of best fit was drawn by eye. Engineers still do it this way!

For S1 you have to be more precise and define the line that makes the sum of the squares of the errors least. This is called the regression line. Actually there are two such lines. If you make the sum of the squares of the y errors least then you are defining the regression of y on x. It could also be done the other way round, in which case you would be finding the regression of x on y.

It turns out that the parameters Sxx and Sxy (and Syy for x on y) are used to define the regression line.

If the line is y = mx + c then m = Sxy/Sxx, and c can be found from ym = m·xm + c, since the line passes through the mean point (xm, ym).

For the regression of x on y you just swap the x and y symbols.
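A sketch of the regression of y on x, with invented data:

```python
# Hypothetical paired data.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 6]

n = len(xs)
xm = sum(xs) / n
ym = sum(ys) / n
sxx = sum((x - xm) ** 2 for x in xs)
sxy = sum((x - xm) * (y - ym) for x, y in zip(xs, ys))

# Regression of y on x: y = m*x + c with m = Sxy/Sxx,
# and c found from the line passing through the mean point (xm, ym).
m = sxy / sxx
c = ym - m * xm

def predict(x):
    """Only valid within the range of the data."""
    return m * x + c
```

A quick sanity check: the fitted line always passes through the mean point, so predict(xm) should return ym exactly.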

If the data has been coded then just follow the above and then substitute Y = ay + b and so on at the end.

Remember you can only use the regression line for predictions within the range of the data. Predicting y from x requires the regression of y on x. Similarly, predicting x from y requires the regression of x on y.

THAT’S ALL FOLKS

It’s an engineers’ nightmare really

Richard Vincent

17th November 2005