Chapter 1: Introduction
1.1 What is Statistics?
Statistics involves collecting, analysing, presenting and interpreting data.
We frequently see statistical tools (such as bar charts, tables, plots of data, averages and percentages) on TV, in newspapers and in magazines. Such methods used to organise and summarise data, so as to increase the understanding of the data, are called descriptive statistics.
Statistics is also used in practice in many different walks of life, going beyond simple data summarisation to answer a wide variety of questions such as:
· Medicine: Does a certain new drug prolong life for AIDS sufferers?
· Science: Is global warming really happening?
· Education: Are GCSE and A level examinations standards declining?
· Psychology: Is the national lottery making us a nation of compulsive gamblers?
· Sociology: Is the gap between rich and poor widening in Britain?
· Business: Do Persil adverts really make us want to buy Persil?
· Finance: What will interest rates be in 6 months' time?
1.2 Populations and Samples
Suppose that we wanted to investigate whether smoking during pregnancy leads to lower birth weight of babies. We use this example to illustrate the following definitions.
Definitions:
· Experimental unit: the object on which measurements are made.
For the above example, we are measuring birth weights of newborn babies, so a unit is a newborn baby.
· Variable: a measurable characteristic of a unit.
For the above example, the variable is birth weight.
· Population: the set of all units about which information is required.
For the above example, the population is all newborn babies.
· Sample: a subset of units of the population for which we can observe the variable of interest.
For the above example, a sample would be the observed birth weights for a set of newborn babies (which will be a subset of all newborn babies).
· Random sample: a sample such that each unit in the population has the same chance of being chosen independently of whether or not any other unit is chosen.
To determine whether smoking during pregnancy leads to lower birth weight of babies, we would compare a random sample of weights of new-born babies whose mothers smoke, with a random sample of weights of new-born babies of non-smoking mothers. By analysing the sample data, we would hope to be able to draw conclusions about the effects on birth weight of smoking during pregnancy for all babies (i.e. the population). The process of using a random sample to draw conclusions about a population is called statistical inference.
If we do not have a random sample, then sampling bias can invalidate our statistical results. For example, birth weights of twins are generally lower than the weights of babies born alone. So if all the non-smoking mothers in the sample were giving birth to twins, whereas all the smoking mothers were giving birth to single babies, then the conclusions we draw about the effects of smoking in pregnancy will not necessarily be correct as they are affected by sampling bias.
Different units of the same population will have different values of the same variable; this is called natural variation. For example, the weights of newborn babies are not all the same. So different samples will contain different data; this is called sampling variability. It is therefore important to bear in mind that slightly different conclusions could be reached from different samples.
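Sampling variability can be seen in a small simulation. The sketch below is illustrative only: the population of birth weights is simulated, and its mean (3.4 kg) and spread (0.5 kg) are assumed values, not real data. The function name draw_sample_mean is likewise hypothetical.

```python
import random
import statistics

# Hypothetical population: 10,000 simulated birth weights in kg.
# The mean (3.4) and standard deviation (0.5) are assumptions for illustration.
rng = random.Random(1)
population = [rng.gauss(3.4, 0.5) for _ in range(10_000)]


def draw_sample_mean(units, n, rng):
    """Mean of a random sample of n units drawn without replacement."""
    sample = rng.sample(units, n)
    return statistics.mean(sample)


# Five different random samples of 50 babies from the same population.
sample_means = [draw_sample_mean(population, 50, rng) for _ in range(5)]
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: mean birth weight = {m:.3f} kg")
```

Each run of the loop draws a different sample from the same population, so the five sample means should differ slightly from one another.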
1.3 Types of Data
Different types of data require different types of analysis. The type of data set is determined by several factors:
· Type of variable:
  - quantitative data, i.e. numerical (e.g., heights of students, number of phone calls in an hour).
  - qualitative data, i.e. non-numerical (e.g., eye colour, gender (M/F)).
Quantitative data can be subdivided further:
  - discrete – a discrete variable can take only particular values (e.g., number of phone calls received at an exchange).
  - continuous – a continuous variable can take any value in a given range (e.g., heights of students).
· Number of variables measured:
  - 1 variable → univariate data.
  - 2 variables → bivariate data. E.g., we may have both the heights and weights of a set of individuals. The data set then consists of pairs of observations on each unit such as (1.7m, 65kg).
  - 3 or more variables → multivariate data. E.g., we have heights, weights, eye colour, gender for a group of individuals. In this case the data set consists of sets of 4 observations made on each unit such as (1.7m, 65kg, blue, M).
· Number of samples: For example, when investigating the effects of smoking during pregnancy, we would observe two samples:
  - a sample of birth weights of babies born to smoking mothers
  - a sample of birth weights of babies born to non-smoking mothers.
· Relationship of samples (if more than 1 sample):
  - Are the samples independent? E.g., the two birth weight samples should be independent.
  - Are the samples dependent?
Example:
Suppose that a doctor would like to assess the effectiveness of changing to a low-fat diet in lowering cholesterol for a group of patients. To do this the doctor might measure the cholesterol of the patients before starting on the low-fat diet and then measure the cholesterol for the same patients after they have been on the low-fat diet. We therefore have 2 samples of measured cholesterol:
· a sample before the diet
· a sample after the diet.
However, the 2 samples are not independent, since the cholesterol measurements for each sample were taken on the same patients. Samples of this type are called matched pair data.
1.4 Recommended Books
You will need to use statistical tables for the course. The tables used in the exams are:
· Lindley, D.V. and Scott, W.F., New Cambridge Elementary Statistical Tables, C.U.P., 1984.
Statistical tables will be used throughout this course.
There are many books which cover the material in this course. Some good books are:
· Ross, S.M., Introduction to Probability and Statistics for Engineers and Scientists (with CD-ROM).
· Walpole, R.E., Myers, R.H., Myers, S.L. and Ye, K., Probability and Statistics for Engineers and Scientists, Prentice Hall, 7th edition, 2002.
· Clarke, G.M., and Cooke, D. A Basic Course in Statistics, Edward Arnold, 4th edition, 1999.
· Daly, F., Hand, D.J., Jones, M.C., Lunn, A.D. and McConway, K.J. Elements of Statistics, Open University, 1995.
Goes beyond what's required for this course, but is quite clearly written with some real examples.
· Devore, J and Peck, R. Introductory Statistics, West, 1990.
Rather simplistic at times, but has lots of real examples. Especially good if you have not done any statistics before.
· Spiegel, M.R., Probability and Statistics, Schaum Outline Series, 1988.
In addition, you could browse in the library around QA276 and find a book which suits you. For starters you could try looking at some of the following.
· Anderson, D.R., Sweeney, D.J. and Williams, T.A. Introduction to Statistics: Concepts and Applications, West, 2nd edition, 1991.
· Bassett, E.E., Bremner, J.M., Jolliffe, I.T., Jones, B., Morgan, B.J.T. and North, P.M., Statistics: Problems and Solutions, Edward Arnold, 1986.
· Moore, D.S., The Basic Practice of Statistics, Freeman, 1995.
· Moore, D.S., Think and Explain with Statistics, Addison-Wesley, 1986.
· Moore, D.S., Statistics: Concepts and Controversies, Freeman, 1991, 1985, 1979.
There are many online books which could be useful. See for example
http://www.statsoft.com/textbook/stathome.html
Chapter 2: Graphical and Numerical Statistics
2.1 Histograms
Histograms give a visual representation of continuous data. We consider two separate cases corresponding to when (i) all the bars in the histogram have the same width; (ii) the intervals are of variable widths.
2.1.1 Histograms with equal class widths
Example:
Mercury contamination can be particularly high in certain types of fish. The mercury content (ppm) in the hair of 40 fishermen in a region thought to be particularly vulnerable is given below. (From the paper “Mercury content of commercially imported fish of the Seychelles, and hair mercury levels of a selected part of the population.” Environ. Research, (1983), 305-312.)
13.26 / 32.43 / 18.10 / 58.23 / 64.00 / 68.20 / 35.35 / 33.92 / 23.94 / 18.28
22.05 / 39.14 / 31.43 / 18.51 / 21.03 / 5.50 / 6.96 / 5.19 / 28.66 / 26.29
13.89 / 25.87 / 9.84 / 26.88 / 16.81 / 38.65 / 19.23 / 21.82 / 31.58 / 30.13
42.42 / 16.51 / 21.16 / 32.97 / 9.84 / 10.64 / 29.56 / 40.69 / 12.86 / 13.80
The first step is to group the data. A reasonable choice of class intervals is:
0-10, 10-20, 20-30, 30-40, 40-50, 50-60, 60-70.
The frequency table that results from the use of these intervals is:
Interval / Frequency
0-10 / 5
10-20 / 11
20-30 / 10
30-40 / 9
40-50 / 2
50-60 / 1
60-70 / 2
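The grouping step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed method: the function name frequency_table is hypothetical, and the convention that a value equal to a boundary goes into the upper interval is an assumption (no mercury value lies exactly on a boundary, so the choice does not affect the counts here).

```python
# Mercury content (ppm) for the 40 fishermen, as listed in the example.
mercury = [
    13.26, 32.43, 18.10, 58.23, 64.00, 68.20, 35.35, 33.92, 23.94, 18.28,
    22.05, 39.14, 31.43, 18.51, 21.03, 5.50, 6.96, 5.19, 28.66, 26.29,
    13.89, 25.87, 9.84, 26.88, 16.81, 38.65, 19.23, 21.82, 31.58, 30.13,
    42.42, 16.51, 21.16, 32.97, 9.84, 10.64, 29.56, 40.69, 12.86, 13.80,
]

# Boundaries 0, 10, ..., 70 define the intervals 0-10, 10-20, ..., 60-70.
boundaries = list(range(0, 71, 10))


def frequency_table(data, boundaries):
    """Count the observations falling in each interval [lower, upper).

    Assumption: a value equal to a boundary is placed in the upper interval.
    """
    table = []
    for lower, upper in zip(boundaries, boundaries[1:]):
        count = sum(lower <= x < upper for x in data)
        table.append((lower, upper, count))
    return table


for lower, upper, count in frequency_table(mercury, boundaries):
    print(f"{lower}-{upper} / {count}")
```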
To construct the histogram in this situation (i.e. all class widths equal):
· Mark boundaries of the class intervals on the horizontal axis.
· The height of the bars above each interval can be taken as the frequency for that interval.
Instead of using frequencies to give the heights of the rectangles in a histogram, relative frequencies may be used. The relative frequency for an interval is that interval's frequency divided by the total frequency.
So for the mercury example:
Interval / Frequency / Relative frequency
0-10 / 5 / 0.125
10-20 / 11 / 0.275
20-30 / 10 / 0.250
30-40 / 9 / 0.225
40-50 / 2 / 0.050
50-60 / 1 / 0.025
60-70 / 2 / 0.050
Total / 40 / 1
The relative frequencies can be expressed as percentages (which is how Minitab produces a relative frequency histogram):
Notice that the shape of the histograms, whether using frequencies or relative frequencies, is the same.
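As a quick check of the arithmetic, the relative frequencies and percentages for the mercury example can be computed from the frequencies. The variable names below are illustrative only.

```python
# Frequencies for the intervals 0-10, 10-20, ..., 60-70 in the mercury example.
frequencies = [5, 11, 10, 9, 2, 1, 2]

total = sum(frequencies)  # total frequency (number of observations)

# Relative frequency = frequency / total frequency.
relative = [f / total for f in frequencies]

# The same quantities expressed as percentages.
percentages = [100 * r for r in relative]

for f, r, p in zip(frequencies, relative, percentages):
    print(f"{f:>3} / {r:.3f} / {p:.1f}%")
```

Because every frequency is divided by the same total, the bar heights change only by a constant factor, which is why the two histograms have the same shape.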
2.1.2 Histograms with unequal class widths
There is no hard and fast rule as to how many intervals should be used. Too many classes produce an uneven distribution, but having too few loses information. Usually the number of classes is about 6-20. The more observations we have, the more classes we will usually use.
The width of the intervals defining the histograms need not all be equal. It is often sensible to choose short intervals where the data is quite dense but intervals with a longer width where the data is more sparse. This will ensure that we don’t have too many intervals with zero frequency, yet keeps as much information about the distributional shape of the data as possible.
When unequal interval widths are used, then the frequency density should be used on the vertical scale on the histogram, where
Frequency density = Frequency ÷ class width.
Example:
The lengths (in metres) of 250 vehicles aboard a cross-channel ferry are summarised in the following table:
Vehicle length (m) / Class width / Frequency / Frequency density
3.0-4.0 / 1 / 90 / 90
4.0-4.5 / 0.5 / 80 / 160
4.5-5.0 / 0.5 / 40 / 80
5.0-5.5 / 0.5 / 24 / 48
5.5-7.5 / 2 / 16 / 8
Notice that if we had simply defined the heights of the rectangles to be the frequencies, then the histogram would exaggerate, for example, the incidence of vehicles between 3 and 4 metres in length, because that class is twice as wide as each of the next three classes.
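The frequency densities for the vehicle length data can be reproduced as follows. This is a small sketch; the function name frequency_densities is hypothetical.

```python
# Vehicle length classes as (lower bound in m, upper bound in m, frequency).
classes = [
    (3.0, 4.0, 90),
    (4.0, 4.5, 80),
    (4.5, 5.0, 40),
    (5.0, 5.5, 24),
    (5.5, 7.5, 16),
]


def frequency_densities(classes):
    """Return frequency divided by class width for each class."""
    return [freq / (upper - lower) for lower, upper, freq in classes]


densities = frequency_densities(classes)
for (lower, upper, freq), d in zip(classes, densities):
    # Bar area = height x width, which recovers the class frequency.
    area = d * (upper - lower)
    print(f"{lower}-{upper} m: density {d:g}, bar area {area:g}")
```

With frequency density as the bar height, the area of each bar equals the frequency of its class, so wide classes are no longer over-represented.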
An alternative way of producing a histogram in situations where not all class widths are equal is to set the bar height to be the relative frequency density. This is given by:
Relative freq. density = Relative freq. ÷ class width.
If the histogram is produced in this way, then the total area of all the bars is 1.
Example (continued)
The relative frequency densities for the vehicle length data are as follows:
Vehicle length (m) / Class width / Frequency / Relative freq. / Rel. freq. density
3.0-4.0 / 1 / 90 / 0.36 / 0.36
4.0-4.5 / 0.5 / 80 / 0.32 / 0.64
4.5-5.0 / 0.5 / 40 / 0.16 / 0.32
5.0-5.5 / 0.5 / 24 / 0.096 / 0.192
5.5-7.5 / 2 / 16 / 0.064 / 0.032
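The claim that the bars have total area 1 can be checked numerically. The sketch below recomputes the relative frequency densities from the class frequencies; the variable names are illustrative.

```python
import math

# Vehicle length classes as (lower bound in m, upper bound in m, frequency).
classes = [
    (3.0, 4.0, 90),
    (4.0, 4.5, 80),
    (4.5, 5.0, 40),
    (5.0, 5.5, 24),
    (5.5, 7.5, 16),
]

total = sum(freq for _, _, freq in classes)  # 250 vehicles in all

rel_density = []
total_area = 0.0
for lower, upper, freq in classes:
    width = upper - lower
    density = (freq / total) / width  # relative freq. divided by class width
    rel_density.append(density)
    total_area += density * width     # area of this bar

print([round(d, 3) for d in rel_density])
print("total area is 1:", math.isclose(total_area, 1.0))
```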
The corresponding histogram can then be produced:
2.1.3 Histogram shapes
Histograms are very useful for giving some idea of the shape of the underlying density: imagine approximating the outline of the histogram by a smooth curve.
Densities can take many different shapes:
[Figure: sketches of density shapes. Panels: unimodal, bimodal, multimodal; symmetric, positive skew, negative skew; normal, heavy-tailed, light-tailed.]
2.1.4 Histograms for discrete data
Discrete data is usually illustrated using a bar-line chart (or a bar chart), whilst histograms are generally used for continuous data. However, when the number of possible values for the observations is large, a bar diagram would become uninformative. In this case it is acceptable to group the values into class intervals, much as you would for continuous data.
Example:
Suppose we have the following data:
1 / 1 / 2 / 2 / 2 / 3 / 3 / 4 / 4 / 5 / 5 / 5 / 5 / 6 / 6 / 7 / 7 / 7
8 / 9 / 9 / 9 / 9 / 10 / 10 / 10 / 10 / 10 / 11 / 11 / 11 / 11 / 12 / 12 / 12 / 12
13 / 13 / 13 / 13 / 14 / 14 / 14 / 14 / 14 / 14 / 15 / 15 / 15 / 15 / 15 / 16 / 16 / 16
17 / 17 / 17 / 18 / 18 / 19 / 19 / 20 / 21 / 21 / 22 / 22 / 23 / 23 / 24 / 26 / 27 / 29
As there are a large number of different values here, to get a better idea of the shape of the distribution, we can group data into classes. Let's consider grouping all observations between 1 - 3, 4 - 6 and so on. To draw a histogram we need a continuous scale and so we need to define our histogram intervals to be 0.5 - 3.5, 3.5 - 6.5, and so on. (Remember: a histogram never has gaps between the bars).
We then get the following frequency distribution:
Interval / Frequency
0.5 - 3.5 / 7
3.5 - 6.5 / 8
6.5 - 9.5 / 8
9.5 - 12.5 / 13
12.5 - 15.5 / 15
15.5 - 18.5 / 8
18.5 - 21.5 / 5
21.5 - 24.5 / 5
24.5 - 27.5 / 2
27.5 - 30.5 / 1
The histogram can now be drawn in the normal way.
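The grouping of discrete values into classes of three, with boundaries extended by 0.5 on each side, can be sketched as below. The counts are computed directly from the data values listed in the example; the helper name class_index is hypothetical.

```python
from collections import Counter

# The discrete observations listed in the example.
observations = [
    1, 1, 2, 2, 2, 3, 3, 4, 4, 5, 5, 5, 5, 6, 6, 7, 7, 7,
    8, 9, 9, 9, 9, 10, 10, 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12,
    13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 15, 15, 15, 15, 15, 16, 16, 16,
    17, 17, 17, 18, 18, 19, 19, 20, 21, 21, 22, 22, 23, 23, 24, 26, 27, 29,
]


def class_index(value, first=1, size=3):
    """Index of the class containing an integer value: 1-3 -> 0, 4-6 -> 1, ..."""
    return (value - first) // size


counts = Counter(class_index(v) for v in observations)

for k in range(max(counts) + 1):
    low = 1 + 3 * k   # smallest integer in the class
    high = low + 2    # largest integer in the class
    # Extend each class by 0.5 on both sides so adjacent bars share a boundary.
    print(f"{low - 0.5} - {high + 0.5} / {counts[k]}")
```

Extending the classes to half-integer boundaries puts each integer value strictly inside one interval, so no observation falls on a boundary and the bars have no gaps between them.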