Introductory Concepts
Note that some of the following examples are in Minitab format. Excel could also be used. The interpretations are comparable.
Basic Terminology
- Observational Study
- Investigator is passive
- Collect and record data with no intervention
- Experimental Study
- Investigator is active
- Process variables are manipulated
- see example on page 3
- Enumerative Study
- Collect data on a finite, well defined group
- Analytical Study
- Collect sample data to draw conclusions about the population
- Population vs. Sample
Types of Data
- Qualitative
- Categorical
- Quantitative/Numerical
Counts or Measurements
Examples: number of defects, impact strength, length, diameter, voltage, current, weight
- Univariate
Engine – hours used
- Bivariate
Engine – hours used & repair time
- Multivariate
Engine – hours used & repair time & operator experience
Data Structures
- Factorial Study
Process variables are studied under different conditions
See example HO
Factors – process variables
Levels – settings of studied variables
Measurement – Variation
- Validity – appropriate representation of features under study
- Precision – small variation in repeated measures
- Accuracy – on average, produces the correct value
Factorial Study
- Identify several process variables - Factors.
- Collect data for each possible combination of settings of the process variables. Levels are the settings of each variable.
Example: Machine Study
Find the percentage of acceptable pellets under each of the 8 conditions. (Response Variable)
Factor 1: Die Volume (Supervised or managed variable)
Level 1: Low
Level 2: High
Factor 2: Material Flow
Level 1: Current method
Level 2: Manual filling
Factor 3: Mixture Type
Level 1: No Binding Agent
Level 2: Binding agent
There are 2 x 2 x 2 = 8 conditions under which you can collect data. It is a 2 x 2 x 2, or 2^3, factorial study.
Condition / Volume / Flow / Mixture
1 / Low / Current / No Binder
2 / High / Current / No Binder
3 / Low / Manual / No Binder
4 / High / Manual / No Binder
5 / Low / Current / Binder
6 / High / Current / Binder
7 / Low / Manual / Binder
8 / High / Manual / Binder
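The eight conditions above can also be listed mechanically. The following is a minimal Python sketch (Python is not used elsewhere in these notes, and the variable names are illustrative only) that builds every combination of the three two-level factors in the same order as the table.

```python
from itertools import product

# Factor levels as listed in the machine study example.
volume = ["Low", "High"]
flow = ["Current", "Manual"]
mixture = ["No Binder", "Binder"]

# product() varies its last argument fastest, so the factors are passed in
# reverse order and each tuple is flipped; Volume then changes fastest,
# as it does in the table.
conditions = [(v, f, m) for m, f, v in product(mixture, flow, volume)]

for number, (v, f, m) in enumerate(conditions, start=1):
    print(f"{number} / {v} / {f} / {m}")
```

Adding a fourth two-level factor would double the list to 16 conditions, which is why full factorial studies grow quickly.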
Measurement
- Direct vs. Indirect
- Repeatability and Reproducibility
Sampling
- Judgment
- Systematic
- Simple Random Sample
- Stratified
- Cluster
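As a rough illustration of how four of these sampling schemes differ, here is a short Python sketch. The population of 100 numbered items, the sample sizes, and the two strata are all hypothetical choices made for the example. Judgment sampling is not shown because it relies on the investigator's choice rather than a random mechanism.

```python
import random

rng = random.Random(1)            # fixed seed so the sketch is repeatable
population = list(range(1, 101))  # hypothetical item IDs 1..100

# Simple random sample: every subset of size 10 is equally likely.
srs = rng.sample(population, 10)

# Systematic sample: random start, then every k-th item.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample: split into strata, then take a simple random sample in each.
strata = {"first half": population[:50], "second half": population[50:]}
stratified = {name: rng.sample(items, 5) for name, items in strata.items()}

# Cluster sample: pick whole groups at random and keep every item in them.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [item for group in rng.sample(clusters, 2) for item in group]

print("Simple random:", sorted(srs))
print("Systematic:   ", systematic)
print("Stratified:   ", {name: sorted(s) for name, s in stratified.items()})
print("Cluster:      ", cluster_sample)
```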
Variables
- Response variable – the variable being monitored in an experimental study
- Managed variable – investigator chooses settings (manages)
- Concomitant (accompanying) variable – observed but not primary or managed
- Extraneous variables – may affect outcome
Blocking
Randomization
Observational Studies - Descriptive Statistics
Summarize data to determine and describe important distributional characteristics
Graphical and Tabular Treatment of Quantitative Data
Relationship between data type and graph type
Univariate Data
- Dot Diagrams
- Stem-and-leaf plots
- Frequency Distribution Tables
- Histograms
Bivariate Data
- Scatter plots
- Run Charts
Quantiles
Quantiles (and Measures of Relative Position)
- Medians, Quartiles and Quantiles
Boxplots
IQR
- The p quantile of a distribution is a number such that:
A fraction p of the distribution lies to the left and
A fraction (1 - p) lies to the right
- Definition: For a data set of n values, the i-th smallest value is the p quantile, where:
p = (i - 0.5)/n
- Quantile Plots
- Q-Q Plots
- Normal Probability Plots
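The data-set definition of a quantile, p = (i - 0.5)/n for the i-th smallest value, can be sketched in a few lines of Python. The function name and the five measurements below are hypothetical and serve only to show the arithmetic.

```python
def quantile_levels(values):
    """Pair each sorted value with p = (i - 0.5) / n, for i = 1..n."""
    ordered = sorted(values)
    n = len(ordered)
    return [((i - 0.5) / n, x) for i, x in enumerate(ordered, start=1)]


sample = [12.0, 7.5, 9.1, 15.3, 10.4]  # hypothetical measurements
for p, x in quantile_levels(sample):
    print(f"p = {p:.1f}   quantile = {x}")
```

With n = 5 the levels are 0.1, 0.3, 0.5, 0.7 and 0.9, so the middle value is the 0.5 quantile (the median). Plotting the values against p gives a quantile plot.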
Summary Measures
- Measures of Center
- Measures of Spread
- Statistics vs. Parameters
- Interpreting measures
Dot Plots of Numerical Data
MTB > DotPlot 'Purities'.
Dotplot: Purities
.
:
:
:: : : . : .
:: .. ::: :. : : : : :: . .
.:: .: :: ::: :: :: ::: :: :: :. :: . . .: . .
---+------+------+------+------+------+---Purities
49.0 56.0 63.0 70.0 77.0 84.0
MTB > print c1
Data Display
Purities
63 61 67 58 55 50 55 56 52 64
73 57 63 81 64 54 57 59 60 68
58 57 67 56 66 60 49 79 60 62
60 49 62 56 69 75 52 56 61 58
66 67 56 55 66 55 69 60 69 70
65 56 73 65 68 59 62 58 62 66
57 60 66 54 64 62 64 64 50 50
72 85 68 58 68 80 60 60 53 49
55 80 64 59 53 73 55 54 60 60
58 50 53 48 78 72 51 60 49 67
MTB >
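For readers without Minitab, a character dot plot of the same 100 purity values can be approximated in Python. This is only a sketch: it prints one row per value with one dot per observation (a sideways dot plot), not Minitab's layout.

```python
from collections import Counter

purities = [
    63, 61, 67, 58, 55, 50, 55, 56, 52, 64,
    73, 57, 63, 81, 64, 54, 57, 59, 60, 68,
    58, 57, 67, 56, 66, 60, 49, 79, 60, 62,
    60, 49, 62, 56, 69, 75, 52, 56, 61, 58,
    66, 67, 56, 55, 66, 55, 69, 60, 69, 70,
    65, 56, 73, 65, 68, 59, 62, 58, 62, 66,
    57, 60, 66, 54, 64, 62, 64, 64, 50, 50,
    72, 85, 68, 58, 68, 80, 60, 60, 53, 49,
    55, 80, 64, 59, 53, 73, 55, 54, 60, 60,
    58, 50, 53, 48, 78, 72, 51, 60, 49, 67,
]

# Count how many times each purity value occurs.
counts = Counter(purities)

# One row per value from the minimum to the maximum; values that never
# occur get an empty row, which shows the gaps in the data.
for value in range(min(purities), max(purities) + 1):
    print(f"{value:3d} {'.' * counts[value]}")
```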
Stem-and-Leaf Plot of Numerical Data
Set-up Time (min)
110 115 115 120 120 120 120 125 125 125 130 130 130 130 130 135 135 135 140 140 140 140 145 145 150
In MTB, put values in one column and use: Graph/Stem-and-leaf
Stem-and-leaf of Time N = 25
Leaf Unit = 1.0
1 11 0
3 11 55
7 12 0000
10 12 555
(5) 13 00000[CG1]
10 13 555 [CG2]
7 14 0000
3 14 55
1 15 0
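The left-hand column of the display (the depths) can be reproduced with a short Python sketch. It assumes an odd number of values and split stems (leaves 0-4 and 5-9 on separate rows), as in this example; it is not a general replacement for the Minitab command, and rows with no values are simply skipped.

```python
from collections import Counter

times = [110, 115, 115, 120, 120, 120, 120, 125, 125, 125, 130, 130, 130,
         130, 130, 135, 135, 135, 140, 140, 140, 140, 145, 145, 150]

# Each stem is split into two rows, so every row covers 5 minutes.
rows = Counter(t // 5 for t in times)
n = len(times)
median_rank = (n + 1) // 2  # position of the median; assumes n is odd

depths = []
before = 0  # number of values in the rows above the current row
for key in sorted(rows):
    count = rows[key]
    through = before + count
    if before < median_rank <= through:
        depths.append(f"({count})")      # row containing the median
    elif through < median_rank:
        depths.append(str(through))      # cumulative count from the top
    else:
        depths.append(str(n - before))   # cumulative count from the bottom
    before = through

for key, depth in zip(sorted(rows), depths):
    leaves = "".join(str(t % 10) for t in sorted(times) if t // 5 == key)
    print(f"{depth:>4} {key * 5 // 10:>3} {leaves}")
```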
Histogram
Frequency Distribution of Completion Times (min)
Time
LL / UL / MP / f / Rel. freq.
107.5 / 112.5 / 110 / 1 / 0.04
112.5 / 117.5 / 115 / 2 / 0.08
117.5 / 122.5 / 120 / 4 / 0.16
122.5 / 127.5 / 125 / 3 / 0.12
127.5 / 132.5 / 130 / 5 / 0.20
132.5 / 137.5 / 135 / 3 / 0.12
137.5 / 142.5 / 140 / 4 / 0.16
142.5 / 147.5 / 145 / 2 / 0.08
147.5 / 152.5 / 150 / 1 / 0.04
Total / / / 25 / 1.00
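The frequency table can be rebuilt from the 25 times with a few lines of Python. This is a sketch that takes the class limits used above (first lower limit 107.5, class width 5) as given; each value is counted in the class whose limits enclose it.

```python
times = [110, 115, 115, 120, 120, 120, 120, 125, 125, 125, 130, 130, 130,
         130, 130, 135, 135, 135, 140, 140, 140, 140, 145, 145, 150]

lower, width, classes = 107.5, 5.0, 9  # class limits taken from the table

table = []
for k in range(classes):
    ll = lower + k * width                 # lower class limit (LL)
    ul = ll + width                        # upper class limit (UL)
    f = sum(ll < t <= ul for t in times)   # class frequency
    table.append((ll, ul, (ll + ul) / 2, f, f / len(times)))

print("LL / UL / MP / f / Rel. freq.")
for ll, ul, mp, f, rel in table:
    print(f"{ll} / {ul} / {mp:g} / {f} / {rel:.2f}")
print("Total f =", sum(row[3] for row in table))
```

A histogram is a bar chart of the f (or relative frequency) column against the class midpoints.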
Example problem: Describing Data Using Minitab
2.44 page 88 (M & B expanded)
Time
2.1 / 3.7 / 3.5 / 1.4 / 2.4 / 1.3
9 / 4.4 / 12.6 / 11.4 / 8.2 / 18
14.7 / 2 / 2.7 / 23.1 / 18 / 5.8
19.2 / 9.6 / 6.6 / 32.3 / 5.6 / 26.7
4.1 / 6.9 / 16.7 / 3.9 / 9.9 / 0.4
7.4 / 18.4 / 4.3 / 7.4 / 1.6
14.1 / 0.2 / 0.2 / 3.3 / 8.2
8.7 / 1 / 8.3 / 6.1 / 1.2
1.6 / 24 / 2.4 / 0.3 / 13.8
MTB > describe c1
Descriptive Statistics
Variable N Mean Median TrMean StDev SE Mean
Time 50 8.37 6.35 7.61 7.67 1.08
Variable Minimum Maximum Q1 Q3
Time 0.20 32.30 2.33 12.83
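Several of these summary values can be checked outside Minitab. The Python sketch below uses only the standard library and computes N, the mean, the median, the sample standard deviation and the standard error of the mean. The trimmed mean and the quartiles are left out because their definitions differ between packages.

```python
import math
import statistics

time = [
    2.1, 3.7, 3.5, 1.4, 2.4, 1.3,
    9.0, 4.4, 12.6, 11.4, 8.2, 18.0,
    14.7, 2.0, 2.7, 23.1, 18.0, 5.8,
    19.2, 9.6, 6.6, 32.3, 5.6, 26.7,
    4.1, 6.9, 16.7, 3.9, 9.9, 0.4,
    7.4, 18.4, 4.3, 7.4, 1.6,
    14.1, 0.2, 0.2, 3.3, 8.2,
    8.7, 1.0, 8.3, 6.1, 1.2,
    1.6, 24.0, 2.4, 0.3, 13.8,
]

n = len(time)
mean = statistics.mean(time)
median = statistics.median(time)
stdev = statistics.stdev(time)   # sample standard deviation (divisor n - 1)
se_mean = stdev / math.sqrt(n)   # standard error of the mean

print(f"N = {n}")
print(f"Mean = {mean:.2f}, Median = {median:.2f}")
print(f"StDev = {stdev:.2f}, SE Mean = {se_mean:.2f}")
print(f"Minimum = {min(time):.2f}, Maximum = {max(time):.2f}")
```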
Note: From the menu use: Graph/Stem and Leaf
MTB > Stem-and-Leaf 'Time'.
Character Stem-and-Leaf Display
Stem-and-leaf of Time N = 50
Leaf Unit = 1.0
10 0 0000111111
19 0 222223333
24 0 44455
(5) 0 66677
21 0 8888999
14 1 1
13 1 23
11 1 44
9 1 6
8 1 8889
4 2
4 2 3
3 2 4
2 2 6
1 2
1 3
1 3 2
MTB >
MTB > Note: From the menu use: Graph/Histogram
MTB > Histogram 'Time';
SUBC> MidPoint;
SUBC> Bar;
SUBC> ScFrame;
SUBC> ScAnnotation.
Some Notes on Interpreting the Analysis
Look at the histogram and the summary statistics to interpret the data.
- The data are skewed right. Although the time from onset of this illness to its recurrence varies from 0.2 months to 32.3 months (the range), most cases of the illness reoccur between about 0.5 and 9 months.
- The most frequently occurring time (the modal class) is 0.2 to 4.8 months. A reoccurrence time of 32 months is unusual and is considered to be an outlier.
- The mean time to reoccurrence is about 8.4 months, but because the data are skewed right, the median time of 6.4 months is probably a better value to use as the average. [Note: half the cases reoccur in less than 6.4 months and half take longer than 6.4 months to reoccur.]
- The standard deviation of reoccurrence times is 7.7 months. This value indicates roughly the typical distance of the values from the mean.
Scatter Plot of Bivariate Data
MTB > Print c1-c3
Data Display
Row Board Before After
1 1 0.514000 0.510000
2 2 0.505000 0.502000
3 3 0.500000 0.493000
4 4 0.490000 0.486000
5 5 0.503000 0.497000
6 6 0.500000 0.494000
7 7 0.510000 0.502000
8 8 0.508000 0.505000
9 9 0.500000 0.488000
10 10 0.511000 0.486000
11 11 0.505000 0.491000
12 12 0.501000 0.498000
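A fitted line for these 12 (Before, After) pairs can be computed by hand or with a short script. The Python sketch below fits After on Before by ordinary least squares; treating Before as the x variable is an assumption made for the illustration, since the notes do not say which way the plot was drawn.

```python
before = [0.514, 0.505, 0.500, 0.490, 0.503, 0.500,
          0.510, 0.508, 0.500, 0.511, 0.505, 0.501]
after = [0.510, 0.502, 0.493, 0.486, 0.497, 0.494,
         0.502, 0.505, 0.488, 0.486, 0.491, 0.498]

n = len(before)
mean_x = sum(before) / n
mean_y = sum(after) / n

# Sums of squares and cross-products about the means.
sxx = sum((x - mean_x) ** 2 for x in before)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(before, after))

slope = sxy / sxx
intercept = mean_y - slope * mean_x  # the line passes through (mean_x, mean_y)

print(f"After = {intercept:.4f} + {slope:.4f} * Before")
for board, (x, y) in enumerate(zip(before, after), start=1):
    print(f"Board {board:2d}: change = {y - x:+.3f}")
```

The printed changes make it easy to see that every board measured lower after than before.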
Note: From the menu use: Graph/Plot or Stat/Regression/Fitted Line Plot
MTB > nscore c1 c2
MTB > Plot 'Time'*'n scr';
Time / n scr
0.2 / -2.00675
0.2 / -2.00675
0.3 / -1.62352
0.4 / -1.46004
1.0 / -1.32830
1.2 / -1.21627
1.3 / -1.11773
1.4 / -1.02899
1.6 / -0.90931
1.6 / -0.90931
2.0 / -0.80143
2.1 / -0.73443
2.4 / -0.63967
2.4 / -0.63967
2.7 / -0.55034
3.3 / -0.49317
3.5 / -0.43758
3.7 / -0.38331
3.9 / -0.33014
4.1 / -0.27789
4.3 / -0.22639
4.4 / -0.17549
5.6 / -0.12503
5.8 / -0.07489
6.1 / -0.02494
6.6 / 0.02494
6.9 / 0.07489
7.4 / 0.15021
7.4 / 0.15021
8.2 / 0.25206
8.2 / 0.25206
8.3 / 0.33014
8.7 / 0.38331
9.0 / 0.43758
9.6 / 0.49317
9.9 / 0.55034
11.4 / 0.60936
12.6 / 0.67058
13.8 / 0.73443
14.1 / 0.80143
14.7 / 0.87223
16.7 / 0.94770
18.0 / 1.07231
18.0 / 1.07231
18.4 / 1.21627
19.2 / 1.32830
23.1 / 1.46004
24.0 / 1.62352
26.7 / 1.84749
32.3 / 2.24333
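The normal scores listed above can be approximated in Python. The sketch below assumes the score for a value with rank i is the standard normal quantile at (i - 3/8)/(n + 1/4), with tied values sharing their average rank; this formula is an assumption that appears to reproduce the listed values, not a statement taken from these notes.

```python
from statistics import NormalDist

time = [
    2.1, 3.7, 3.5, 1.4, 2.4, 1.3, 9.0, 4.4, 12.6, 11.4, 8.2, 18.0,
    14.7, 2.0, 2.7, 23.1, 18.0, 5.8, 19.2, 9.6, 6.6, 32.3, 5.6, 26.7,
    4.1, 6.9, 16.7, 3.9, 9.9, 0.4, 7.4, 18.4, 4.3, 7.4, 1.6,
    14.1, 0.2, 0.2, 3.3, 8.2, 8.7, 1.0, 8.3, 6.1, 1.2,
    1.6, 24.0, 2.4, 0.3, 13.8,
]


def normal_scores(values):
    """Return one normal score per value, using average ranks for ties."""
    n = len(values)
    ordered = sorted(values)
    rank = {}
    i = 0
    while i < n:
        j = i
        while j + 1 < n and ordered[j + 1] == ordered[i]:
            j += 1
        rank[ordered[i]] = (i + 1 + j + 1) / 2  # average of ranks i+1 .. j+1
        i = j + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((rank[x] - 0.375) / (n + 0.25)) for x in values]


scores = dict(zip(time, normal_scores(time)))
for value in sorted(scores):
    print(f"{value:5.1f} / {scores[value]:.5f}")
```

If the points of Time plotted against these scores fall close to a straight line, a normal distribution is a reasonable description of the data; the strong curvature here is consistent with the right skew noted earlier.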
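[CG1] 5 is the count of values in the row containing the median.
[CG2] 10 is the cumulative count of values from the nearer end of the display; for this row, which lies below the median row, that is the count from the bottom of the figure up.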