Introductory Concepts

Note that some of the following examples are in Minitab format. Excel could also be used; the interpretations are comparable.

Basic Terminology

Observational Study
Investigator is passive
Collect and record data with no intervention
Experimental Study
Investigator is active
Process variables are manipulated
see example on page 3
Enumerative Study
Collect data on a finite, well defined group
Analytical Study
Collect sample data to draw conclusions about the population
Population vs. Sample

Types of Data

Qualitative
Categorical
Quantitative/Numerical

Counts or Measurement

Examples : # of defects, impact strength, length, diameter, voltage, current, weight

Univariate

Engine – hours used

Bivariate

Engine – hours used & repair time

Multivariate

Engine – hours used & repair time & operator experience

Data Structures

Factorial Study

Process variables are studied under different conditions

See example HO

Factors – process variables

Levels – settings of studied variables

Measurement – Variation

Validity – appropriate representation of features under study
Precision – small variation in repeated measures
Accuracy – on average, produces the correct value

Factorial Study

Identify several process variables - Factors.
Collect data for each possible combination of settings of the process variables. Levels are the settings of each variable.

Example: Machine Study

Find the percentage of acceptable pellets under each of the 8 conditions. (Response Variable)

Factor 1: Die Volume (Supervised or managed variable)

Level 1: Low

Level 2: High

Factor 2: Material Flow

Level 1: Current method

Level 2: Manual Filling

Factor 3: Mixture Type

Level 1: No Binding Agent

Level 2: Binding agent

There are 2 x 2 x 2 = 8 conditions under which you can collect data. It is a 2 x 2 x 2 or 2^3 factorial study.

Condition / Volume / Flow / Mixture
1 / Low / Current / No Binder

2 / High / Current / No Binder
3 / Low / Manual / No Binder
4 / High / Manual / No Binder
5 / Low / Current / Binder
6 / High / Current / Binder
7 / Low / Manual / Binder
8 / High / Manual / Binder
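
The eight conditions in the table can be generated mechanically. The following is a minimal Python sketch, not part of the original Minitab material; the function name factorial_conditions is made up for illustration. It lists every combination of the three two-level factors in the same order as the table (Volume changing fastest).

```python
from itertools import product

# Factor levels as named in the machine study above.
volume = ["Low", "High"]
flow = ["Current", "Manual"]
mixture = ["No Binder", "Binder"]


def factorial_conditions():
    """Return all 2 x 2 x 2 = 8 combinations, with Volume changing fastest."""
    # product() varies its LAST argument fastest, so the factors are passed
    # in reverse order and each tuple is flipped back to (volume, flow, mixture).
    return [(v, f, m) for m, f, v in product(mixture, flow, volume)]


conditions = factorial_conditions()
for number, (v, f, m) in enumerate(conditions, start=1):
    print(f"{number} / {v} / {f} / {m}")
```

Adding a fourth two-level factor would double the list to 16 conditions, which is why full factorial studies grow quickly.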

Measurement

Direct vs. Indirect
Repeatability and Reproducibility

Sampling

Judgment
Systematic
Simple Random Sample
Stratified
Cluster
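
Three of the probability-based schemes in this list can be sketched in a few lines of Python. This is an illustration only: the sampling frame (units numbered 1 to 100), the two strata and all function names are hypothetical, and the fixed seed is there only to make the sketch repeatable.

```python
import random


def simple_random_sample(units, n, rng):
    """Every subset of size n is equally likely to be chosen."""
    return rng.sample(units, n)


def systematic_sample(units, n, rng):
    """Pick a random start, then take every k-th unit (k = len(units) // n)."""
    k = len(units) // n
    start = rng.randrange(k)
    return [units[start + i * k] for i in range(n)]


def stratified_sample(strata, n_per_stratum, rng):
    """Draw a simple random sample separately within each stratum."""
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}


rng = random.Random(0)                  # fixed seed, for repeatability only
units = list(range(1, 101))             # hypothetical frame: units 1..100
srs = simple_random_sample(units, 10, rng)
systematic = systematic_sample(units, 10, rng)
strata = {"shift A": units[:50], "shift B": units[50:]}   # hypothetical strata
stratified = stratified_sample(strata, 5, rng)
```

Judgment sampling is not shown because it depends on the investigator's choice rather than on a random mechanism.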

Variables

Response variable – the variable being monitored in an experimental study
Managed variable – investigator chooses settings (manages)
Concomitant (accompanying) variable – observed but not primary or managed
Extraneous variables – may affect outcome

Blocking

Randomization

Observational Studies - Descriptive Statistics

Summarize data to determine and describe important distributional characteristics

Graphical and Tabular Treatment of Quantitative Data

Relationship between data type and graph type

Univariate Data

Dot Diagrams
Stem-and-leaf plots
Frequency Distribution Tables
Histograms

Bivariate Data

Scatter plots
Run Charts

Quantiles

Quantiles (and Measures of Relative Position)

Medians, Quartiles and Quantiles

Boxplots

IQR

The p quantile of a distribution (with p between 0 and 1) is a number such that:

A fraction p of the distribution (100p %) lies to the left and

A fraction 1 - p (100(1 - p) %) lies to the right

Definition: For a data set of n values, the i-th smallest value is the p quantile, where:

p = (i - 0.5)/n
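
As a rough illustration of this definition, the Python sketch below computes the p values attached to the ordered data and looks up a quantile for any p. Interpolating linearly between adjacent ordered values is one common choice and is an assumption here; the function names are made up for illustration. The data are the 25 set-up times that appear later in these notes.

```python
def plotting_positions(n):
    """p_i = (i - 0.5) / n for i = 1, ..., n."""
    return [(i - 0.5) / n for i in range(1, n + 1)]


def quantile(values, p):
    """Q(p) under the (i - 0.5)/n convention, interpolating linearly
    between adjacent ordered values (an assumed convention)."""
    ordered = sorted(values)
    n = len(ordered)
    position = n * p + 0.5                  # rank i that solves p = (i - 0.5)/n
    position = min(max(position, 1), n)     # stay inside the data
    lower = int(position)                   # whole part of the rank
    fraction = position - lower
    if lower >= n:
        return ordered[-1]
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


setup_times = [110, 115, 115, 120, 120, 120, 120, 125, 125, 125,
               130, 130, 130, 130, 130, 135, 135, 135, 140, 140,
               140, 140, 145, 145, 150]

median = quantile(setup_times, 0.50)
q1 = quantile(setup_times, 0.25)
q3 = quantile(setup_times, 0.75)
iqr = q3 - q1
print(median, q1, q3, iqr)
```

Statistical packages use slightly different interpolation rules, so quartiles from Minitab or Excel may not match these values exactly.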

Quantile Plots
Q-Q Plots
Normal Probability Plots

Summary Measures

Measures of Center
Measures of Spread
Statistics vs. Parameters
Interpreting measures

Dot Plots of Numerical Data

MTB > DotPlot 'Purities'.

Dotplot: Purities

[Character dot plot of Purities; horizontal axis labeled Purities with ticks at 49.0, 56.0, 63.0, 70.0, 77.0, 84.0]

MTB > print c1

Data Display

Purities

63 61 67 58 55 50 55 56 52 64

73 57 63 81 64 54 57 59 60 68

58 57 67 56 66 60 49 79 60 62

60 49 62 56 69 75 52 56 61 58

66 67 56 55 66 55 69 60 69 70

65 56 73 65 68 59 62 58 62 66

57 60 66 54 64 62 64 64 50 50

72 85 68 58 68 80 60 60 53 49

55 80 64 59 53 73 55 54 60 60

58 50 53 48 78 72 51 60 49 67

MTB >

Stem-and-Leaf Plot of Numerical Data

Set-up Time (min)

110 115 115 120 120 120 120 125 125 125 130 130 130 130 130 135 135 135 140 140 140 140 145 145 150

In MTB, put values in one column and use: Graph/Stem-and-leaf

Stem-and-leaf of Time N = 25

Leaf Unit = 1.0

1 11 0

3 11 55

7 12 0000

10 12 555

(5) 13 00000[CG1]

10 13 555 [CG2]

7 14 0000

3 14 55

1 15 0
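
To see how the three columns of this display arise, the following Python sketch rebuilds it from the 25 set-up times. It is an illustration, not Minitab's own algorithm: it assumes whole-number data with leaf unit 1 and two lines per stem (leaves 0-4, then 5-9), and the function name is made up.

```python
def stem_and_leaf(values):
    """Return display rows 'depth stem leaves' for whole-number data."""
    ordered = sorted(values)
    n = len(ordered)
    # Group leaves by (stem, half): half 0 holds leaves 0-4, half 1 holds 5-9.
    groups = {}
    for value in ordered:
        stem, leaf = divmod(int(value), 10)
        groups.setdefault((stem, leaf // 5), []).append(leaf)
    low, high = min(groups), max(groups)
    # Include empty rows between the first and last occupied rows.
    keys = [(s, h) for s in range(low[0], high[0] + 1) for h in (0, 1)
            if low <= (s, h) <= high]
    rows, from_top = [], 0
    for key in keys:
        leaves = groups.get(key, [])
        from_top += len(leaves)                    # count from the top down
        from_bottom = n - from_top + len(leaves)   # count from the bottom up
        if 2 * from_top >= n + 1 and 2 * from_bottom >= n + 1:
            depth = f"({len(leaves)})"             # row containing the median
        else:
            depth = str(min(from_top, from_bottom))
        rows.append(f"{depth} {key[0]} {''.join(map(str, leaves))}".rstrip())
    return rows


setup_times = [110, 115, 115, 120, 120, 120, 120, 125, 125, 125,
               130, 130, 130, 130, 130, 135, 135, 135, 140, 140,
               140, 140, 145, 145, 150]

for row in stem_and_leaf(setup_times):
    print(row)
```

The first column is the depth: a count accumulated from whichever end of the display is nearer, except in the median row, where the row's own count is shown in parentheses.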

Histogram

Frequency Distribution of Completion Times (min)

Time
Lower limit (LL) / Upper limit (UL) / Midpoint (MP) / Frequency (f) / Relative frequency
107.5 / 112.5 / 110 / 1 / 0.04
112.5 / 117.5 / 115 / 2 / 0.08
117.5 / 122.5 / 120 / 4 / 0.16
122.5 / 127.5 / 125 / 3 / 0.12
127.5 / 132.5 / 130 / 5 / 0.20
132.5 / 137.5 / 135 / 3 / 0.12
137.5 / 142.5 / 140 / 4 / 0.16
142.5 / 147.5 / 145 / 2 / 0.08
147.5 / 152.5 / 150 / 1 / 0.04
Total / / / 25 / 1.00
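
The table above can be reproduced from the raw times. The Python sketch below is illustrative only (the function name and argument names are made up); it counts how many values fall in each class of width 5 starting at 107.5 and reports each count as a fraction of the total.

```python
def frequency_table(values, first_lower, width, classes):
    """Return rows of (lower limit, upper limit, midpoint, count, fraction)."""
    rows = []
    for k in range(classes):
        lower = first_lower + k * width
        upper = lower + width
        # A value belongs to the class when lower <= value < upper.
        count = sum(lower <= v < upper for v in values)
        rows.append((lower, upper, (lower + upper) / 2, count,
                     count / len(values)))
    return rows


setup_times = [110, 115, 115, 120, 120, 120, 120, 125, 125, 125,
               130, 130, 130, 130, 130, 135, 135, 135, 140, 140,
               140, 140, 145, 145, 150]

table = frequency_table(setup_times, first_lower=107.5, width=5, classes=9)
for lower, upper, midpoint, count, fraction in table:
    print(f"{lower} / {upper} / {midpoint:g} / {count} / {fraction:.2f}")
```

Choosing class limits that end in .5 keeps every whole-number observation strictly inside one class, so no value sits on a boundary.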

Example problem: Describing Data Using Minitab

2.44 page 88 (M & B expanded)

Time
2.1 / 3.7 / 3.5 / 1.4 / 2.4 / 1.3
9 / 4.4 / 12.6 / 11.4 / 8.2 / 18
14.7 / 2 / 2.7 / 23.1 / 18 / 5.8
19.2 / 9.6 / 6.6 / 32.3 / 5.6 / 26.7
4.1 / 6.9 / 16.7 / 3.9 / 9.9 / 0.4
7.4 / 18.4 / 4.3 / 7.4 / 1.6
14.1 / 0.2 / 0.2 / 3.3 / 8.2
8.7 / 1 / 8.3 / 6.1 / 1.2
1.6 / 24 / 2.4 / 0.3 / 13.8

MTB > describe c1

Descriptive Statistics

Variable N Mean Median TrMean StDev SE Mean

Time 50 8.37 6.35 7.61 7.67 1.08

Variable Minimum Maximum Q1 Q3

Time 0.20 32.30 2.33 12.83
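
For readers without Minitab, most of these summary values can be recomputed with Python's standard library. The sketch below is an illustration, not the procedure used to produce the output above. The "exclusive" quartile method is assumed because it interpolates at position (n + 1)p; small differences from the printed quartiles are possible, since packages differ in their conventions and the listed data may not match the analyzed data exactly. The trimmed mean is not reproduced.

```python
import math
import statistics

# The 50 times from the example problem, read row by row.
time = [2.1, 3.7, 3.5, 1.4, 2.4, 1.3,
        9.0, 4.4, 12.6, 11.4, 8.2, 18.0,
        14.7, 2.0, 2.7, 23.1, 18.0, 5.8,
        19.2, 9.6, 6.6, 32.3, 5.6, 26.7,
        4.1, 6.9, 16.7, 3.9, 9.9, 0.4,
        7.4, 18.4, 4.3, 7.4, 1.6,
        14.1, 0.2, 0.2, 3.3, 8.2,
        8.7, 1.0, 8.3, 6.1, 1.2,
        1.6, 24.0, 2.4, 0.3, 13.8]

n = len(time)
mean = statistics.mean(time)
median = statistics.median(time)
stdev = statistics.stdev(time)          # sample standard deviation (n - 1 divisor)
se_mean = stdev / math.sqrt(n)          # standard error of the mean
q1, _, q3 = statistics.quantiles(time, n=4, method="exclusive")

print(f"N = {n}, Mean = {mean:.2f}, Median = {median:.2f}")
print(f"StDev = {stdev:.2f}, SE Mean = {se_mean:.2f}")
print(f"Min = {min(time)}, Max = {max(time)}, Q1 = {q1:.2f}, Q3 = {q3:.2f}")
```

The mean exceeding the median is a quick numerical sign of the right skew seen in the plots.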

Note: From the menu use: Graph/Stem and Leaf

MTB > Stem-and-Leaf 'Time'.

Character Stem-and-Leaf Display

Stem-and-leaf of Time N = 50

Leaf Unit = 1.0

10 0 0000111111

19 0 222223333

24 0 44455

(5) 0 66677

21 0 8888999

14 1 1

13 1 23

11 1 44

9 1 6

8 1 8889

4 2

4 2 3

3 2 4

2 2 6

1 2

1 3

1 3 2

MTB >

Note: From the menu use: Graph/Histogram

MTB > Histogram 'Time';

SUBC> MidPoint;

SUBC> Bar;

SUBC> ScFrame;

SUBC> ScAnnotation.

Some Notes on Interpreting the Analysis

Look at the histogram and the summary statistics to interpret the data.

The data are skewed right. Although the time from onset of this illness to its recurrence varies from 0.2 months to 32.3 months (the range), most cases of the illness recur between about 0.5 and 9 months.
The most frequently occurring time (the modal class) is 0.2 to 4.8 months. A recurrence time of 32 months is unusual and is considered to be an outlier.
The mean time to recurrence is 8.4 months, but because the data are skewed right, the median time of 6.4 months is probably a better value to use as the average. [Note: half the cases recur in less than 6.4 months and half take longer than 6.4 months to recur.]
The standard deviation of recurrence times is 7.7 months. This value indicates, roughly, the typical distance of the values from the mean.
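
The notes do not say which rule was used to call 32.3 months an outlier. One common convention, assumed here, is the boxplot rule: flag any value more than 1.5 x IQR beyond the quartiles. The Python sketch below applies that rule using the quartiles printed in the Minitab output (Q1 = 2.33, Q3 = 12.83).

```python
# Quartiles as printed in the Minitab "describe" output above.
q1, q3 = 2.33, 12.83

iqr = q3 - q1                       # interquartile range
lower_fence = q1 - 1.5 * iqr        # values below this would be flagged
upper_fence = q3 + 1.5 * iqr        # values above this are flagged

# The four largest recurrence times in the data set.
largest_times = [23.1, 24.0, 26.7, 32.3]
flagged = [t for t in largest_times if t > upper_fence]

print(f"IQR = {iqr:.2f}, fences = ({lower_fence:.2f}, {upper_fence:.2f})")
print("Flagged as possible outliers:", flagged)
```

Under this rule only the 32.3-month case falls outside the fences, which agrees with the interpretation above. The lower fence is negative, so no small value can be flagged.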

Scatter Plot of Bivariate Data

MTB > Print c1-c3

Data Display

Row Board Before After

1 1 0.514000 0.510000

2 2 0.505000 0.502000

3 3 0.500000 0.493000

4 4 0.490000 0.486000

5 5 0.503000 0.497000

6 6 0.500000 0.494000

7 7 0.510000 0.502000

8 8 0.508000 0.505000

9 9 0.500000 0.488000

10 10 0.511000 0.486000

11 11 0.505000 0.491000

12 12 0.501000 0.498000

From the menu use: Graph/Plot or Stat/Regression/Fitted Line Plot
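
A fitted line plot summarizes the scatter with a least squares line. As a rough illustration of the underlying arithmetic (not Minitab's output), the Python sketch below computes the slope, intercept and correlation for the Before and After columns, treating Before as x and After as y. The function name fit_line is made up.

```python
import math

before = [0.514, 0.505, 0.500, 0.490, 0.503, 0.500,
          0.510, 0.508, 0.500, 0.511, 0.505, 0.501]
after = [0.510, 0.502, 0.493, 0.486, 0.497, 0.494,
         0.502, 0.505, 0.488, 0.486, 0.491, 0.498]


def fit_line(x, y):
    """Least squares slope and intercept, plus the correlation r."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x     # line passes through the means
    r = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r


slope, intercept, r = fit_line(before, after)
print(f"After = {intercept:.4f} + {slope:.4f} * Before   (r = {r:.3f})")
```

A single unusual board (for example, row 10) can pull the line noticeably with only 12 points, so the plot should be inspected alongside the numbers.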

Normal Probability Plot (Time data from the example problem)

MTB > nscore c1 c2

MTB > Plot 'Time'*'n scr';

Time n scr
0.2 -2.00675
0.3 -1.62352
0.4 -1.46004
1.0 -1.32830
1.2 -1.21627
1.3 -1.11773
1.4 -1.02899
1.6 -0.90931
2.0 -0.80143
2.1 -0.73443
2.4 -0.63967
2.7 -0.55034
3.3 -0.49317
3.5 -0.43758
3.7 -0.38331
3.9 -0.33014
4.1 -0.27789
4.3 -0.22639
4.4 -0.17549
5.6 -0.12503
5.8 -0.07489
6.1 -0.02494
6.6 0.02494
6.9 0.07489
7.4 0.15021
8.2 0.25206
8.3 0.33014
8.7 0.38331
9.0 0.43758
9.6 0.49317
9.9 0.55034
11.4 0.60936
12.6 0.67058
13.8 0.73443
14.1 0.80143
14.7 0.87223
16.7 0.94770
18.0 1.07231
18.4 1.21627
19.2 1.32830
23.1 1.46004
24.0 1.62352
26.7 1.84749
32.3 2.24333
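
The normal scores listed above can be approximated outside Minitab. The Python sketch below assumes the plotting position (rank - 3/8)/(n + 1/4), with tied values sharing their average rank. That formula is an assumption, not something stated in these notes, although it appears to reproduce the listed scores. The function name is made up.

```python
from statistics import NormalDist

# The 50 times from the example problem, read row by row.
time = [2.1, 3.7, 3.5, 1.4, 2.4, 1.3,
        9.0, 4.4, 12.6, 11.4, 8.2, 18.0,
        14.7, 2.0, 2.7, 23.1, 18.0, 5.8,
        19.2, 9.6, 6.6, 32.3, 5.6, 26.7,
        4.1, 6.9, 16.7, 3.9, 9.9, 0.4,
        7.4, 18.4, 4.3, 7.4, 1.6,
        14.1, 0.2, 0.2, 3.3, 8.2,
        8.7, 1.0, 8.3, 6.1, 1.2,
        1.6, 24.0, 2.4, 0.3, 13.8]


def normal_scores(values):
    """Map each distinct value to its normal score (assumed formula)."""
    n = len(values)
    rank_sum, count = {}, {}
    for rank, value in enumerate(sorted(values), start=1):
        rank_sum[value] = rank_sum.get(value, 0) + rank
        count[value] = count.get(value, 0) + 1
    standard_normal = NormalDist()          # mean 0, standard deviation 1
    scores = {}
    for value in rank_sum:
        average_rank = rank_sum[value] / count[value]   # ties share a rank
        p = (average_rank - 0.375) / (n + 0.25)
        scores[value] = standard_normal.inv_cdf(p)
    return scores


scores = normal_scores(time)
for value in sorted(scores):
    print(f"{value:5.1f} {scores[value]:9.5f}")
```

If the data were roughly normal, plotting Time against these scores would give an approximately straight line; the right skew of the recurrence times shows up as curvature instead.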


[CG1] 5 is the count of values in the row containing the median.

[CG2] 10 is the cumulative count of values from the nearer end of the display up to and including this row (here, counting from the bottom of the figure up).