Stat 3503/3602, Unit 1: Partial Solutions

1.1.1. “Here is an alternate way to prepare the worksheet. Follow through the steps, cutting and pasting data where appropriate. What menu choices would produce the same results? [Look at the DATA menu.] Explain what each command does. Compare c13 and c14 with c1 and c2.”

Explanations of commands:

MTB > name c11 'Fresh' c12 'Stored' #[Puts names atop c11 and c12 in the Worksheet.]

MTB > set c11 #[Shift the focus to column c11.]

DATA> 10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6 #[Data can be typed or pasted.]

DATA> end #[Data goes into c11 of Worksheet. Remove the focus from column c11.]

MTB > set c12 #[Shift the focus to column c12.]

DATA> 9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9 #[Data can be typed or pasted.]

DATA> end #[Data goes into c12 of Worksheet. Remove the focus from column c12.]

MTB > stack c11 c12 c13; #[Stack the data from column C11 and C12 into C13.]

SUBC> subs c14. #[Put the group designations into column C14.]

The function of each command is shown in brackets to its right. Once these commands have run, column c13 is identical to column c1 and column c14 is identical to column c2.
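For comparison, the same stacking step can be sketched in R, the other language used in this unit. This is only an illustration and not part of the Minitab solution: the names `fresh`, `stored`, and `stacked` are assumptions, and base R's `stack` function stands in for the `stack` command with its `subs` subcommand. Note that R labels the groups by name, whereas Minitab's subscripts are the numbers 1 and 2.

```r
# Hypothetical R analogue of the Minitab 'stack' command (names are illustrative).
fresh  <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6)  # like c11
stored <- c(9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9)        # like c12

# stack() returns a data frame: 'values' holds the stacked data (like c13),
# and 'ind' holds the group designation of each value (like c14).
stacked <- stack(list(Fresh = fresh, Stored = stored))

head(stacked, 3)      # first three Fresh observations
table(stacked$ind)    # number of observations in each group
```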

Using menus:

DATA ► Unstack Columns

Cursor in field "Unstack the data in"; double-click on c1 in Column List.

Cursor in field "Using subscripts in"; double-click on c2 in Column List.

Select radio button "After last column in use"

Select (if not already selected) check box "Name the columns..."

Click OK

Result: Unstacked data are now in columns c15 and c16 with names based on the name of c1.

1.1.2. “In the process of working Problem 1.1.1 you put the data for each group into a separate column (c11 and c12). Data in separate columns are said to be in "unstacked" format. Look at the DATA menu and figure out how the stacked data in c1 can be put into unstacked format using the subscripts in c2. (Use the column names c21 'New' and c22 'Old' for this.) What command/subcommand combination could you use to unstack the data, without the help of the menus?”

Using commands

MTB > name c21 'New' c22 'Old'

MTB > unstack c1 c21 c22;

SUBC> subscripts c2.
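A rough R counterpart of this unstacking step is sketched below. It is an illustration under stated assumptions: `potency` and `group` are hypothetical stand-ins for columns c1 and c2, and `split` is used in place of Minitab's `unstack` command.

```r
# Hypothetical R analogue of 'unstack' with 'subscripts' (names are illustrative).
potency <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6,
             9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9)   # like c1
group   <- rep(1:2, each = 10)                                       # like c2

# split() separates the stacked values by subscript, one vector per group.
parts <- split(potency, group)
new <- parts[["1"]]   # like c21 'New'
old <- parts[["2"]]   # like c22 'Old'

new
old
```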

Using Menus

DATA ► Stack ► Columns

Cursor in field "Stack the following columns"; double-click on c15, then c16, in Column List.

Select radio button "Column of current worksheet:" and type c17; after "Store subscripts in:" type c18.

Select (if not already selected) check box "Use variable names..."

Click OK

Result: Stacked data are now in c17; subscripts (text designations from the c15 and c16 labels) are now in c18.


1.2.1. [Minitab makes standard (character) and professional (pixel) graphics.] Illustrate both types of graphics:

The following boxplot and dotplot are created using standard graphics:

MTB > gstd

MTB > boxp c1;

SUBC> by c2.

Boxplot

Group

------

1 ------I + I------

------

------

2 ----I + I----

------

--+------+------+------+------+------+----Potency

9.50 9.75 10.00 10.25 10.50 10.75

MTB > dotp c1;

SUBC> by c2.

Dotplot: Potency by Group

Group

1

. . : . . : . .

-+------+------+------+------+------+-----Potency

Group

2

. : . : . : .

-+------+------+------+------+------+-----Potency

9.50 9.75 10.00 10.25 10.50 10.75

The following boxplot and dotplot are created using professional graphics:

a. “Do the boxplots show the differences between the two groups as clearly as do the dotplots? More clearly? Defend your answer.”

In general, boxplots do not suggest the shape of the distribution of the population from which the sample was chosen as clearly as dotplots do. However, in this case, the sample sizes are so small that differences between the shapes of the two distributions are not evident even in the dotplots. It is easier to see the difference in medians in the boxplots than in the dotplots, since the medians are explicitly shown in the boxplots. That the values of the observations in the two samples overlap is evident in both types of plots.

b. “Look at one of the dotplots above. Can you see exactly how many data points are represented? Now look at one of the boxplots above. Can you see how many data points are represented?”

The number of data points is not illustrated in the boxplots. Each individual point is plotted in these dotplots.

c. “Minitab's boxplots sometimes indicate the presence of outliers. Are outliers indicated for either of our groups?”

There are no outliers in either of the two groups.

d. “What descriptive statistics are used in making box plots?”

Descriptive statistics used in making boxplots are the minimum, the lower quartile, the median, the upper quartile, and the maximum. Collectively, these five numbers are called the "five-number summary" of a sample.

e. “Comment on the differences between standard-graphics and professional-graphics boxplots.”

You will need to experiment with several different datasets to see all of the differences noted here. The standard boxplot is horizontal and indicates possible and probable outliers with different symbols. The professional boxplot is vertical and does not distinguish between possible and probable outliers (it uses the same symbol for both). There are also slight differences in how the quartiles are computed for standard-graphics and professional-graphics boxplots. These differences are usually noticeable only with small samples.
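The remark about quartile computation can be illustrated in R. The sketch below applies three common quartile conventions to the Fresh sample; it does not claim to show which convention each Minitab graph type uses, only that different conventions can disagree in a sample of size 10. The names `hinges`, `q_n1`, and `q_def` are illustrative.

```r
# Illustrative only: three quartile conventions applied to the Fresh sample.
fresh <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6)

hinges <- fivenum(fresh)[c(2, 4)]                    # Tukey hinges
q_n1   <- quantile(fresh, c(0.25, 0.75), type = 6)   # positions based on (n + 1)p
q_def  <- quantile(fresh, c(0.25, 0.75))             # R default (type 7)

# One row per convention; compare the lower and upper quartiles across rows.
rbind(hinges = hinges, q_n1 = unname(q_n1), q_def = unname(q_def))
```

With larger samples the rows typically move closer together, which is why such differences are rarely visible in boxplots of large datasets.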

f. “We have given several commands above. What menu choices can be used to produce each style of boxplot?”

Since Release 14 of Minitab, the standard graphics no longer appear in the menus by default. If you wish, you may be able to use Tools ► Customize ► Menus to put the Character Graphs button on the Toolbar (depending on your computer). Then, Character Graphs ► Boxplot may be used to produce a standard boxplot with C1 in the “Variable” text box and C2 in the “By variable” text box.

To produce the professional boxplot, the menu sequence Graph ► Boxplot ► With Groups may be used with C1 in the “Graph variable” text box and C2 in the “Categorical variables for grouping” text box.


1.2.2. In R one prepares vectors for potencies of Stored and Fresh samples, finds descriptive statistics for each group, combines data into a single vector of potencies with a corresponding categorical vector of sample types, and makes stripcharts and boxplots of the data as shown below. Execute the code and show the results, and compare with corresponding results obtained in Minitab. (Note: The function as.factor designates typ as a categorical rather than a numerical variable. In this unit, the distinction in variable types is not always important because typ takes only two values. For some procedures, this distinction becomes crucial if an intended categorical variable takes more than two values.)

> summary(fresh); sd(fresh)

Min. 1st Qu. Median Mean 3rd Qu. Max.

9.80 10.20 10.40 10.37 10.60 10.80

[1] 0.3233505

> summary(stored); sd(stored)

Min. 1st Qu. Median Mean 3rd Qu. Max.

9.500 9.625 9.800 9.830 10.050 10.200

[1] 0.2406011
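The R code that produced these summaries is not reproduced in this section. A minimal sketch consistent with the problem description follows. The names `fresh`, `stored`, `potency`, and `typ` are taken from the problem text and the output in this unit; the plotting arguments are assumptions rather than the original code.

```r
# Sketch reconstructed from the problem description; plotting options are assumptions.
fresh  <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6)
stored <- c(9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9)

summary(fresh);  sd(fresh)     # descriptive statistics for each group
summary(stored); sd(stored)

potency <- c(fresh, stored)                 # single vector of potencies
typ     <- as.factor(rep(1:2, each = 10))   # 1 = fresh, 2 = stored (categorical)

stripchart(potency ~ typ, method = "stack", pch = 19)   # comparable to a dotplot
boxplot(potency ~ typ, horizontal = TRUE)               # comparable to a boxplot
```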

1.3.1. “For a two-sample design with n = 10 observations in each group and a fixed significance level α = .05, find the critical values for the two-sided pooled t test and the F test discussed above….. Compare your results with tables in your text. Verify that the square of the critical value for t is the critical value for F. In this problem, why do you need to use y = 0.975 for the t distribution and 0.95 for the F distribution? (For each distribution, draw a sketch and shade in the area corresponding to probability 0.05.)”

To find the critical t value for α = .05 for a two-sided, two-sample t test, use the following commands:

MTB > invcdf .975;

SUBC> t 18.

Inverse Cumulative Distribution Function

Student's t distribution with 18 DF

P( X <= x ) x

0.975 2.10092

Note that the argument of the invcdf command is .975 (not .95). This value is 1 – α/2, because the t test is two-tailed (also called two-sided) and α = .05 is split equally between the two tails. The result shows that t = 2.10092 is the critical value for α = .05. (A critical value is a value that separates acceptance and rejection regions.)

From Table 2 in the text, with upper-tail area .025 (that is, α/2) and df = 18, the critical value is 2.101. (The sketch of the t distribution is omitted here. It would show 2.5% in each tail of the t distribution.)
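A sketch of the t distribution can be drawn in R along the following lines. This is an illustrative possibility, not the omitted sketch itself; the plotting range, colors, and the name `tail.x` are assumptions.

```r
# Sketch of the t(18) density with 2.5% shaded in each tail (styling is an assumption).
crit <- qt(0.975, 18)                       # upper critical value
x <- seq(-4, 4, length.out = 1000)
plot(x, dt(x, 18), type = "l", lwd = 2,
     main = "Density of t(18): 2.5% in Each Tail")

tail.x <- seq(crit, 4, length.out = 250)    # grid over the right tail
lines(tail.x,  dt(tail.x, 18),  type = "h", col = "blue")   # shade right tail
lines(-tail.x, dt(-tail.x, 18), type = "h", col = "blue")   # shade left tail
abline(h = 0, col = "green"); abline(v = c(-crit, crit), col = "red")
```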

To find the critical value for the F-statistic, the following commands are used:

MTB > invcdf .95;

SUBC> F 1 18.

Inverse Cumulative Distribution Function

F distribution with 1 DF in numerator and 18 DF in denominator

P( X <= x ) x

0.95 4.41387

Note that 1 – α = .95 is used in this calculation because the F test rejects only for large values of F, so the entire 5% lies in the right tail. The critical value of F is calculated as 4.41387 for α = .05. From Table 8 in the book, with α = .05, df1 = 1 and df2 = 18, the critical value is 4.41. Note that the square root of 4.41 is 2.1, which agrees with the critical value of t found above (2.101² ≈ 4.41).
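The same two critical values, and the relationship between them, can be checked in R. The sketch below uses the base functions `qt` and `qf` as counterparts of the Minitab `invcdf` commands; the names `t.crit` and `f.crit` are illustrative.

```r
# R counterparts of the Minitab invcdf commands above.
t.crit <- qt(0.975, df = 18)             # two-sided t test, alpha = .05
f.crit <- qf(0.95, df1 = 1, df2 = 18)    # F test, all of alpha in the right tail

c(t.crit = t.crit, t.crit.squared = t.crit^2, f.crit = f.crit)
abs(t.crit^2 - f.crit) < 1e-6            # squared t critical value matches F critical value
```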
A plot of the density function of the appropriate F distribution can be made in R as follows:

x = seq(0, 6, length=1000); y = df(x, 1, 18); crit = qf(.95, 1, 18)

plot(x, y, type="l", lwd=2, main="Density of F(1, 18): Color for 5% in Right Tail")
xx = seq(crit, 6, length=500); yy = df(xx, 1, 18)

lines(xx, yy, type="h", col="blue")

abline(h=0, col="green"); abline(v=crit, col="red")

1.3.2. Consider a balanced two-sample design in which each group has n observations. Let the group totals be T1 and T2, and denote the grand total of all observations as T1 + T2 = G. Express the formulas for both the pooled t-statistic and the F-statistic in terms of this notation. Then use simple algebra to verify that the F-statistic is the square of the t-statistic.

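The "denominators" of $t^2$ and $F$ are $s_p^2 = s_w^2 = \mathrm{MS(Err)}$. The crux of the proof is to show that their "numerators" are equal:

$$\frac{n}{2}\left(\bar{Y}_{1\cdot} - \bar{Y}_{2\cdot}\right)^2 = \frac{(T_1 - T_2)^2}{2n} = n\left[\left(\bar{Y}_{1\cdot} - \bar{Y}_{\cdot\cdot}\right)^2 + \left(\bar{Y}_{2\cdot} - \bar{Y}_{\cdot\cdot}\right)^2\right],$$

where $\bar{Y}_{\cdot\cdot} = G/(2n)$ and the two terms inside brackets on the right side are equal.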
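The equality of the three "numerator" expressions can also be checked numerically. The sketch below does so for the Potency data of this unit (n = 10 per group); the variable names are illustrative, and a numerical check is of course not a substitute for the algebra.

```r
# Numerical check of the identity using the Potency data (n = 10 per group).
fresh  <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6)
stored <- c(9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9)
n  <- 10
T1 <- sum(fresh); T2 <- sum(stored); G <- T1 + T2
ybar1 <- T1 / n; ybar2 <- T2 / n; ybar <- G / (2 * n)

num.t2 <- (n / 2) * (ybar1 - ybar2)^2                  # numerator of t^2
num.T  <- (T1 - T2)^2 / (2 * n)                        # same quantity from totals
num.F  <- n * ((ybar1 - ybar)^2 + (ybar2 - ybar)^2)    # SS(Group), which has 1 df
c(num.t2, num.T, num.F)                                # all three agree

sp2 <- (var(fresh) + var(stored)) / 2                  # pooled variance (equal n)
num.F / sp2                                            # F, which equals t^2
```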

1.3.3. Starting with the same four lines of R code as in 1.2.2 (in green below), one can perform the pooled two-sample t test and the one-way ANOVA as follows. Show the results and compare with the corresponding Minitab results. (The lines in green need not be repeated in a continuous R session.)

In the R code below, the symbol ~ can be considered to mean "by."

> t.test(potency ~ typ, var.equal=T)

Two Sample t-test

data: potency by typ

t = 4.2368, df = 18, p-value = 0.0004959

alternative hypothesis: true difference in means is not equal to 0

95 percent confidence interval:

0.2722297 0.8077703

sample estimates:

mean in group 1 mean in group 2

10.37 9.83

> anova(lm(potency ~ typ))

Analysis of Variance Table

Response: potency

Df Sum Sq Mean Sq F value Pr(>F)

typ 1 1.45800 1.45800 17.951 0.0004959 ***

Residuals 18 1.46200 0.08122

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
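The agreement between the two procedures can be confirmed directly from the fitted objects. The following sketch rebuilds `potency` and `typ` from the data of this unit and compares the squared t statistic with the F statistic; the object names `tt`, `aov.tab`, `t.stat`, and `f.stat` are illustrative.

```r
# Compare the pooled t test and the one-way ANOVA on the same data.
potency <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6,
             9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9)
typ <- as.factor(rep(1:2, each = 10))

tt      <- t.test(potency ~ typ, var.equal = TRUE)   # pooled two-sample t test
aov.tab <- anova(lm(potency ~ typ))                  # one-way ANOVA table

t.stat <- unname(tt$statistic)
f.stat <- aov.tab[["F value"]][1]
c(t.squared = t.stat^2, F = f.stat)                  # the two values agree
c(p.t = tt$p.value, p.F = aov.tab[["Pr(>F)"]][1])    # and so do the P-values
```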

1.4.1. “In this ANOVA, the (ordinary) residual of an observation is its difference from its group mean. Using menus, in the one-way ANOVA procedure select the option to store residuals. Verify the values of the residuals for observations #1, #5, and #11 of the stacked data by hand. Make a box plot of the residuals. Does it indicate any outliers?”

The one-way ANOVA table is as follows:

One-way ANOVA: Potency versus Group

Source DF SS MS F P

Group 1 1.4580 1.4580 17.95 0.000

Error 18 1.4620 0.0812

Total 19 2.9200

S = 0.2850 R-Sq = 49.93% R-Sq(adj) = 47.15%

Individual 95% CIs For Mean Based on

Pooled StDev

Level N Mean StDev ----+------+------+------+-----

1 10 10.370 0.323 (------*------)

2 10 9.830 0.241 (------*------)

----+------+------+------+-----

9.75 10.00 10.25 10.50

Pooled StDev = 0.285


The calculated residuals are as follows:

Sample / Group / RESI1
1 / 1 / -0.17
2 / 1 / 0.13
3 / 1 / -0.07
4 / 1 / 0.43
5 / 1 / -0.57
6 / 1 / 0.23
7 / 1 / 0.33
8 / 1 / -0.17
9 / 1 / -0.37
10 / 1 / 0.23
11 / 2 / -0.03
12 / 2 / -0.23
13 / 2 / 0.27
14 / 2 / 0.37
15 / 2 / 0.27
16 / 2 / -0.13
17 / 2 / -0.33
18 / 2 / -0.23
19 / 2 / -0.03
20 / 2 / 0.07

The group 1 mean is 10.37 and the group 2 mean is 9.83. Therefore,

·  residual #1 is 10.2 – 10.37 = –0.17,

·  residual #5 is 9.8 – 10.37 = –0.57, and

·  residual #11 is 9.8 – 9.83 = -0.03,

as verified in the above table.
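The same verification can be sketched in R, where the residuals of the one-way model are each observation minus its group mean. The names `res` and `by.hand` are illustrative, and `potency` and `typ` are rebuilt here so that the block is self-contained.

```r
# Residuals of the one-way ANOVA: observation minus its group mean.
potency <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6,
             9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9)
typ <- as.factor(rep(1:2, each = 10))

res     <- unname(residuals(lm(potency ~ typ)))   # residuals from the fitted model
by.hand <- potency - ave(potency, typ)            # same values computed directly

round(res[c(1, 5, 11)], 2)                        # residuals #1, #5, and #11
max(abs(res - by.hand))                           # essentially zero
```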

The boxplot of the 20 residuals is as follows:

Boxplot

------

------I + I------

------

--+------+------+------+------+------+----RESI1

-0.60 -0.40 -0.20 0.00 0.20 0.40

Note that no outliers are indicated on the boxplot.
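An outlier check of the same kind can be sketched in R with `boxplot.stats`, which flags points lying more than 1.5 times the spread of the hinges beyond either hinge. Whether this rule matches Minitab's flagging rule exactly is an assumption, so the sketch is a cross-check rather than a reproduction of the Minitab plot.

```r
# Boxplot-based outlier check on the 20 residuals (rule: 1.5 x hinge spread).
potency <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6,
             9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9)
typ <- as.factor(rep(1:2, each = 10))
res <- potency - ave(potency, typ)        # residuals: observation minus group mean

bstats <- boxplot.stats(res)              # default coef = 1.5
bstats$stats                              # whisker ends, hinges, and median
bstats$out                                # flagged points; empty if none

boxplot(res, horizontal = TRUE, xlab = "RESI1")
```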

1.4.2. “Use the menu path STAT ► Basic statistics ► Normality test to test the null hypothesis that the residuals fit a normal distribution (against the alternative that they are not normal). In the resulting normal probability plot, normal residuals should nearly fit a straight line. Do ours? What is the P-value of the Anderson-Darling test of normality?”

The normal probability plot of the residuals is as follows:

The residuals do not follow a straight line very well for residual values between 0 and 0.25. However, the Anderson-Darling P-value is 0.572, which is greater than .05, so the null hypothesis of normality is not rejected at the 5% level.
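A comparable check can be sketched in base R. Note the substitution: base R does not include the Anderson-Darling test, so the sketch uses the Shapiro-Wilk test (`shapiro.test`) instead, and its P-value should not be expected to equal Minitab's 0.572. An Anderson-Darling test is available in add-on packages such as `nortest` (function `ad.test`), which is not assumed to be installed here.

```r
# Normal probability plot and a normality test for the residuals.
# Shapiro-Wilk is used here as a stand-in; it is NOT the Anderson-Darling test.
potency <- c(10.2, 10.5, 10.3, 10.8, 9.8, 10.6, 10.7, 10.2, 10.0, 10.6,
             9.8, 9.6, 10.1, 10.2, 10.1, 9.7, 9.5, 9.6, 9.8, 9.9)
typ <- as.factor(rep(1:2, each = 10))
res <- potency - ave(potency, typ)     # residuals: observation minus group mean

qqnorm(res); qqline(res)               # points near the line suggest normality
sw <- shapiro.test(res)                # Shapiro-Wilk test of normality
sw$p.value                             # compare with the chosen significance level
```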

1.4.3. “Test the hypothesis that the two groups come from populations with equal variances against the two-sided alternative. Use the cdf command to find the P-value of this test. (The Fmax-test for t treatment groups is equivalent to the F test if t = 2. Verify this for the Potency data. Tables of the Fmax-distribution are available in Ott/Longnecker, and in some other texts.)”