Trinity College, Dublin: Introduction to Statistics

Generic Skills Programme: Computer Laboratory 8

Trinity College, Dublin

Generic Skills Programme

Statistics for Research Students

Laboratory 8: Feedback

1 One-sample tests and confidence intervals for proportions

1.1 Assess target achievement

Heard of product:

Was the target achieved? Summarise the results in terms of estimated percentage achieved, confidence interval and significance test.

No. The estimated percentage achieved was 79%, 95% confidence interval was 76% to 82%. This does not cover 90%. Equivalently, the Z statistic value was -9.15, very highly significant.

Bought product:

Was the target achieved? Summarise the results in terms of estimated percentage achieved, confidence interval and significance test.

No. The estimated percentage achieved was 50%, 95% confidence interval was 46% to 54%. This does not cover 60%. Equivalently, the Z statistic value was -5.12, very highly significant.
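The two summaries above can be checked with a short calculation. The sketch below is illustrative only. It assumes the counts of 515 (heard) and 326 (bought) out of 650 that appear in the tables of this feedback, a normal approximation, the null-hypothesis standard error for Z and the sample-based standard error for the interval. The function name `one_sample_proportion` is hypothetical and is not a Minitab command.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_proportion(x, n, p0, level=0.95):
    """Return (estimate, (lower, upper), z) for a one-sample test of a proportion."""
    p_hat = x / n
    # Z uses the standard error under the null hypothesis p = p0.
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # The interval uses the standard error based on the sample proportion.
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, (p_hat - half_width, p_hat + half_width), z

heard = one_sample_proportion(515, 650, 0.90)   # target: 90% heard of product
bought = one_sample_proportion(326, 650, 0.60)  # target: 60% bought product
for label, (p_hat, (lo, hi), z) in (("Heard", heard), ("Bought", bought)):
    print(f"{label}: {100 * p_hat:.0f}%, CI {100 * lo:.0f}% to {100 * hi:.0f}%, Z = {z:.2f}")
```

Under these assumptions the calculation gives the estimates, intervals and Z values quoted in the two answers above.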

1.2 Assess percentages that heard of product by Region

Make a simple summary of the regional breakdown.

Region / Sample Size / Heard of Product / Heard of Product, %
A / 200 / 164 / 82
B / 150 / 105 / 70
C / 300 / 246 / 82
Total / 650 / 515 / 79

Summarise the test results.

Region / Sample Size / Heard of Product, % / Z / 95% Confidence Interval, %
A / 200 / 82 / –3.77 / 77 to 87
B / 150 / 70 / –8.16 / 63 to 77
C / 300 / 82 / –4.62 / 78 to 86
Total / 650 / 79 / –9.15 / 76 to 82

Compare the confidence interval widths, including that for the complete sample. Explain the differences in width.

B is widest (14), A is next (10), C is next (8), Total is narrowest (6).

The widths decrease as the sample sizes increase, because the sample size appears in the denominator of the standard error.
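The ordering of the widths can be checked directly. The sketch below assumes the usual normal-approximation interval, with half-width 1.96 times the standard error, and uses the counts from the regional table above. The helper name `ci_width` is hypothetical.

```python
from math import sqrt

# Sample sizes and "heard of product" counts, taken from the regional table above.
regions = {"A": (200, 164), "B": (150, 105), "C": (300, 246), "Total": (650, 515)}

def ci_width(n, x, z_crit=1.96):
    """Width of the normal-approximation 95% interval, as a proportion."""
    p_hat = x / n
    return 2 * z_crit * sqrt(p_hat * (1 - p_hat) / n)

widths = {name: ci_width(n, x) for name, (n, x) in regions.items()}
for name, w in sorted(widths.items(), key=lambda item: -item[1]):
    print(f"{name}: n = {regions[name][0]}, width = {100 * w:.1f} percentage points")
```

The unrounded widths (about 14.7, 10.6, 8.7 and 6.2 percentage points) differ slightly from the differences between the rounded limits quoted above, but the ordering is the same. The sample proportion also enters the standard error, which adds a little to the width for Region B.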

Compare the sample proportions for Regions A and C, compare their z-values, explain.

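The proportions are the same, so the deviations from 90 are the same; that is, the numerators of the Z statistics are the same. The Z value for Region A is smaller in magnitude because the denominator of Z, the standard error, is larger. The standard error is larger because the sample size, which appears in its denominator, is smaller (200 for Region A against 300 for Region C).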
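Written out, and assuming the usual one-sample statistic with the null-hypothesis standard error (which reproduces the Z values in the table above), the comparison for Regions A and C is:

```latex
\begin{align*}
Z   &= \frac{\hat{p} - p_0}{\sqrt{p_0 (1 - p_0) / n}}, \qquad p_0 = 0.90 \\
Z_A &= \frac{0.82 - 0.90}{\sqrt{0.90 \times 0.10 / 200}} = \frac{-0.08}{0.0212} = -3.77 \\
Z_C &= \frac{0.82 - 0.90}{\sqrt{0.90 \times 0.10 / 300}} = \frac{-0.08}{0.0173} = -4.62
\end{align*}
```

The numerators are identical; only the sample size in the standard error differs.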

1.3 Graphical display

2 Chi-Square test of homogeneity of proportions

2.1 Testing the homogeneity of regional differences

Tabulated statistics: Region, Bought

Rows: Region Columns: Bought

N Y All

A 91 109 200

99.7 100.3 200.0

B 101 49 150

74.8 75.2 150.0

C 132 168 300

149.5 150.5 300.0

All 324 326 650

324.0 326.0 650.0

Cell Contents: Count

Expected count

Pearson Chi-Square = 23.961, DF = 2, P-Value = 0.000

Report on the statistical significance of the results; focus on Pearson Chi-Square.

2 = 23.96 > 22,0.05 = 5.99. p-value < 0.0005.

The result is highly statistically significant.

Note: The Pearson Chi-Square is the commonly used approach, based on a test statistic of the generic form $\sum (\text{Observed} - \text{Expected})^2 / \text{Expected}$. The Likelihood Ratio Chi-Square reported along with the Pearson Chi-Square is an alternative which is approximately equivalent to the Pearson Chi-Square analysis, the approximation improving with increasing sample size. As the use of the Chi-Square frequency distribution for calculating critical values, p-values etc. is valid only for large samples, the use of the Likelihood Ratio method is redundant and is ignored here.
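As a rough check on the note above, both statistics can be computed from the Region by Bought table. The sketch below assumes the standard definitions: Pearson is the sum of (Observed – Expected)² / Expected, and the Likelihood Ratio statistic is 2 times the sum of Observed x ln(Observed / Expected). The variable names are illustrative.

```python
from math import exp, log

# Observed counts (N, Y) by region, from the Region x Bought table above.
observed = {"A": (91, 109), "B": (101, 49), "C": (132, 168)}

rows = list(observed.values())
row_totals = [sum(r) for r in rows]
col_totals = [sum(c) for c in zip(*rows)]
grand_total = sum(row_totals)

pearson = 0.0
likelihood_ratio = 0.0
for r, row in enumerate(rows):
    for c, obs in enumerate(row):
        expected = row_totals[r] * col_totals[c] / grand_total
        pearson += (obs - expected) ** 2 / expected
        likelihood_ratio += 2 * obs * log(obs / expected)

# With 2 degrees of freedom the Chi-Square upper tail area is exp(-x / 2).
p_value = exp(-pearson / 2)
print(f"Pearson = {pearson:.3f}, Likelihood Ratio = {likelihood_ratio:.3f}, p = {p_value:.6f}")
```

With a sample of 650 the two statistics come out close to each other, which is consistent with treating the Likelihood Ratio version as redundant here.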

Check that the Expected Buy frequencies are those shown in the Y column.

(326/650) x 200 = 100.308

(326/650) x 150 = 75.231

(326/650) x 300 = 150.462

Check that the Expected frequencies in each row add to the corresponding row sample size.

99.7 + 100.3 = 200

74.8 + 75.2 = 150

149.5 + 150.5 = 300

Check that the Expected frequencies in each column add to the corresponding column total.

99.7 + 74.8 + 149.5 = 324

100.3 + 75.2 + 150.5 = 326
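The three checks above can be carried out in one pass. This is a minimal sketch, assuming each expected count is calculated as row total x column total / grand total; it uses only the counts from the Region by Bought table.

```python
# Observed counts (N, Y) by region, from the Region x Bought table above.
observed = [(91, 109), (101, 49), (132, 168)]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count = row total x column total / grand total.
expected = [[rt * ct / grand_total for ct in col_totals] for rt in row_totals]

for row, rt in zip(expected, row_totals):
    print([round(e, 3) for e in row], "row sum:", round(sum(row), 3), "sample size:", rt)
for col, ct in zip(zip(*expected), col_totals):
    print("column sum:", round(sum(col), 3), "column total:", ct)
```

Working with unrounded expected counts avoids the small rounding differences that can appear when the one-decimal values in the Minitab output are added by hand.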

Hence, explain the number of degrees of freedom associated with Chi-Square.

The calculation of Chi-Square involves the deviations of observed frequencies from expected frequencies. (The expected frequencies are, effectively, fitted values corresponding to the null hypothesis model. Thus, Observed – Expected correspond to Residuals). The degrees of freedom apply to these deviations.

In each row, the deviations Observed – Expected sum to 0; therefore, the deviations in the second column are determined by the deviations in the first column. Since the first-column deviations also sum to 0, the first two of them determine the third. Hence, all six deviations are determined by these two (or, in fact, by any two taken from different rows), so there are 2 degrees of freedom.

More generally, this argument demonstrates that the deviations corresponding to any one row are determined by the deviations corresponding to the remaining rows, and the same holds for columns. Hence, with r rows and c columns, there are r – 1 "free" rows and c – 1 "free" columns, giving (r – 1) x (c – 1) "free" deviations.
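The degrees of freedom argument can be illustrated numerically. The sketch below takes the first-column deviations for Regions A and B as the two "free" values (an arbitrary choice made for illustration) and rebuilds the remaining four from the zero-sum constraints.

```python
# Observed counts (N, Y) by region, from the Region x Bought table above.
observed = [(91, 109), (101, 49), (132, 168)]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)
deviations = [[obs - rt * ct / grand_total for obs, ct in zip(row, col_totals)]
              for row, rt in zip(observed, row_totals)]

# Take two deviations as "free": column N for Regions A and B.
d_a, d_b = deviations[0][0], deviations[1][0]

# Column N deviations sum to 0, so the Region C value follows from A and B.
d_c = -(d_a + d_b)
# Each row sums to 0, so column Y is the negative of column N.
rebuilt = [[d, -d] for d in (d_a, d_b, d_c)]

for full, again in zip(deviations, rebuilt):
    print([round(v, 2) for v in full], [round(v, 2) for v in again])
```

Each printed pair should match, showing that two values are enough to fix all six deviations in a 3 x 2 table.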

3 Two-sample tests of proportions

3.1 A two-sample test of regional differences

Test and CI for Two Proportions: Hear?, Region2

Event = Y

Region2 X N Sample p

AC 410 500 0.820000

B 105 150 0.700000

Difference = p (AC) - p (B)

Estimate for difference: 0.12

95% CI for difference: (0.0393028, 0.200697)

Test for difference = 0 (vs not = 0): Z = 3.18 P-Value = 0.001

Test and CI for Two Proportions: Bought, Region2

Event = Y

Region2 X N Sample p

AC 277 500 0.554000

B 49 150 0.326667

Difference = p (AC) - p (B)

Estimate for difference: 0.227333

95% CI for difference: (0.140550, 0.314117)

Test for difference = 0 (vs not = 0): Z = 4.88 P-Value = 0.000

Make a report of the test results.

The difference between Regions A and C combined and Region B in the percentage that heard of the product (bought the product) is estimated to be 12% (23%); the 95% confidence interval is 4% to 20% (14% to 31%). In each case, the interval does not cover 0, so the difference is statistically significant. Equivalently, the value of the Z statistic for testing the hypothesis of no difference between the two percentages is 3.18 (4.88). In each case, the calculated Z value exceeds the critical value of 2 (or 1.96), so the result is statistically significant. Equivalently, the corresponding p-value is 0.001 (<0.0005), smaller than 0.05 in each case, so the result is statistically significant.
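The figures in this report can be reproduced from the counts in the Minitab output. The sketch below is an assumption about the calculation rather than a description of Minitab's internals: it uses a pooled standard error for the test statistic and an unpooled standard error for the interval, a combination that matches the reported Z values and interval limits. The function name `two_sample_proportions` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_proportions(x1, n1, x2, n2, level=0.95):
    """Return (difference, (lower, upper), z, p_value) for p1 - p2."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Test statistic: pooled standard error under the hypothesis of no difference.
    pooled = (x1 + x2) / (n1 + n2)
    z = diff / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Interval: unpooled standard error based on the two sample proportions.
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, (diff - half_width, diff + half_width), z, p_value

heard = two_sample_proportions(410, 500, 105, 150)   # Hear?: AC versus B
bought = two_sample_proportions(277, 500, 49, 150)   # Bought: AC versus B
for label, (diff, (lo, hi), z, p) in (("Heard", heard), ("Bought", bought)):
    print(f"{label}: difference = {diff:.3f}, CI ({lo:.4f}, {hi:.4f}), Z = {z:.2f}, p = {p:.4f}")
```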

3.2 A Chi-Square two-sample test

Tabulated statistics: Region2, Hear?

Rows: Region2 Columns: Hear?

N Y All

AC 90 410 500

B 45 105 150

All 135 515 650

Cell Contents: Count

Pearson Chi-Square = 10.097, DF = 1, P-Value = 0.001

Demonstrate the equivalence of the 2-sample Z-test and the Pearson Chi-Square test (calculate the square root of the latter).

$\sqrt{10.097} \approx 3.178$, which agrees with Z = 3.18. P-value = 0.001 as before.
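The equivalence can also be checked from the 2x2 counts. The sketch below computes the Pearson Chi-Square directly and takes its square root; it relies on the fact that, with 1 degree of freedom, the Chi-Square upper tail area equals the two-sided normal tail area beyond the square root. The variable names are illustrative.

```python
from math import erfc, sqrt

# Observed counts (N, Y) for Regions AC and B, from the Region2 x Hear? table above.
observed = [(90, 410), (45, 105)]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = sum(
    (obs - rt * ct / grand_total) ** 2 / (rt * ct / grand_total)
    for row, rt in zip(observed, row_totals)
    for obs, ct in zip(row, col_totals)
)

# With 1 degree of freedom, P(Chi-Square > x) = erfc(sqrt(x / 2)),
# which equals the two-sided normal tail area beyond sqrt(x).
p_value = erfc(sqrt(chi_square / 2))
print(f"Chi-Square = {chi_square:.3f}, square root = {sqrt(chi_square):.4f}, p = {p_value:.3f}")
```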

Identify the sample proportions of the 2-sample test with relevant entries in the 2x2 table.

p (AC) = 0.82 = 410 / 500

p (B) = 0.7 = 105 / 150

Explain the Chi-Square DF.

Once any single deviation of observed – expected is determined, the other three can be determined by subtraction from 0. More explicitly, calculating the expected frequencies as

(135/650) x 500 = 103.8 / (515/650) x 500 = 396.2
(135/650) x 150 = 31.2 / (515/650) x 150 = 118.8

leads to calculation of deviations of observed from expected as

90 – 103.8 = – 13.8 / 410 – 396.2 = 13.8
45 – 31.2 = 13.8 / 105 – 118.8 = – 13.8

showing just one value, 13.8, for the deviation.

The general formula (r – 1) x (c – 1), with r = 2 and c = 2, evaluates to 1.

4 Assessing homogeneity of patterns of proportions

Tabulated statistics: Region, Level

Rows: Region Columns: Level

B H N All

A 54.50 27.50 18.00 100.00

B 32.67 37.33 30.00 100.00

C 56.00 26.00 18.00 100.00

All 50.15 29.08 20.77 100.00

Cell Contents: % of Row

Summarise the variation between regional penetration patterns.

Regions A and C have almost identical penetration patterns: over half of the respondents (55%) had bought the product, over a quarter (27%) had heard of the product but did not buy, and 18% had never heard of the product.

By contrast, just one third (33%) of respondents in Region B had bought the product, slightly more than a third (37%) had heard of the product but did not buy, and slightly less than a third (30%) had never heard of the product.
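The row percentages behind this summary can be recomputed from the counts. The sketch below uses the counts from the Region by Level table in Section 4.2 and reads the Level codes as B = bought, H = heard but did not buy, N = never heard, an interpretation taken from the summary above. The dictionary names are illustrative.

```python
# Counts by Level (B = bought, H = heard but did not buy, N = never heard),
# taken from the Region x Level count table in Section 4.2.
counts = {"A": (109, 55, 36), "B": (49, 56, 45), "C": (168, 78, 54)}

# Add the "All" row as the column totals over the three regions.
counts["All"] = tuple(sum(col) for col in zip(*counts.values()))

# Percentage of each row total falling in each Level.
row_percent = {
    region: tuple(round(100 * c / sum(row), 2) for c in row)
    for region, row in counts.items()
}
for region, pct in row_percent.items():
    print(region, pct)
```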

4.1 Graphical display

Discuss the variation patterns.

Regions A and C have almost identical profiles. Region B has a much lower Bought percentage and much higher Never Heard and Heard but did not buy percentages.

4.2 Chi-Square test

Tabulated statistics: Region, Level

Rows: Region Columns: Level

B H N All

A 109 55 36 200

B 49 56 45 150

C 168 78 54 300

All 326 189 135 650

Cell Contents: Count

Pearson Chi-Square = 24.608, DF = 4, P-Value = 0.000

Confirm the degrees of freedom for Chi-Square; explain. Calculate the 5% critical value. Report on the result of the Pearson Chi-Square test.

There are 3 rows and 3 columns. The deviations corresponding to one of each are determined by the remaining deviations, making up two rows and two columns, that is, four entries, accounting for 4 degrees of freedom. Equivalently, (r – 1) x (c – 1) = 2 x 2 = 4.

From the Calc menu, $\chi^2_{4,\,0.05} = 9.5$.

The value of Chi-Square is 24.6, with 4 degrees of freedom. This exceeds the 5% critical value and so the result is statistically significant. Equivalently, the p-value is less than 0.0005 < 0.05. This is illustrated below.
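The test can also be checked numerically. The sketch below recomputes the Pearson Chi-Square from the count table and finds the 5% critical value by bisection. It uses the closed form for the Chi-Square upper tail area with 4 degrees of freedom, exp(-x/2) x (1 + x/2), so it applies to this table only. The helper name `upper_tail_df4` is hypothetical.

```python
from math import exp

# Observed counts by Level (B, H, N) for Regions A, B and C, from the table above.
observed = [(109, 55, 36), (49, 56, 45), (168, 78, 54)]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

chi_square = sum(
    (obs - rt * ct / grand_total) ** 2 / (rt * ct / grand_total)
    for row, rt in zip(observed, row_totals)
    for obs, ct in zip(row, col_totals)
)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

def upper_tail_df4(x):
    """P(Chi-Square > x) with 4 degrees of freedom (closed form)."""
    return exp(-x / 2) * (1 + x / 2)

# Find the 5% critical value by bisection on the upper tail area.
low, high = 0.0, 50.0
for _ in range(60):
    mid = (low + high) / 2
    if upper_tail_df4(mid) > 0.05:
        low = mid
    else:
        high = mid
critical = (low + high) / 2

print(f"Chi-Square = {chi_square:.3f}, DF = {df}, critical value = {critical:.2f}, "
      f"p = {upper_tail_df4(chi_square):.6f}")
```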
