Module 14 – Inference for Relationships (Continued from Module 13)

New cases in Module 14: C → C and Q → Q

Case of C → C

For two-variable situations of Categorical → Categorical,

we display the data in a two-way table (with rows and columns) – as we did in Module 2.

We need the chi-square test statistic.

•In this context, chi is the Greek letter that looks a lot like a “floppy X”. It is written χ.

And chi-square is written χ².

Pronunciation guide – important:

• How to pronounce the Greek letter chi: it is pronounced like the “ki” at the beginning of the English word “kite”.

• There is another word spelled the same – chi – that means something entirely different – a life force or life energy. That word is pronounced like the “chee” in the English word “cheese”.

•The chi-square test statistic we need here summarizes the differences between:

the observed data counts (in the two-way table), and

the expected counts – that would be expected if H0 were true.

To calculate the chi-square statistic, we need, for each cell of the two-way table, the Observed count (the data value) and the Expected count that would be there if H0 were true. Then for each cell the difference between those two numbers is found, that difference is squared, and the result is divided by the Expected count. Finally all those numbers are summed. In symbols, that is:

χ² = Σ [ (Observed count − Expected count)² / Expected count ],   summed over all the cells of the two-way table.

Notice:

-When H0 is true, we expect the Observed counts to be similar to the Expected counts. When they are similar, then their differences are small – which means that the numerators of the fractions in the sum are small – which results in the entire sum being small. So, a small value of the test statistic is consistent with H0 being true.

-When the test statistic is large, it is because the fractions added together in the sum are large, and they are large when the differences between Observed and Expected values are large. That happens when the Observed values are not consistent with the Expected values – which indicates that H0 is probably not true.
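
To make the formula concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available) that computes the expected counts and the chi-square statistic for a two-way table of observed counts. The counts used are the ones from Example A later in this module.

import numpy as np
from scipy.stats import chi2

# Observed counts from the two-way table (rows = response categories,
# columns = explanatory categories).  These are the Example A counts.
observed = np.array([[28, 50],
                     [11,  3]])

# Expected counts if H0 (no relationship) were true:
# (row total * column total) / grand total, for each cell.
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
grand_total = observed.sum()
expected = row_totals @ col_totals / grand_total

# Condition check: all expected counts should be larger than 5.
print("All expected counts > 5?", bool((expected > 5).all()))

# Chi-square statistic: sum over all cells of (Observed - Expected)^2 / Expected
chi_sq = ((observed - expected) ** 2 / expected).sum()

# p-value from the chi-square distribution with (rows-1)(columns-1) degrees of freedom
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)
p_value = chi2.sf(chi_sq, df)

print(f"chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.4f}")

Run on those counts, this reproduces the test statistic of about 8.851 and the p-value of about .003 reported in the Example A answers below.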

•The conditions that are required for the chi-square test to be safely used are:

i)The sample should be random.

ii)All the expected counts must be larger than 5. (This is a conservative version of the requirement. The test may be safe when weaker conditions are met, but this version is easy to check, so we will typically use it.)

Chi-Square Distribution - graphs

There is a whole family of chi-square distributions, one for each number of “degrees of freedom”.

All the chi-square distributions are skewed positively (to the right). For higher degrees of freedom the skew is less.

[Graphs of chi-square density curves for several degrees of freedom (df) appear here in the original; the vertical axis of each graph is in hundredths.]
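
If you want to draw curves like these yourself, here is a minimal plotting sketch (assuming SciPy and Matplotlib are available); it overlays the chi-square density for several degrees of freedom so you can see the right skew and how it lessens as df grows.

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import chi2

x = np.linspace(0.01, 20, 400)

# One density curve per choice of degrees of freedom
for df in (1, 3, 5, 10):
    plt.plot(x, chi2.pdf(x, df), label=f"df = {df}")

plt.xlabel("chi-square value")
plt.ylabel("density")
plt.title("Chi-square distributions: right-skewed, less skewed as df grows")
plt.legend()
plt.show()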

Example A: Two Categorical Variables (C → C)

It is thought that having a pet may affect the life quality and life length of people with chronic health problems. A researcher got a random sample of people who had a chronic illness and noted whether they had a pet or not, and then noted whether they survived for one year.

a) Was this an experiment? Why or why not?

b)Researchers thought that pet ownership might improve survival. So which is the explanatory variable (pet ownership or survival)?

Suppose the data is this:

                          Pet Ownership
Patient Status            No    Yes
Survived (a year)         28    50
Died (within the year)    11     3

Actual data from Erika Friedmann et al., “Animal companions and one-year survival of patients after discharge from a coronary care unit,” Public Health Reports, 96 (1980), pp. 307–312, as reported in Moore, D. S., & McCabe, G. P. (1999). Introduction to the Practice of Statistics (3rd ed.). New York: W. H. Freeman and Company, p. 646.

c)Compute appropriate percentages – based on the explanatory variable.

What effect does it look like pet ownership might have?

d)We will do a Chi-Square Test for Independence, to test if the two variables are independent of each other or if they have a relationship between them.

H0:

Ha:

e)Use the calculator to do the Chi-Square Test of Independence

f)Make conclusions, explaining why, and then stating it in clear English.

Special situation: two categories for each variable

The case of the relationship of categorical variables C → C, when there are exactly 2 categories for each variable, is often called a comparison of two proportions. In Example A, there were 2 categories for each variable. The two proportions we were interested in were the proportion who survived among those who did not own a pet and the proportion who survived among those who did own a pet.

The hypothesis test for this situation can be done by other methods (not using chi-square but rather the z test statistic), and is then called a Test of two Proportions.

Whichever test is used, the conclusion will be the same. That is – for a situation of C → C with 2 categories for each variable, you can use either the chi-square test or the z test of proportions and get the same conclusion (as illustrated in the sketch below).
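
Here is a minimal sketch of that equivalence for a 2×2 table (assuming SciPy is available), using the Example A counts: the two-proportion z statistic is computed by hand, its square equals the chi-square statistic, and the two tests give the same p-value.

from math import sqrt
from scipy.stats import norm, chi2

# Example A counts: survivors and group sizes for the "no pet" and "pet" groups
survived = [28, 50]
n = [39, 53]

p1, p2 = survived[0] / n[0], survived[1] / n[1]   # sample proportions
p_pool = sum(survived) / sum(n)                   # pooled proportion under H0

# Two-proportion z statistic (pooled standard error)
se = sqrt(p_pool * (1 - p_pool) * (1 / n[0] + 1 / n[1]))
z = (p1 - p2) / se
p_value_z = 2 * norm.sf(abs(z))                   # two-sided p-value

# For a 2x2 table, z squared is the chi-square statistic (df = 1)
p_value_chi = chi2.sf(z ** 2, 1)

print(f"z = {z:.3f}, z^2 = {z**2:.3f}")
print(f"p-value (z test) = {p_value_z:.4f}, p-value (chi-square) = {p_value_chi:.4f}")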

Example B: Same situation as Example A but different data values

Suppose the data had been this [this is not actual data, just an example]:

                          Pet Ownership
Patient Status            No    Yes
Survived (a year)         63    65
Died (within the year)    12     8

a)Compute appropriate percentages – based on the explanatory variable.

What effect does it look like pet ownership might have?

b)Use the calculator to do the Chi-Square Test of Independence. What is the test statistic value and the p-value?

c)Make conclusions, explaining why, and then stating it in clear English.

Answers

Example A: Two Categorical Variables (C  C)

It is thought that having a pet may affect the life quality and life length of people with chronic health problems. A researcher got a random sample of people who had a chronic illness and noted whether they had a pet or not, and then noted whether they survived for one year.

a) Was this an experiment? Why or why not? This is not an experiment since there was no intervention. Rather, the people were simply observed (asked about pet ownership, and death records were checked).

b)Researchers thought that pet ownership might improve survival. So which is the explanatory variable (pet ownership or survival)? Pet ownership is the explanatory variable. It might explain why some people survive longer than others. Survival/death is the response variable.

                          Pet Ownership
Patient Status            No                     Yes
Survived (a year)         28   (28/39 = 71.8%)   50   (50/53 = 94.3%)
Died (within the year)    11   (11/39 = 28.2%)    3   (3/53 = 5.7%)
Total                     39   (100% of those not owning pets)   53   (100% of those owning pets)

c) Compute appropriate percentages – based on the explanatory variable.

What effect does it look like pet ownership might have? Since the survival rate of those owning pets (94.3%) is considerably higher than the survival rate of those not owning pets (71.8%), it seems that pet ownership might be related to longer survival.

d)We will do a Chi-Square Test for Independence, to test if the two variables are independent of each other or if they have a relationship between them.

H0: There is no relationship between pet ownership and survival. That is, the two variables are independent.

Ha: There is a relationship between pet ownership and survival. That is, the two variables are not independent.

e)Use the calculator to do the Chi-Square Test of Independence.

Enter matrix A with first row 28, 50 and second row 11, 3. Then run the χ²-Test. Check the output matrix of Expected values – all the counts are above 5, so we can use the chi-square test.

Results: test statistic χ² = 8.851, p-value = 0.0029 ≈ .003.
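
If you want to check the calculator with software, here is a minimal sketch using SciPy’s chi2_contingency (an assumption is that SciPy is installed); correction=False turns off the Yates continuity correction so the output matches the TI’s Pearson chi-square.

from scipy.stats import chi2_contingency

# Observed counts from Example A (rows: survived / died; columns: no pet / pet)
observed = [[28, 50],
            [11,  3]]

# correction=False gives the plain Pearson chi-square, matching the TI output
chi_sq, p_value, df, expected = chi2_contingency(observed, correction=False)

print(f"chi-square = {chi_sq:.3f}, df = {df}, p-value = {p_value:.4f}")
print("expected counts (all should be above 5):")
print(expected)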

f)Make conclusions, explaining why, and then stating it in clear English.

We reject H0 and support Ha because the p-value of .003 is very small (the p-value is less than alpha). So it is highly unlikely that we’d obtain data like this (or more extreme) if the null hypothesis were true and the variables were not related.

Various wordings for the Conclusion:

- Based on the evidence, there is a relationship between pet ownership and survival from illness.

- The data support that pet ownership is related to survival rates; owning a pet improves survival rates.

- The data is significant in showing that pet ownership improves survival of chronic illness.

Example B: Same situation as Example A but different data values

                          Pet Ownership
Patient Status            No                   Yes
Survived (a year)         63   (63/75 = 84%)   65   (65/73 = 89%)
Died (within the year)    12   (12/75 = 16%)    8   (8/73 = 11%)
Total                     75                   73

a) Compute appropriate percentages – based on the explanatory variable.

What effect does it look like pet ownership might have? The survival rate for those owning pets is higher than for those who do not own pets, but it is not a lot higher (89% vs. 84%). So the difference might be significant or might be due to chance.

b)Use the calculator to do the Chi-Square Test of Independence.
Expected Value matrix has all values above 5, so this test may be used.

Test statistic value = .804 and p-value = .36979 ≈ .370.

c)Make conclusions, explaining why, and then stating it in clear English.

We do not reject H0 because the p-value of .370 is large (larger than any sensible significance level).

There is not sufficient evidence that there is a relationship between pet ownership and survival from illness. The data are not significant. The data are not sufficient to show that pet ownership improves survival rates.

TI Calculator instructions for Chi-Square test of independence (case C → C)

from OLI p. 254 “Learn By Doing”

  • Enter the data in Matrix A:
  • Press MATRIX (2ND/x-1 on TI 84; MATRIX button on old TI83)
  • Press the right arrow twice to choose EDIT.
  • Choose 1:[A].
  • NOTE: Your screen may be different if another matrix was previously entered in [A].
  • You need to enter: number of rows X number of columns.
  • The rows entry will be highlighted. Enter the number of categories for the explanatory variable.
  • Use the right arrow (or ENTER) to move to the columns entry.
  • Note that the correct number of rows is now displayed.
  • Enter the number of categories for the response variable.
  • Use the down arrow (or ENTER) to move to the entry for the first row, first column.
  • Note that the correct number of columns is now displayed.
  • Enter counts (no totals) from your two-way table.
  • Choose STAT/TESTS/C: χ²-Test.
  • Enter the correct matrix for the observed values (the values you just entered from the two-way table) and any matrix for the expected values calculated by the TI.
  • Note: You can always use the default values of [A] for observed and [B] for expected.
  • To use matrices other than [A] or [B], position the cursor to the right of observed: (or expected) and press MATRIX/NAMES/ choose a matrix name.
  • Choose Calculate, then ENTER.
  • Note: degrees of freedom df = (number of rows - 1)(number of columns - 1)
  • To see the expected values calculated by your TI:
  • Choose MATRIX (2ND/x-1 on TI 84; MATRIX button on old TI83)
  • Press the right arrow twice to choose EDIT.
  • Choose 2:[B].

Case of Q → Q

For two-variable situations of Quantitative → Quantitative

• For Q → Q, we use a scatterplot, the correlation coefficient, and the regression line. You may need to go back to Module 2 and review these topics so that this section will make sense.

• On OLI p. 258 you really should do the “learn by doing” (really – do both of them!). It is probably easiest to choose the directions for using Excel, but of course you can choose any software you like. (If your Excel does not yet have the Analysis ToolPak installed, you can use the Excel “?” help feature to get directions for installing it – it is free.)

On OLI p. 259 – do that “learn by doing” – read the answers carefully.

• Very important point made on OLI p. 258:

It is important to distinguish between the information provided by r and by the p-value.

• The correlation coefficient r informs us about the strength of the linear relationship in the data.

r close to +1 or -1 for a strong linear relationship,

r close to 0 for a weak linear relationship,

r close to +0.5 or −0.5 for a moderate linear relationship

• The regression p-value informs us about the strength of evidence that there is a linear relationship in the population from which the data were obtained.

- A p-value that is very small (definitely smaller than the significance level) provides strong evidence that there is a linear relationship. A p-value of .0001 or even .001 provides very strong evidence of a linear relationship.

- A p-value that is near the significance level provides only moderate evidence that there is a linear relationship. If the significance level is .05 and the p-value is .048, for example, there is moderate evidence of a linear relationship.

- A p-value that is “large” (larger than the significance level) does not provide evidence of a linear relationship.
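
As a small illustration, here is a sketch (assuming SciPy is available, and using made-up hours/score data) that computes both r and the regression p-value for one data set; r describes the strength of the linear pattern in the data, while the p-value describes the evidence for a linear relationship in the population.

from scipy.stats import linregress

# Illustrative (hypothetical) data: hours studied vs. exam score
hours = [1, 2, 3, 4, 5, 6, 7, 8]
score = [52, 55, 61, 58, 66, 71, 70, 78]

result = linregress(hours, score)
print(f"r = {result.rvalue:.3f}")        # strength of the linear pattern in the data
print(f"p-value = {result.pvalue:.4f}")  # evidence of a linear relationship in the population

With a much larger sample, even a small r can come with a very small p-value, which is the pattern in example f) below.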

Examples: What is the evidence for what type of relationship in each of these?

a)If r = .9 and p-value = .01

Strong evidence (from p=.01) of a strong linear relationship (from r = .9)

b)If r = .3 and p-value = .01

Strong evidence (from p=.01) of a moderately weak linear relationship (from r = .3)

c)If r = .9 and p-value = .048

Moderate evidence (from p=.048) of a strong linear relationship (from r=.9)

d)If r = .3 and p-value = .048

Moderate evidence (from p=.048) of a moderately weak linear relationship (from r=.3)

e)If r = .9 and p-value = .28

The data do not provide evidence of a linear relationship (since p=.28 is a large p). [in this case, it doesn’t matter how strong the linear relationship is since we don’t have evidence that there is a linear relationship.]

f)If r = .1 and p-value = .002

Strong evidence (from p=.002) of a weak linear relationship (from r = .1)
