ANOVA and ANCOVA Examples from NKW Text

Stat 502

Homework 7

Assigned 11/15/07

Due 11/27/07

1. Hospital length of stay. The Study on the Efficacy of Nosocomial Infection Control (SENIC) was designed to determine whether infection surveillance and control programs in the U.S. have reduced the rates of nosocomial (hospital-acquired) infection. We are interested here in knowing if medical school affiliation (yes/no), and region of the U.S. are associated with mean length of hospital stay in days. 113 hospitals of the 338 participating in the study were randomly chosen and analyzed.

Data: The data in the file SENIC.dat consist of a random sample of 113 hospitals selected from the original 338 hospitals surveyed.

Each line of the data set has an identification number and provides information on 11 other variables for a single hospital. The data presented here are for the 1975-76 study period. The 12 variables are:

Variable number / Variable name / Description
1 / Identification number / 1-113
2 / Length of stay / Average length of stay of all patients in hospital (in days)
3 / Age / Average age of patients (in years)
4 / Infection risk / Average estimated probability of acquiring infection in hospital (in percent)
5 / Routine culturing ratio / Ratio of number of cultures performed to number of patients without signs or symptoms of hospital-acquired infection, times 100
6 / Routine chest X-ray ratio / Ratio of number of X-rays performed to number of patients without signs or symptoms of pneumonia, times 100
7 / Number of beds / Average number of beds in hospital during study period
8 / Medical school affiliation / 1=Yes, 2=No
9 / Region / Geographic region, where: 1=NE, 2=NC, 3=S, 4=W
10 / Average daily census / Average number of patients in hospital per day during study period
11 / Number of nurses / Average number of full-time equivalent registered and licensed practical nurses during study period (number full time plus one half the number part time)
12 / Available facilities and services / Percent of 35 potential facilities and services that are provided by the hospital

For this analysis you will only be concerned with three columns in the provided dataset: the response, mean length of stay (LOS), and two potentially predictive factors, medical school affiliation and geographic region.

(a) Explore the data graphically and choose a scale for the analysis of LOS.

(b) Compute the marginal means for each region, and the marginal means for each level of medical school affiliation.

(c) Compute the least-squares means for each of the factors, and describe how they differ from the marginal means.

(d) Perform an additive ANOVA decomposition of the data. Obtain standard errors for the least-squares means for Regions, and obtain p-values for all pairwise comparisons of Regions.

(e) Suppose you were to randomly sample two hospitals, one from Region 1 and one from Region 2. Based on this dataset, what is the expected difference in length of stay between these two hospitals? Note: here you are computing the expected difference without regard for medical school affiliation, despite the fact that it may belong in the model. Explain your answer.

(f) Suppose you were to randomly sample two hospitals from among those that are not affiliated with medical schools, one from Region 1 and one from Region 2. Based on this dataset, what is the expected difference in length of stay between these two hospitals? Note: the question is whether the answer here differs from that in part (e); that is, does restricting the comparison to hospitals with a particular medical school affiliation change the expected difference? Explain your answer.
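
As a rough illustration of the distinction asked about in parts (b) and (c), the sketch below contrasts marginal means with least-squares means on a small made-up unbalanced layout. The observations and the function names are hypothetical and are not taken from SENIC.dat. The sketch also assumes the simplest case, in which the least-squares mean for a region is the unweighted average of that region's cell means (the form it takes when the model includes the interaction and no cell is empty); under the additive model of part (d) the least-squares means would instead come from the fitted values.

```python
from collections import defaultdict

# Made-up toy observations: (region, affiliation, length of stay).
# These are NOT SENIC values; they only illustrate an unbalanced design.
toy = [
    ("R1", "yes", 12.0), ("R1", "yes", 14.0), ("R1", "yes", 13.0),
    ("R1", "no", 9.0),
    ("R2", "yes", 11.0),
    ("R2", "no", 8.0), ("R2", "no", 10.0), ("R2", "no", 9.0),
]


def mean(values):
    return sum(values) / len(values)


def marginal_means(rows):
    """Plain average of all observations in each region."""
    by_region = defaultdict(list)
    for region, _, y in rows:
        by_region[region].append(y)
    return {region: mean(ys) for region, ys in by_region.items()}


def ls_means_from_cells(rows):
    """Unweighted average of the cell means within each region."""
    cells = defaultdict(list)
    for region, affiliation, y in rows:
        cells[(region, affiliation)].append(y)
    by_region = defaultdict(list)
    for (region, _), ys in cells.items():
        by_region[region].append(mean(ys))
    return {region: mean(ms) for region, ms in by_region.items()}


print(marginal_means(toy))       # {'R1': 12.0, 'R2': 9.5}
print(ls_means_from_cells(toy))  # {'R1': 11.0, 'R2': 10.0}
```

In this toy layout Region 1 is dominated by affiliated hospitals and Region 2 by unaffiliated ones, so the marginal means mix the region effect with the affiliation mix, while the least-squares means weight the two affiliation levels equally.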

2. Analysis of covariance. The new cholesterol-lowering supplement, Fibralo, was studied in a double-blind study against the marketed reference supplement, Gemfibrozil, in 34 non-insulin dependent diabetic (NIDDM) patients. One of the study's objectives was to compare the mean decrease in triglyceride levels between groups. The degree of glycemic control, measured by hemoglobin A1c levels (HbA1c), was thought to be an important factor in response to the treatment. This covariate was measured at the start of the study and is provided along with the percent changes in triglycerides from pre-treatment to the end of the 10-week trial in the dataset diabetes.dat.

Note that this problem is quite analogous to the one in the lecture notes. Despite my emphasis on explicitly accounting for a pre-treatment score as a covariate, here we look only at % change in triglycerides. The only covariate is the hemoglobin level, not the pre-treatment triglyceride level.

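The ANCOVA model provided in the notes was written

$y_{ij} = \mu_i + b\,x_{ij} + \epsilon_{ij}$. This can be rewritten with mean-centered covariates as $y_{ij} = \mu_i^{*} + b\,(x_{ij} - \bar{x}_{\cdot\cdot}) + \epsilon_{ij}$, where $\mu_i^{*} = \mu_i + b\,\bar{x}_{\cdot\cdot}$. With this in mind the “adjusted means” are defined as

$\bar{y}_{i\cdot}^{\,adj} = \bar{y}_{i\cdot} - \hat{b}\,(\bar{x}_{i\cdot} - \bar{x}_{\cdot\cdot})$, where $\hat{b}$ is the least squares estimate of the slope. This is the adjustment to the simple mean for the difference between the covariate mean for the ith group and the overall covariate mean.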
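
The adjustment described above can be sketched numerically. The following is a minimal illustration on made-up numbers, not the diabetes.dat data, and it assumes the common slope is estimated by the pooled within-group least squares formula (pooled Sxy divided by pooled Sxx). The function names and the groups "A" and "B" are hypothetical.

```python
def mean(values):
    return sum(values) / len(values)


def pooled_slope(groups):
    """Common within-group slope: pooled Sxy divided by pooled Sxx."""
    sxy = 0.0
    sxx = 0.0
    for xs, ys in groups.values():
        xbar, ybar = mean(xs), mean(ys)
        sxy += sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        sxx += sum((x - xbar) ** 2 for x in xs)
    return sxy / sxx


def adjusted_means(groups):
    """Group mean minus slope times (group covariate mean - overall covariate mean)."""
    b_hat = pooled_slope(groups)
    all_x = [x for xs, _ in groups.values() for x in xs]
    grand_xbar = mean(all_x)
    return {
        name: mean(ys) - b_hat * (mean(xs) - grand_xbar)
        for name, (xs, ys) in groups.items()
    }


# Made-up toy data: each group maps to (covariate values, responses).
toy_groups = {
    "A": ([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]),
    "B": ([3.0, 4.0, 5.0], [5.0, 7.0, 9.0]),
}

print(pooled_slope(toy_groups))    # 2.0
print(adjusted_means(toy_groups))  # {'A': 6.0, 'B': 5.0}
```

Here the simple means are 4 for A and 7 for B, but B also has the larger covariate mean. After adjusting both groups to the overall covariate mean of 3, the ordering reverses.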

(a) Plot these data and compute a test of the assumption that the slope “b” is the same for both supplement groups.

(b) Using the fact that the least squares estimate of the slope, $\hat{b}$, and the group mean $\bar{y}_{i\cdot}$ are independent, write down an expression for the standard error of the adjusted mean and evaluate it.

(c) Similarly, give an expression for the standard error of the difference in adjusted means and use this to compute a 95% confidence interval for this difference. [Note: I want you to give an expression for this standard error from the definition of adjusted means, even though you can obtain the numerical estimate of the standard error of the difference directly from computer output.]

(d) Carry out a test to evaluate the statistical significance of the difference in (adjusted) mean responses between the two supplement treatments. Is there a difference in mean responses between supplements? [Note: whether or not it makes sense to talk about THE adjusted mean depends on the decision to accept the hypothesis of equal slopes in part (a).]
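
One common way to approach the equal-slopes question in part (a) is an extra-sum-of-squares F test that compares a model with a separate slope in each group against a model with a single common slope. The sketch below shows only the arithmetic, on made-up numbers rather than diabetes.dat; the function names are hypothetical, and this is not necessarily the procedure used in the lecture notes. It returns the F statistic and its degrees of freedom; a p-value would still need to be looked up from the F distribution.

```python
def mean(values):
    return sum(values) / len(values)


def corrected_sums(xs, ys):
    """Return (Sxx, Sxy, Syy) computed about the group's own means."""
    xbar, ybar = mean(xs), mean(ys)
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    syy = sum((y - ybar) ** 2 for y in ys)
    return sxx, sxy, syy


def equal_slopes_f(groups):
    """F statistic for H0: all groups share one slope (intercepts may differ)."""
    k = len(groups)
    n_total = sum(len(xs) for xs, _ in groups.values())
    sums = [corrected_sums(xs, ys) for xs, ys in groups.values()]

    # Residual SS when each group has its own intercept and slope.
    sse_separate = sum(syy - sxy ** 2 / sxx for sxx, sxy, syy in sums)

    # Residual SS when groups have their own intercepts but one common slope.
    pooled_sxx = sum(s[0] for s in sums)
    pooled_sxy = sum(s[1] for s in sums)
    pooled_syy = sum(s[2] for s in sums)
    sse_common = pooled_syy - pooled_sxy ** 2 / pooled_sxx

    df_num = k - 1            # extra slope parameters in the larger model
    df_den = n_total - 2 * k  # residual df of the separate-slopes model
    f_stat = ((sse_common - sse_separate) / df_num) / (sse_separate / df_den)
    return f_stat, df_num, df_den


# Made-up toy data: each group maps to (covariate values, responses).
toy_groups = {
    "A": ([1.0, 2.0, 3.0, 4.0], [1.0, 3.0, 2.0, 4.0]),
    "B": ([1.0, 2.0, 3.0, 4.0], [2.0, 1.0, 4.0, 7.0]),
}

f_stat, df_num, df_den = equal_slopes_f(toy_groups)
print(f_stat, df_num, df_den)
```

A large F relative to the F distribution with (df_num, df_den) degrees of freedom would cast doubt on the common-slope assumption, and with it on the interpretation of a single adjusted mean per group.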