Final Examination for MCBS Biostatistics II
July 2007
Name:______
This examination is open book. You are allowed to review any course material. Download an electronic copy of this examination from the class web-site. Please type your answer directly under each corresponding question. Answers must be written in words; SPSS tables will not be graded. Please provide SPSS figures if appropriate. When you finish, save your file with your name, and e-mail it to me at . I will confirm receipt of your file, after which you may leave the class.
Part I: Selecting statistical tests.
1. A list of univariate (or bivariate) tests and regression models is provided on page 2. Please select the most appropriate univariate (or bivariate) test and regression model for each of the following scenarios (for the choice of regression, consider the type of outcome in each scenario). If more than one answer applies, you may pick any one.
1.1. Consider a study assessing the relationship between serum cholesterol level and risk of myocardial infarction in a sample of 200 physicians. The most appropriate statistical method is:
Bivariate test ______ Regression model ______
1.2. Consider a study comparing systolic blood pressure (assumed normally distributed) between 100 coffee drinkers and 100 non-coffee drinkers. We want to assess the effect of coffee drinking on systolic blood pressure.
Bivariate test ______ Regression model ______
1.3. Consider a study assessing the association between coffee consumption and alcohol consumption, both of which are measured as number of beverages per day and are heavily skewed to the right. We want to assess the association between these two variables:
Bivariate test ______ Regression model ______
1.4. Consider a study comparing time to death from lung cancer between smokers and nonsmokers, in which 100 smokers and 100 nonsmokers are followed for 2 years. The most appropriate hypothesis test is:
Bivariate test ______ Regression model ______
1.5. Consider a randomized clinical trial of patients with type 2 diabetes assessing the association between HbA1c level and intensive management by clinical pharmacists. The control group receives standard care. HbA1c level is measured at baseline, 6 months and 12 months after enrollment. We want to compare the decline in HbA1c between the intervention and control groups using all 3 measurements (baseline, 6 months and 12 months).
Bivariate test ______ Regression model ______
Choices for univariate/bivariate test
A. Pearson chi-square test
B. Two-sample t-test
C. Log-rank test
D. Spearman rank correlation coefficient
F. Paired t-test
G. McNemar’s test
H. Kappa statistic / McNemar’s test
I. Mann-Whitney U test
J. Kruskal-Wallis test
K. Wilcoxon Signed Rank test
M. Pearson correlation coefficient
N. Repeated measures ANOVA
O. Analysis of Covariance (ANCOVA)
P. One-way ANOVA
Q. Two-way ANOVA
R. Not Applicable
Choices for regression model
A. Linear mixed effects model
B. GEE
C. Linear regression
D. Proportional odds logistic regression
F. Conditional logistic regression / GEE
G. Binary logistic regression
H. Cox proportional hazards regression
I. Linear regression with transformation of variables
L. Poisson regression
M. Not Applicable
Part II: Data Analysis
1. SUPPORT800.sav contains 800 randomly selected critically ill patients from the SUPPORT study (Knaus WA, Annals of Internal Medicine (1995) 122:191-203). We want to identify factors influencing total hospital cost (totcst).
1.1 We want to assess the association between total hospital cost and serum albumin at day 3 (variable: alb). Assess the regression residuals to validate the model assumptions for a linear regression model with the original (non-transformed) total hospital cost. If the assumptions are violated, attempt an appropriate transformation of total hospital cost and re-assess the regression residuals.
1.2 What is the R² for your final regression model? Explain it in biological terms. Provide a scatterplot with the regression line and its 95% confidence interval from your final model. State your conclusion on the association between total hospital cost and serum albumin level at day 3 in biological terms.
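As an illustration of the transformation and R² ideas in question 1, the following minimal Python sketch fits a simple least-squares line to raw and log-transformed cost. The albumin and cost values are hypothetical and are not taken from SUPPORT800.sav; the function name `simple_ols` is also made up for this sketch. The exam itself is expected to be done in SPSS.

```python
import math

def simple_ols(x, y):
    """Least-squares fit of y = intercept + slope * x.

    Returns (intercept, slope, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return intercept, slope, 1.0 - ss_res / ss_tot

# Hypothetical values for illustration only (NOT from SUPPORT800.sav).
albumin = [1.8, 2.2, 2.5, 2.9, 3.1, 3.4, 3.8, 4.1]
cost = [95000, 60000, 72000, 30000, 41000, 18000, 22000, 9000]

raw_fit = simple_ols(albumin, cost)
log_fit = simple_ols(albumin, [math.log(c) for c in cost])

print("raw cost:  slope=%.1f  R^2=%.3f" % (raw_fit[1], raw_fit[2]))
print("log cost:  slope=%.3f  R^2=%.3f" % (log_fit[1], log_fit[2]))
# On the log scale, exp(slope) is the multiplicative change in cost
# per 1-unit increase in albumin.
print("cost ratio per unit albumin: %.3f" % math.exp(log_fit[1]))
```

R² is the proportion of the variation in the (possibly transformed) outcome that is explained by the predictor, so it should be interpreted on the scale of the outcome actually modelled.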
2. Titanic1000.sav contains records of 1000 randomly selected passengers from the Titanic. We are interested in assessing the effect of passenger class on death from the tragedy. Answer the following questions.
2.1 Create a bar graph showing the percentage of deaths by passenger class, separately for each gender.
2.2 Conduct simple logistic regressions using death as the outcome variable and passenger class as the predictor variable (treat passenger class as a categorical variable), separately for males and females. Summarize your findings on the association between passenger class and death for males and females, and explain the ORs and 95% CIs in biological terms.
2.3 Now treat passenger class as a continuous variable, and conduct a similar analysis to the above. Summarize your findings on the association between passenger class and death for males and females, and explain the ORs in biological terms. Does your conclusion differ from the above when you treat passenger class as a continuous variable?
2.4 Do you think that the analysis you conducted in 2.3 appropriately explains the association between death and passenger class for males and for females? Please state the reasons for your answer.
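To make the OR and 95% CI requested in question 2 concrete, here is a minimal Python sketch that computes an odds ratio and its Wald (Woolf) confidence interval from a 2x2 table. In a logistic regression whose only predictor is a categorical variable, the OR for one level versus the reference level equals this cross-product ratio. The counts are hypothetical, not taken from Titanic1000.sav, and the function name `odds_ratio_ci` is made up for this sketch.

```python
import math

def odds_ratio_ci(deaths_exposed, survivors_exposed,
                  deaths_ref, survivors_ref, z=1.96):
    """Odds ratio of death (exposed vs. reference) with a Wald 95% CI.

    The CI is built on the log-odds scale (Woolf method).
    Returns (odds_ratio, lower, upper).
    """
    odds_ratio = (deaths_exposed * survivors_ref) / (survivors_exposed * deaths_ref)
    se_log_or = math.sqrt(1 / deaths_exposed + 1 / survivors_exposed
                          + 1 / deaths_ref + 1 / survivors_ref)
    lower = math.exp(math.log(odds_ratio) - z * se_log_or)
    upper = math.exp(math.log(odds_ratio) + z * se_log_or)
    return odds_ratio, lower, upper

# Hypothetical counts for illustration only (NOT from Titanic1000.sav):
# a lower class (exposed) vs. first class (reference).
or_est, ci_low, ci_high = odds_ratio_ci(20, 10, 10, 20)
print("OR = %.2f, 95%% CI %.2f to %.2f" % (or_est, ci_low, ci_high))
```

A confidence interval that excludes 1 corresponds to a statistically significant association at the 5% level.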
3. Head_Neck32.sav contains data from a randomized clinical trial for patients with head and neck cancer comparing two arms: experimental drug and control drug. The survival time in months (SURVTIME) and survival status (STATUS: 1=died, 0=alive) are provided in the dataset. The trialists also recorded the age of each patient (AGE) and the size of the patient’s tumor (TSIZE) at baseline, with T1 + T2 representing smaller tumors and T3 + T4 representing larger tumors. Answer the following questions.
3.1 Plot K-M curves for patients with head and neck cancer by treatment assignment (you don’t need to edit the graph to make it fancy). Compute the median survival time and its 95% CI for both arms. Perform a log-rank test to assess the effect of treatment on patient survival. Also obtain the hazard ratio of death by treatment assignment. Explain your findings in biological terms.
3.2 Plot Kaplan-Meier survival curves for patients with head and neck cancer by treatment assignment, stratified by tumor size (i.e., separately for patients with tumor size T1 + T2 and T3 + T4). Set the x-axis range to 0 to 25 months for both plots. Obtain the hazard ratio of death by treatment assignment to evaluate whether treatment prolongs patient survival, separately among those with T1 + T2 and with T3 + T4.
3.3 Based on the result of 3.2, do you want to present the two hazard ratios separately, or present one hazard ratio of death by treatment? Assess the role of tumor size (confounder or effect modifier) in the association between treatment and death. Does tumor size seem to confound the association between death and treatment? What is your conclusion on the association between death and treatment after adjusting for tumor size?
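For reference on question 3, the Kaplan-Meier estimate behind the K-M curves and median survival time can be sketched in a few lines of Python. The survival times and status values below are hypothetical and are not taken from Head_Neck32.sav; the function names are made up for this sketch, and the 95% CI of the median is not computed here.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier survival estimate.

    times  : follow-up time for each patient
    events : 1 = died, 0 = censored (alive at last follow-up)
    Returns a list of (event_time, survival_probability) pairs.
    """
    event_times = sorted({t for t, e in zip(times, events) if e == 1})
    survival = 1.0
    curve = []
    for t in event_times:
        # Patients still under observation just before time t.
        at_risk = sum(1 for ti in times if ti >= t)
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e == 1)
        survival *= 1 - deaths / at_risk
        curve.append((t, survival))
    return curve

def median_survival(curve):
    """First event time at which estimated survival drops to 0.5 or below."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None  # median not reached during follow-up

# Hypothetical data for illustration only (NOT from Head_Neck32.sav).
survtime = [1, 2, 3, 4]
status = [1, 0, 1, 1]

km_curve = kaplan_meier(survtime, status)
print(km_curve)
print("median survival:", median_survival(km_curve))
```

Censored patients contribute to the number at risk up to their censoring time but never count as deaths, which is why the curve only steps down at observed death times.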
4. Rothman_logistic_modified.sav contains data from a randomized controlled trial of 188 patients with type 2 diabetes. Outcomes were collected at 12 months from study enrollment. The primary outcome was improvement in hemoglobin A1c (HbA1c), coded yes/no for whether HbA1c < 7.0 at 12 months. We want to assess how patients’ literacy level modifies the effect of the intervention vs. control on the study outcome.
Data dictionary:
Status: 1: intervention; 0: control
Ha1c12b7: 1: HbA1c at 12 months < 7.0; 0: otherwise
Realm2a6: Literacy level: 1: High; 0: Low
Age: Age in years at baseline
Gender: 1: Male; 0: Female
Race: 1: African American; 0: non-African American
Duration: Duration of diabetes in years at baseline
Education: 1: <HS; 0: >=HS
Insulin: Insulin use at baseline (1: yes; 0: no)
Income: 0: >20K annual; 1: <=20K annual
4.1 Using binary logistic regression, compute the odds ratio (and p-value) of the intervention compared to the control group for improvement of HbA1c, stratified by (separately for) literacy level. Summarize your findings on the effect of the intervention at each literacy level.
4.2 Do the two literacy-stratified odds ratios seem to differ in the analysis above? Conduct an analysis using logistic regression to compute the OR for the interaction. What does the OR for the interaction mean in biological terms?
4.3 Given the limited power of this analysis, not all covariates can be included in the logistic regression model. Thus, let’s attempt data reduction using the principal components method. Combine all 7 covariates into 2 principal components, and add the two principal components to the logistic regression of 4.2 so that we are able to analyze the effect of the interaction with adjustment for all covariates without losing statistical power.
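As background for the interaction OR in question 4.2: in a logistic regression containing only treatment, literacy, and their product term, the OR for the interaction equals the ratio of the two literacy-stratified treatment ORs. The minimal Python sketch below shows this arithmetic with hypothetical counts, which are not taken from Rothman_logistic_modified.sav; once the principal components of 4.3 are added as covariates, this simple identity no longer holds exactly.

```python
def odds_ratio(success_trt, failure_trt, success_ctl, failure_ctl):
    """Odds ratio of success (HbA1c < 7.0), intervention vs. control."""
    return (success_trt * failure_ctl) / (failure_trt * success_ctl)

# Hypothetical counts for illustration only
# (NOT from Rothman_logistic_modified.sav).
or_low = odds_ratio(15, 20, 5, 30)     # low-literacy stratum
or_high = odds_ratio(20, 30, 15, 35)   # high-literacy stratum

# Ratio of stratum-specific ORs = exp(interaction coefficient) in the
# unadjusted model: status + literacy + status * literacy.
interaction_or = or_high / or_low

print("OR (low literacy):  %.2f" % or_low)
print("OR (high literacy): %.2f" % or_high)
print("interaction OR:     %.2f" % interaction_or)
```

An interaction OR below 1 here would mean the intervention effect is weaker among high-literacy patients than among low-literacy patients; an interaction OR of 1 would mean the effect is the same in both groups.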