X1 Exercises
Hein Stigum
1. Precision and validity
POPs and birth weight Study
Persistent organic pollutants (POPs) are chemical substances that persist in the environment, bioaccumulate through the food web, and pose a risk of causing adverse effects to human health and the environment. This group of pollutants consists of pesticides (such as DDT), industrial chemicals (such as polychlorinated biphenyls, PCBs) and unintentional by-products of industrial processes (such as dioxins and furans).
Animals accumulate POPs in fat through their food; concentrations increase at each step in the food chain. Humans are exposed to POPs through food, mainly from fatty fish. Breastfed babies are exposed through their mother's milk. Some POPs have toxic effects on animals, affecting development, reproduction, the immune system and the uterus.
You want to study the effect of persistent organic pollutants (POPs) on birth weight. You recruited a random sample of 400 women from routine pregnancy check-ups. The mothers sampled milk after birth, and the levels of different POPs in the milk were analyzed. You now have a file with these levels, the birth weight of the babies, information on possible problems during pregnancy, and the usual list of background variables: age, education, …
a) Describe why random errors occur in the study. Where do the effects of random error turn up in the analysis? How much do you need to increase sample size to double the precision? (Hint: the precision is proportional to the square root of the sample size.)
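The hint in a) can be checked numerically. The sketch below assumes that precision is measured by the standard error of a mean, and uses a made-up standard deviation of 500 g for birth weight; only the sample size of 400 comes from the exercise.

```python
import math

def standard_error(sd: float, n: int) -> float:
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sd / math.sqrt(n)

sd = 500.0        # ASSUMPTION: illustrative SD of birth weight in grams
n_original = 400  # sample size stated in the exercise

se_original = standard_error(sd, n_original)
se_quadrupled = standard_error(sd, 4 * n_original)

print(se_original)    # 25.0
print(se_quadrupled)  # 12.5
```

Under this definition of precision, the standard error is halved when the sample size is multiplied by four, whatever value is chosen for the standard deviation.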
b) Describe some sources of systematic errors in the study.
c) You have estimated the crude association between the exposure (one of the POPs) and the outcome (birth weight) in a regression model. You add all the background variables in one block to the model and see some change in the exposure-outcome association. A colleague tells you that you must test the two models against each other with a likelihood ratio test to see if the confounding is real. What is your view on this?
2. Causation
Below are two statements on causation. Discuss these.
a) In a cohort study the exposed and unexposed groups should be as equal as possible, except for exposure, for us to draw conclusions on cause.
b) Likewise, in a case-control study the cases and the controls should be as equal as possible, except for disease, for us to draw conclusions on cause.
3. Additive vs. multiplicative scale
In a study on the effect of depression on mortality you get the following (hypothetical) results:
You calculate the risk of death for depressed versus not depressed, and find a 10 times higher death risk if you are depressed in the young group, and a 4 times higher death risk if you are depressed in the old group. You conclude that: “The effect of depression on mortality decreases with increasing age”.
Some funny bloke tells you to work on the additive scale instead. With much hesitation you calculate the risk differences. Now what is your conclusion?
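The contrast between the two scales can be sketched as follows. The risks below are invented for illustration only: they are chosen to reproduce the risk ratios of 10 and 4 mentioned above, and they are not the numbers from the exercise table.

```python
def risk_ratio(risk_exposed: float, risk_unexposed: float) -> float:
    """Multiplicative scale: ratio of the two risks."""
    return risk_exposed / risk_unexposed

def risk_difference(risk_exposed: float, risk_unexposed: float) -> float:
    """Additive scale: difference between the two risks."""
    return risk_exposed - risk_unexposed

# ASSUMPTION: hypothetical death risks (depressed, not depressed) per age group
hypothetical_risks = {
    "young": (0.010, 0.001),
    "old": (0.40, 0.10),
}

for group, (depressed, not_depressed) in hypothetical_risks.items():
    rr = risk_ratio(depressed, not_depressed)
    rd = risk_difference(depressed, not_depressed)
    print(f"{group}: RR = {rr:.1f}, RD = {rd:.3f}")
```

With these made-up risks the risk ratio falls with age while the risk difference rises, so the direction of the apparent effect modification depends on the scale.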
4. 2 by 2 tables
a) You are planning a cohort study about the effect of organic food consumption on low birth weight. Your funding allows 3000 subjects. You expect 10% with low birth weight, you assume that 3% of the population use organic food, and you want to detect a 30% protection from organic food (that is, OR = 0.7). Below is the calculation of expected precision (at 80% power).
Cohort allocation 1
The results are not very promising, why?
Below is a different allocation of the same number of subjects. What have we changed, and what is achieved?
Cohort allocation 2
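To get a feel for why the share of exposed subjects matters, here is a rough power sketch. It is not the calculation behind the tables above. It assumes a two-sided 5% test of two proportions with a normal approximation, treats the 10% as the risk among the unexposed, and the 50% share in the second call is a hypothetical alternative, not necessarily the one used in allocation 2.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def risk_from_odds_ratio(risk_unexposed: float, odds_ratio: float) -> float:
    """Risk among the exposed implied by an odds ratio and the unexposed risk."""
    odds_exposed = risk_unexposed / (1.0 - risk_unexposed) * odds_ratio
    return odds_exposed / (1.0 + odds_exposed)

def approx_power(n_total: int, share_exposed: float, risk_unexposed: float,
                 odds_ratio: float, z_alpha: float = 1.96) -> float:
    """Approximate power to detect the risk difference (normal approximation)."""
    n_exposed = n_total * share_exposed
    n_unexposed = n_total - n_exposed
    p1 = risk_from_odds_ratio(risk_unexposed, odds_ratio)
    p0 = risk_unexposed
    se = math.sqrt(p1 * (1 - p1) / n_exposed + p0 * (1 - p0) / n_unexposed)
    return normal_cdf(abs(p1 - p0) / se - z_alpha)

# Values from the exercise: 3000 subjects, 3% exposed, 10% risk, OR = 0.7
power_population_share = approx_power(3000, 0.03, 0.10, 0.7)
# ASSUMPTION: hypothetical design with equally many exposed and unexposed
power_equal_groups = approx_power(3000, 0.50, 0.10, 0.7)

print(f"3% exposed:  power = {power_population_share:.2f}")
print(f"50% exposed: power = {power_equal_groups:.2f}")
```

With only 3% exposed, about 90 of the 3000 subjects carry all the information on the exposed group, which is what drives the low power in this approximation.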
b) The next example is a case-control study of the effect of a lack of the mineral selenium in food on the risk of developing tuberculosis. Two different ways of allocating resources are shown. Discuss the pros and cons.
Case-control allocation 1
Case-control allocation 2
5. Sampling
a) In the persistent organic pollutants (POPs) and birth weight study your main suspect for causing low birth weight is a pesticide. Only a few women in your study have high levels of this pesticide, giving you a low variance in the exposure and hence a low power. You decide to enrich your sample with 50 pregnant women from a geographical area where the pesticide is in common use. What effects will this have on your study, how will you analyze, and what regression model will you use?
b) The pesticide analyzed in a) was not responsible for low birth weight. You now have a set of 5 to 6 POPs that are your exposure candidates. These all suffer from the same problem as before: only a few women with high values, that is, low variation in the exposure. You again want to enrich the sample. You reason that it may be smarter to oversample cases of low birth weight; this will automatically give you more variation in the actual exposure (or exposures). You enrich your sample with 50 women with low birth weight babies, taken from the medical birth registry. What effects will this have on your study, how will you analyze, and what regression model will you use?
6. Generalizing results
In a randomized controlled trial the researchers recruited a random sample of 400 subjects at risk of hypertension from an area of low socioeconomic status in the US. The patients were randomized to placebo or drug treatment, but compliance was low; only 60% of the treatment group took the drug. The data were analyzed both by intention to treat and by average treatment effect. Lab results indicate that the effect of the drug on heart disease is independent of sex and race.
a) Intention to treat analysis: the treatment/placebo groups are defined by the randomization, regardless of whether the patients actually took the drug. The proportion with hypertension was 0.2 in the placebo group and 0.14 in the group randomized to treatment. This gives a risk difference of 0.06 and a relative risk of 0.7. Can you generalize this result to other populations, and what method of generalization are you using?
b) Average treatment effect analysis: the treatment group consists of patients who actually took the drug, the placebo group of those who did not. The proportion with hypertension was 0.2 in the untreated group and 0.1 in the group actually treated. This gives a risk difference of 0.1 and a relative risk of 0.5. Can you generalize this result to other populations, and what method of generalization are you using?
c) The average treatment effect may suffer from some confounding. How can this be the case in a randomized trial?
(Although beside the point here: notice that RD(intention to treat) = compliance × RD(treatment), as it should be.)
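The arithmetic in a) and b), and the relation in the closing remark, can be verified with a few lines. All numbers are taken from the exercise text; only the variable names are invented here.

```python
import math

risk_placebo = 0.20     # proportion with hypertension, placebo / untreated
risk_randomized = 0.14  # group randomized to treatment (intention to treat)
risk_treated = 0.10     # group that actually took the drug
compliance = 0.60       # share of the treatment group that took the drug

rd_itt = risk_placebo - risk_randomized
rr_itt = risk_randomized / risk_placebo
rd_treated = risk_placebo - risk_treated
rr_treated = risk_treated / risk_placebo

print(f"ITT:     RD = {rd_itt:.2f}, RR = {rr_itt:.2f}")          # 0.06, 0.70
print(f"Treated: RD = {rd_treated:.2f}, RR = {rr_treated:.2f}")  # 0.10, 0.50

# The intention-to-treat risk difference equals compliance times
# the risk difference among those actually treated.
print(math.isclose(rd_itt, compliance * rd_treated))  # True
```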
Day 2
7. Frequency measures
You want to estimate the risk of developing diabetes type 1 among children. You recruit a (very very) small cohort of 10-year-old children and follow them for 4 years. The graph summarizes the follow-up of the six children in the cohort: x = disease, + = death, line = follow-up.
a) Based on this cohort:
What is the prevalence of diabetes at age 14?
What is the 4-year risk of developing diabetes for 10-year-old children?
b) Assume that the disease studied is not diabetes but otitis media (ear infection):
What is the prevalence of otitis media at age 14?
What is the 4-year risk of developing otitis media for 10-year-old children?
The youth health study
The youth health study included 19 200 individuals from the 10th grade in public and private schools in 6 counties during 2000-2004. They were given a questionnaire on health and lifestyle. 15 948 students answered the questions about sexual debut: Have you ever had sexual intercourse? If so, how old were you the first time? The graph below shows the failure functions from a survival analysis of these data.
c) Based on the graph, what is the approximate risk of having had a debut by the age of 16?
d) In epidemiological terms, what type of measure is this risk (prevalence, incidence proportion or incidence rate)?
e) What type of design is the youth health study (cross sectional, cohort or case control)?
8. Association measures
In the youth health study (referred to above) students were asked about their current tobacco use in the form of smoking or snuff use. The investigators were interested in the sociodemographic patterns of “pure” smokers, “pure” snuff users and combination users. Table 1 shows the results for daily smokers not using snuff (“pure” smokers); snuff and combination users are not shown in this exercise.
Table 1, Daily smoking among students not using snuff
a) What types of frequency measures are used? What types of association measures are used?
b) Calculate the odds of smoking for boys and girls. Calculate the crude (unadjusted) OR and crude RR of daily smoking for girls versus boys.
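As a reminder of the definitions needed in b), here is a minimal sketch. The counts are hypothetical placeholders and are not the numbers from Table 1; substitute the table values to do the exercise.

```python
def odds(cases: int, total: int) -> float:
    """Odds of the outcome: cases divided by non-cases."""
    return cases / (total - cases)

def risk(cases: int, total: int) -> float:
    """Proportion with the outcome: cases divided by everyone."""
    return cases / total

# ASSUMPTION: hypothetical counts, NOT taken from Table 1
smokers_boys, total_boys = 100, 1000
smokers_girls, total_girls = 200, 1000

odds_boys = odds(smokers_boys, total_boys)
odds_girls = odds(smokers_girls, total_girls)

crude_or = odds_girls / odds_boys
crude_rr = risk(smokers_girls, total_girls) / risk(smokers_boys, total_boys)

print(f"odds boys = {odds_boys:.3f}, odds girls = {odds_girls:.3f}")
print(f"crude OR = {crude_or:.2f}, crude RR = {crude_rr:.2f}")
```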
c) Looking again at the effect of girls versus boys you notice that the (adjusted) odds ratio and relative risk from the models are different, 3.0 versus 2.4. You discuss this with your colleagues and get four different suggestions: 1) This is caused by confounding, probably from a difference in educational plans for girls and boys. 2) This is caused by the high prevalence of smoking, particularly among girls. 3) The difference is not important, it is only marginally significant since the confidence intervals are touching (they are in fact just overlapping). 4) This is caused by global climate change. What is your opinion?
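One way to explore the suggestions in c) is to see how an odds ratio translates into a relative risk at different baseline risks. The sketch below uses the standard identity RR = OR / (1 - p0 + p0 × OR), where p0 is the risk in the reference group. The baseline risks are illustrative assumptions, and the identity applies to crude measures, so it only approximates what adjusted models do.

```python
def odds_ratio_to_risk_ratio(odds_ratio: float, baseline_risk: float) -> float:
    """Relative risk implied by an odds ratio at a given reference-group risk."""
    return odds_ratio / (1.0 - baseline_risk + baseline_risk * odds_ratio)

# ASSUMPTION: illustrative baseline risks, not read from Table 1
for baseline_risk in (0.001, 0.05, 0.125, 0.30):
    rr = odds_ratio_to_risk_ratio(3.0, baseline_risk)
    print(f"baseline risk {baseline_risk:.3f}: OR = 3.0 corresponds to RR = {rr:.2f}")
```

When the outcome is rare the two measures nearly coincide; as the outcome becomes more common, the odds ratio moves further from the relative risk.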
d) (Probably difficult!) I have a son in 10th grade with academic plans. I am (currently) living with my wife and you may assume that my family economy is good. Can you tell from the RR-model what the probability is that my son is a smoker? I now supply the extra information that the exponential of the constant term from the model equals 0.043, and that this is the expected prevalence for a person in the reference category for all the covariates. Can you now answer the previous question?
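The logic of d) can be sketched as follows, assuming the RR-model is multiplicative (a log-link model), so that the predicted prevalence is the baseline prevalence times the relative risks of every covariate level that differs from the reference. The value 0.043 is from the exercise; the relative risk of 2.0 is a hypothetical placeholder, since which categories are the reference must be read from Table 1.

```python
import math

def predicted_prevalence(baseline_prevalence: float, relative_risks: list) -> float:
    """Baseline prevalence multiplied by the RR of each non-reference covariate level."""
    return baseline_prevalence * math.prod(relative_risks)

baseline = 0.043  # exp(constant term), given in the exercise

# A person in the reference category on every covariate: no RR is applied
print(predicted_prevalence(baseline, []))

# ASSUMPTION: one covariate off-reference with a hypothetical RR of 2.0
print(predicted_prevalence(baseline, [2.0]))
```

Without the constant term, the relative risks alone only say how prevalences compare between groups, not what any of them is.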
9. Attributable fractions
Data from the Norwegian Mother and Child Cohort show that obesity is a risk factor for caesarean section. Below is a table of BMI categories, and a table of the frequency of caesarean section for each category of BMI.
a) Assuming that obesity causes caesarean section, calculate by what percentage the number of caesarean sections would drop in this population if obese women were (miraculously) turned into normal weight women.
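The calculation in a) can be organized as a comparison of observed and counterfactual numbers of caesarean sections. The counts and risks below are hypothetical placeholders, not the values from the tables; the category names are also assumptions.

```python
def fraction_prevented(counts: dict, risks: dict,
                       from_category: str, to_category: str) -> float:
    """Share of cases removed if one category were given another category's risk."""
    observed = sum(counts[c] * risks[c] for c in counts)
    counterfactual_risks = dict(risks)
    counterfactual_risks[from_category] = risks[to_category]
    expected = sum(counts[c] * counterfactual_risks[c] for c in counts)
    return (observed - expected) / observed

# ASSUMPTION: hypothetical numbers of women and caesarean risks per BMI category
counts = {"normal": 700, "overweight": 200, "obese": 100}
risks = {"normal": 0.10, "overweight": 0.15, "obese": 0.20}

paf_obesity = fraction_prevented(counts, risks, "obese", "normal")
print(f"Caesarean sections would drop by {100 * paf_obesity:.1f}%")
```

This crude version ignores confounding, which is one thing to keep in mind when comparing it with a model-based attributable fraction.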
b) Below are the results from a logistic regression of caesarean section on BMI and some possible confounders. The attributable fraction of obesity is calculated from the model (using “aflogit” in Stata). Should the population attributable fractions from a) and b) be the same?