To Describe Trends in Data, and to Verify That Random Sampling Works

Lab Objective

To describe trends in data, and to verify that random sampling works.

To begin answering the questions that motivate studies in the first place by constructing and interpreting credible intervals (hypothesis testing).

Questions:

Read in the dataset agpop from the website, then answer the following questions:

1.) How is missing data encoded in this dataset? How would it influence our subsequent inquiries into this data if we ignored the fact that this represents missing data and carried on with analysis as usual? Additionally, how would excluding these data influence our inquiries?

Run the code I’ve given you to remove all rows with missing values from the dataset.

This is perhaps not the best thing to do. Often we 'impute' the missing data by comparing the rows with missing cells to other observations that have similar values on the variables we did record.

Imputation is serious business, though, and you should be extremely careful when doing it. Consulting a statistician is probably best in this case.

2.) Row 1642 contains the data for Durham County, NC. What is the trend in total acres devoted to farmland in Durham County, NC from 1982 to 1992? That is, did Durham become more or less agriculturally based over those 10 years? Report numbers to back up your claims.

3.) What is the trend in total acres devoted to farmland in the state of North Carolina over the same period?

4.) How about the trend for Wibaux County, MT? (It’s row 1611)
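The trend questions above come down to selecting a row (or a set of rows) and comparing two year columns. The following is only a minimal sketch on a toy data frame: the column names `county`, `state`, `acres82` and `acres92` and all the values are hypothetical, so check `names(agpop)` for the real column names before adapting it.

```r
# Toy stand-in for agpop; column names and values are hypothetical.
toy <- data.frame(
  county  = c("A", "B", "C", "D"),
  state   = c("NC", "NC", "MT", "MT"),
  acres82 = c(1000, 2000, 5000, 4000),
  acres92 = c(800, 1900, 5200, 4100),
  stringsAsFactors = FALSE
)

# Trend for a single county, selected by row number.
one_county <- toy[1, ]
change <- one_county$acres92 - one_county$acres82
pct_change <- 100 * change / one_county$acres82
cat(one_county$county, ": change =", change, "acres (", pct_change, "%)\n")

# Trend for a whole state: total the acres across its counties first.
in_state <- toy$state == "NC"
state82 <- sum(toy$acres82[in_state])
state92 <- sum(toy$acres92[in_state])
state_pct_change <- 100 * (state92 - state82) / state82
cat("State total:", state82, "->", state92, "acres (", state_pct_change, "%)\n")
```

Reporting both the absolute and the percentage change makes it easier to compare a small county with a large one.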

5.) Take a random sample of 500 counties (there are 3041 remaining counties). Based on comparisons between the sample means and population means, does it seem that picking counties at random provides a representative sample? Once you have generated your sample, store the sampled row numbers in a vector named samples and use the following code to compare overall and sample means:

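for (i in 1:12) {
  print(names(agpop)[i + 1])          # name of the variable being compared
  print(mean(agpop[samples, i + 1]))  # mean over the sampled counties
  print(mean(agpop[, i + 1]))         # mean over all counties
  print('****')                       # separator between variables
}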
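The loop above expects a vector of row numbers called `samples`. One way to produce it is sketched below on a toy data frame; the seed and the toy columns `id`, `x1` and `x2` are assumptions made purely for illustration and are not part of agpop.

```r
set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# Toy stand-in for agpop: an ID column followed by numeric columns.
n_pop <- 3041
toy <- data.frame(
  id = seq_len(n_pop),
  x1 = rnorm(n_pop, mean = 100, sd = 20),
  x2 = rexp(n_pop, rate = 1 / 50)
)

# Draw 500 distinct row indices; sample() is without replacement by default.
samples <- sample(nrow(toy), size = 500)

# Compare sample and population means for every numeric column at once.
comparison <- rbind(
  sample     = colMeans(toy[samples, -1]),
  population = colMeans(toy[, -1])
)
print(comparison)
```

Changing `size = 500` to `size = 200` gives the smaller sample asked for in the next question.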

6.) Same as number 5, except take a sample of only 200 counties. Compare your results to those from number 5.

Suppose we follow a group of women of similar ethnicities and age in a prolonged survey to see what their risk of breast cancer is. Upon completing the study, we observe that 8 women out of the 92 we surveyed developed breast cancer.

7.) Calculate the posterior distribution of the probability of developing breast cancer over the course of the study. What is your best guess for the value of the probability of developing breast cancer over the course of the study?

8.) Calculate a 95% credible interval for the probability of developing breast cancer over the course of the study.
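As a rough sketch of how questions 7 and 8 might be approached: with a binomial likelihood and a Beta prior, the posterior is again a Beta distribution. The lab does not state a prior, so the uniform Beta(1, 1) prior below is an assumption; use whichever prior was specified in class. The interval shown is the equal-tailed one, which is one common choice among several.

```r
y <- 8    # women who developed breast cancer
n <- 92   # women surveyed

# Assumed prior: Beta(1, 1), i.e. uniform on [0, 1]. Not specified in the lab.
a_prior <- 1
b_prior <- 1

# Conjugate update for a binomial likelihood.
a_post <- a_prior + y
b_post <- b_prior + (n - y)

# Two common point estimates ("best guesses") from the posterior.
post_mean <- a_post / (a_post + b_post)
post_mode <- (a_post - 1) / (a_post + b_post - 2)

# Equal-tailed 95% credible interval from the Beta quantile function.
ci95 <- qbeta(c(0.025, 0.975), a_post, b_post)

print(c(mean = post_mean, mode = post_mode))
print(ci95)
```

Under the uniform prior the posterior mode equals the observed proportion 8/92, while the posterior mean is pulled slightly toward 1/2.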

9.) Test the hypothesis: The probability of developing breast cancer over the course of the study is 0.20.

What would we have concluded if we were testing the hypothesis that the probability of developing breast cancer over the course of the study is 0.15?

How about 0.05?
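One simple way to examine the hypothesized values is to check whether each one falls inside the 95% credible interval. The sketch below does this under the same assumed uniform Beta(1, 1) prior, which the lab does not specify; a different prior or a different kind of interval could change a borderline result.

```r
# Posterior under an assumed uniform Beta(1, 1) prior: Beta(1 + 8, 1 + 84).
a_post <- 1 + 8
b_post <- 1 + (92 - 8)
ci95 <- qbeta(c(0.025, 0.975), a_post, b_post)

# Hypothesized values from the questions above.
p0 <- c(0.20, 0.15, 0.05)
inside <- p0 >= ci95[1] & p0 <= ci95[2]

print(data.frame(p0 = p0, inside_ci95 = inside))
```

A value outside the interval is one the data make implausible; a value inside it is not ruled out, which is weaker than saying it has been shown to be true.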