Homework #2: Central Tendency and Variability

Psych 31011

Keller 10/22/18

Name:______

Due: 2/9/2012

Part 1: To be done in R:

Question 1:

A) Select one of the variables listed below from the fcq dataset and write a description of it. To load fcq into R, as well as lots of other datasets we’ll be using for this and other HWs, just type this into R:

load(url("

Use the ls() command to view the available datasets; use ls(fcq) to view the variable names in the fcq dataset.

Base your description on measures of the center (typical value) and an examination of the histogram. Describe the shape of the distribution and give a verbal description of its spread: whether most scores are close to the measure of central tendency (little spread) or whether scores tend to be far away from it (large spread).
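As a rough sketch of the kind of numbers and plot this description can be built on, the lines below use simulated ratings rather than the real fcq data. The vector name ratings and its values are made up for illustration only.

```r
# Simulated stand-in for one fcq variable (hypothetical values, not real data)
set.seed(1)
ratings <- round(rnorm(200, mean = 3, sd = 0.4), 2)

center_mean   <- mean(ratings, na.rm = TRUE)    # arithmetic average
center_median <- median(ratings, na.rm = TRUE)  # middle value
spread_sd     <- sd(ratings, na.rm = TRUE)      # standard deviation
spread_iqr    <- IQR(ratings, na.rm = TRUE)     # width of the middle 50% of scores

print(c(mean = center_mean, median = center_median,
        sd = spread_sd, IQR = spread_iqr))

# The histogram shows the shape (symmetric or skewed, number of peaks)
hist(ratings)
```

With the real data you would replace ratings with the fcq variable you chose, for example fcq$instructor.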

B) When there are many observations, another useful plot in R is the estimated density function, which is really a smoothed histogram. For example, to obtain this plot for the course variable, you would use:

plot(density(fcq$course))

Note the use of “$” to ‘grab’ the variable or column of the dataset that you are interested in. You can also control the number of bins in the histogram. Again, with many observations, a greater number of categories or bins (or breaks) than the default number is appropriate. For example, try

hist(fcq$course, breaks=30)

Select one of the following variables from the fcq dataset and generate the density function and histogram:

instructor – the average student rating of the instructor on 0 to 4 scale

percentReturn – the percentage of enrolled students returning valid forms

nEnrolled – the official size of the class

avgGrade – the average grade received by students in the class

percentA – the percentage of students in the class receiving grades of A

percentD_F – the percentage of students in the class receiving D’s or F’s

NOTE: variable names in R are case-sensitive, so “avgGrade” is not the same as “avggrade.” Also, remember that each observation in this dataset corresponds to an entire class of students. There is no information about individual students in this dataset. To simplify typing, you can assign a new, shorter name to any variable. For example, to avoid having to type nEnrolled, you could first assign the name size as:

size <- fcq$nEnrolled

and then simply type size whenever you wanted to refer to nEnrolled.
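The renaming trick and the two plots can be tried together on made-up data. This is only a sketch: the data frame fcq_demo and its class sizes are simulated, not taken from the real fcq dataset.

```r
# Simulated stand-in for the fcq dataset (hypothetical values, not real data)
set.seed(2)
fcq_demo <- data.frame(nEnrolled = rpois(500, lambda = 40))

size <- fcq_demo$nEnrolled    # shorter name for the column

# Histogram: breaks = 30 is treated as a suggestion; R may pick a
# nearby number of bins so that the cut points fall on round values
h <- hist(size, breaks = 30)

# Estimated density function (a smoothed histogram)
d <- density(size)
plot(d)
```

Because breaks is only a suggestion when given as a single number, the histogram you get may not have exactly 30 bins.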

C) Write a summary describing the variable you chose above. Begin your paragraph with a sentence suggesting why we would be interested in this variable and end your paragraph with a no-number sentence stating some conclusion or overall summary.

Question 2:

In the survey you completed on the first day of class, you were asked how many times per day you check email. We’re interested in the following question: do lower division students (freshmen and sophomores) check email more than upper division students (juniors and seniors)?

A) Create and interpret two side-by-side boxplots to help figure out this question:

boxplot(survey$email_check ~ survey$class_status)

Describe what the two boxplots are telling you (i.e., describe the measures of central tendency and spread and whether these statistics are different between upper vs. lower division status).
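To see where the numbers behind a boxplot come from, here is a small sketch on simulated data. The data frame survey_demo, its values, and the group label "UP" are assumptions made for illustration; check the real survey dataset for the actual labels.

```r
# Simulated stand-in for the survey dataset (hypothetical values and labels)
set.seed(3)
survey_demo <- data.frame(
  class_status = rep(c("LOW", "UP"), each = 30),
  email_check  = c(rpois(30, lambda = 6), rpois(30, lambda = 4))
)

# Same plot as above, written with a data argument and axis labels
b <- boxplot(email_check ~ class_status, data = survey_demo,
             xlab = "Class status", ylab = "Email checks per day")

# One column per group; rows are: lower whisker, lower hinge (about Q1),
# median, upper hinge (about Q3), upper whisker
print(b$stats)

# The medians can also be computed directly for each group
group_medians <- tapply(survey_demo$email_check, survey_demo$class_status, median)
print(group_medians)
```

The thick line in each box is the median, and the height of the box (upper hinge minus lower hinge) is a measure of spread.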

B) How sure are you that lower division students at CU check email more frequently? Is it likely that the two sample means or medians will be exactly the same, even if the population means are equal? Why or why not?

C) What is the standard deviation of how often lower division students check their email? What is it for upper division students? Explain both of these numbers in plain English in a way that your grandma could understand. To do this in R for lower division students, do this:

sd(survey$email_check[survey$class_status=="LOW"],na.rm=TRUE)
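If you are not sure how the upper division group is labeled, one option is to let R compute the standard deviation for every group at once. The sketch below uses simulated data; survey_demo and the label "UP" are made up for illustration.

```r
# Simulated stand-in for the survey dataset (hypothetical values and labels)
set.seed(4)
survey_demo <- data.frame(
  class_status = rep(c("LOW", "UP"), each = 30),
  email_check  = c(rpois(30, lambda = 6), NA, rpois(29, lambda = 4))
)

# Apply sd() separately within each level of class_status;
# na.rm = TRUE is passed on to sd() so missing answers are skipped
sd_by_group <- tapply(survey_demo$email_check, survey_demo$class_status,
                      sd, na.rm = TRUE)
print(sd_by_group)

# Same value as the one-group-at-a-time command shown above
sd_low <- sd(survey_demo$email_check[survey_demo$class_status == "LOW"],
             na.rm = TRUE)
```

On the real data, table(survey$class_status) lists the group labels that are actually used.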

Question 3:

In this class, we will often ask you for a “four sentence summary of results.” (NOTE: the four sentence summary will be on all 4 tests you take in this class). The four sentence summary mirrors the four sections of an APA formatted paper, and is made up of the following:

1st sentence: The Introduction. State the problem, or what you are interested in looking at.

2nd sentence: The Method. How did you go about solving this problem?

3rd sentence: The Results. What did you find?

4th sentence: The Discussion. What is your conclusion?

For example, let’s say we’re interested in whether female undergraduates check their email more frequently than male undergraduates. We could use our survey results to try to answer this in the following way:

Example 4-sentence summary

We are interested in whether females check their email more frequently on average than males do. To investigate this problem, we asked 71 females and 37 males enrolled in an undergraduate statistics class at the University of Colorado how many times they checked their emails per day. We found that females check their emails 5.6 times per day on average (SD=5.56) whereas males check their emails 3.8 times per day (SD=3.97). We conclude that female undergraduates do indeed appear to check their email more frequently, although we cannot say for certain if this difference arose by chance or not (exists only in this sample but not in the population).

I used the following R syntax to find these results:

males <- survey[survey$gender=="M",] #this gave me a dataset of all the males

females <- survey[survey$gender=="F",] #this gave me a dataset of all the females

mean(males$email_check, na.rm=TRUE)

sd(males$email_check, na.rm=TRUE)

mean(females$email_check, na.rm=TRUE)

sd(females$email_check, na.rm=TRUE)
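The same means and standard deviations can also be obtained in one step, which is handy when you compare your own two groups below. This is a sketch on simulated data: survey_demo and its values are hypothetical, and aggregate() is just one of several ways to do this.

```r
# Simulated stand-in for the survey dataset (hypothetical values)
set.seed(5)
survey_demo <- data.frame(
  gender      = rep(c("F", "M"), times = c(20, 15)),
  email_check = c(rpois(20, lambda = 6), rpois(15, lambda = 4))
)

# Mean, standard deviation, and group size of email_check for each gender;
# with the formula interface, rows with missing values are dropped by default
summary_by_gender <- aggregate(
  email_check ~ gender, data = survey_demo,
  FUN = function(x) c(mean = mean(x), sd = sd(x), n = length(x))
)
print(summary_by_gender)
```

Each row of the output is one group, which gives you the numbers needed for the Results sentence of a four sentence summary.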

Choose any dependent variable in the “survey” dataset you wish, and compare this dependent variable (its means and standard deviations) across two groups (males vs. females; employed vs. not; people with boy/girlfriends vs. not; people in fraternities/sororities vs. not; people from Colorado vs. not; people who expect to get A’s vs. not, etc.). The dependent variable you choose should be a continuous or interval variable. The independent or quasi-independent variable you choose should be nominal.

A) Identify the independent or quasi-independent variable.

B) Identify the dependent variable.

C) What type of study are you conducting (e.g., an experiment?)? Can you make causal inferences that the independent variable caused a change in your dependent variable? Why or why not?

D) Attach a boxplot to your HW comparing the two groups on your dependent variable (as was done in Question 2A). Comment on what the boxplot shows you.

E) Attach a histogram of the dependent variable to your HW. Comment on what the histogram shows you. To make a histogram in R:

hist(x) #where x is your dependent variable

F) When R calculated the standard deviation of your dependent variable, did it divide by n or by n-1? Why?

G) Try writing up a four sentence summary of your findings (good practice for when you will do it on a test!).

Part 2: To be done by hand:

Question 1:

For the following set of scores: 33 26 20 8 12 37 25 34 29 26 30 33 15 35 38 31

A) Draw a stem-and-leaf plot.

B) Compute the mean, median, and mode (show your work).

C) What is the range? What is a disadvantage of the range statistic?

D) What are the median, 1st quartile, 3rd quartile, and inter-quartile range?

E) Which measure of central tendency do you think is best for this distribution? Why?

Question 2:

A sample of n = 5 scores has a mean of 10. One new score is added to the sample and the new mean is computed to be 11. What is the value of the score that was added to the sample? Show your work.

Question 3:

The median height in a sample is 70 inches. The (very tall) basketball player Shaq O’Neal is part of this sample.

A) Describe the likely shape of the distribution of heights.

B) If there are 9 people in the sample, what percent of the sample has a height less than 70?

C) Would you guess that the mean is greater than, equal to, or less than 70? Why?

D) If we remove Shaq from the sample and add in John Doe (who is 72 inches tall), what is the median of this new sample?

Question 4:

A) John (an imaginary friend of yours) says that Mark is a jerk. How sure are you that Mark is a jerk? What other possibilities might exist for why John said this other than that Mark really is a jerk?

B) Now say that Jill, James, Julie, and Janice also said that Mark is a jerk. Does this change your subjective guess (i.e., your internal probability) that Mark really is a jerk? Why or why not? Try to explain this in a very specific (preferably mathematical) way.

C) Now say that you found out that Jill, James, Julie, and Janice are all friends of John’s, and that you find out that John had been talking to the four of them about Mark before they told you that Mark is a jerk. Does this new information change your subjective guess about whether Mark is a jerk? Why or why not (again, try to be very specific)?

Question 5:

Explain in words what sums of squares, degrees of freedom, and variance are:

Question 6:

A) Let us say that somehow we know the TRUE POPULATION MEAN of a set of numbers (that is, whatever you get for the sample mean is also the population mean). Given this information, write the formulas for the variance and the standard deviation for the population:

B) What is wrong with your answer above if we do not know the true population mean of the underlying distribution from which those numbers were drawn? Describe the issue in words and then write the correct formulas for the variance and the standard deviation for the sample below: