Exploratory Data Analysis Project
Megan Erickson
2-22-18
Introduction to the ‘depress’ data set
This project uses the depress data set. Within this data set, we’re going to explore the variables (‘marital’), (‘income’), (‘age’), (‘drink’), and level of depression (‘cesd’). The data was obtained from Robin Donatello’s teaching course website. Additional details on this study can be found in Practical Multivariate Analysis, 5th edition by Afifi, May and Clark.
There are 294 observations of 37 variables in the ‘depress’ data set.
depress <-read.delim("/Users/meganerickson/Documents/Spring 2018/MATH 130/data/depress_081217.txt", header=TRUE,sep="\t")
head(depress)
## id sex age marital educat employ income relig c1 c2 c3 c4 c5 c6
## 1 1 1 68 Widowed Some HS Retired 4 1 0 0 0 0 0 0
## 2 2 0 58 Divorced Some college FT 15 1 0 0 1 0 0 0
## 3 3 1 45 Married HS Grad FT 28 1 0 0 0 0 1 0
## 4 4 1 50 Divorced HS Grad Unemp 9 1 0 0 0 0 1 1
## 5 5 1 33 Separated HS Grad FT 35 1 0 0 0 0 0 0
## 6 6 0 24 Married HS Grad FT 11 1 0 0 0 0 0 0
## c7 c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 cesd cases drink
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## 2 0 0 0 0 0 1 0 0 1 0 1 0 0 0 4 0 1
## 3 0 0 0 0 0 0 0 1 1 1 0 0 0 0 4 0 1
## 4 0 3 0 0 0 0 0 0 0 0 0 0 0 0 5 0 0
## 5 0 3 3 0 0 0 0 0 0 0 0 0 0 0 6 0 1
## 6 0 0 1 0 0 1 2 0 0 2 1 0 0 0 7 0 1
## health regdoc treat beddays acuteill chronill
## 1 2 1 1 0 0 1
## 2 1 1 1 0 0 1
## 3 2 1 1 0 0 0
## 4 1 1 0 0 0 1
## 5 1 1 1 1 1 0
## 6 1 1 1 0 1 1
I used a combination of dplyr, knitr, and ggplot2 packages for this project.
library(dplyr)
library(knitr)
library(ggplot2)
opts_chunk$set(warning=FALSE, message=FALSE, fig.height=4, fig.width=5, fig.align='center')
Univariate
We’re going to start analyzing our data by describing some univariants.
- Let’s calculate the mean and standard deviation of level of depression (cesd) in the sample.
mean(depress$cesd)
## [1] 8.884354
sd(depress$cesd, na.rm =FALSE)
## [1] 8.823655
The mean level of depression in the sample is 8.9. 0 is lowest level possible level and 60 is the highest possible level. The standard deviation is 8.8.
- Let’s look at box plot of the level of depression ‘cesd’ within the data set.
boxplot(depress$cesd, horizontal =TRUE, main="cesd", xlab="level")
This distribution shows us where the bulk of participants lie in terms of level of depression. Outliers are represented as dots.
- Look at a table of marital status within the data set.
table(depress$marital)
##
## Divorced Married Never Married Separated Widowed
## 43 127 73 13 38
- Before we dig deeper into this variable, we should know its class.
class(depress$marital)
## [1] "factor"
- If we wanted, we could use filter() to select only the participants who are married to get a closer look; (marital) is “Married”.
d1 <-depress %>%filter(marital == "Married")
head(d1)
## id sex age marital educat employ income relig c1 c2 c3 c4 c5 c6 c7
## 1 3 1 45 Married HS Grad FT 28 1 0 0 0 0 1 0 0
## 2 6 0 24 Married HS Grad FT 11 1 0 0 0 0 0 0 0
## 3 7 1 58 Married Some HS Houseperson 11 1 2 1 1 2 1 0 0
## 4 9 1 47 Married HS Grad Retired 23 2 0 1 1 0 0 3 0
## 5 10 0 30 Married Some HS FT 35 4 0 0 0 0 0 0 0
## 6 12 1 57 Married HS Grad PT 24 1 0 0 0 0 0 0 0
## c8 c9 c10 c11 c12 c13 c14 c15 c16 c17 c18 c19 c20 cesd cases drink
## 1 0 0 0 0 0 0 1 1 1 0 0 0 0 4 0 1
## 2 0 1 0 0 1 2 0 0 2 1 0 0 0 7 0 1
## 3 2 2 0 0 0 0 0 3 0 0 0 0 1 15 0 0
## 4 0 0 0 0 3 0 3 2 3 0 0 0 0 16 1 1
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 6 0 0 0 0 0 0 0 2 2 0 0 0 0 4 0 0
## health regdoc treat beddays acuteill chronill
## 1 2 1 1 0 0 0
## 2 1 1 1 0 1 1
## 3 3 1 1 0 1 1
## 4 4 1 1 1 0 1
## 5 1 1 0 0 0 0
## 6 2 1 1 1 1 1
- We can get an even better idea of marital distribution my making a bar chart.
ggplot(depress, aes(x=marital)) +geom_bar()
This bar chart shows marital frequencies along the x-axis, where the height of the bars is determined by frequencies seen in the table above (in section 3).
- Let’s take a look at a table of drinking status (drink).
table(depress$drink)
##
## 0 1
## 60 234
- We can create a new factor variable is_drinker from drink with labels “non-drinker” and drinker“. This makes the table easier to read.
depress$is_drinker <-factor(depress$drink, labels=c("non-drinker", "drinker"))
table(depress$is_drinker)
##
## non-drinker drinker
## 60 234
There are 60 non-drinkers and 234 drinkers in the ‘depress’ data set.
- Here is a bar chart of non-drinkers and drinkers.
ggplot(depress, aes(x=is_drinker)) +geom_bar()
This bar chart shows drinker frequencies along the x-axis, where the height of the bars is determined by frequencies seen in the table above.
- Let’s calculate the average and standard deviation of income of participants in the data set.
mean(depress$income)
## [1] 20.57483
sd(depress$income, na.rm =FALSE)
## [1] 15.29012
The mean income of participants is approximately 20,000 dollars per year. The standard deviation is approximately 15,000 dollars per year.
- A histogram of income (income).
ggplot(depress, aes(x=income)) +geom_histogram(binwidth =5)
The height of the bars in this histogram display the frequency of income values in thousands of dollars. The x-axis is continuous.
- We can also re-create the ‘income’ histogram with an overlaid density plot.
ggplot(depress, aes(x=income)) +geom_density(col="purple") +
geom_histogram(aes(y=..density..), colour="black", fill=NA, binwidth =5)
The density curve gives us a better idea of the distribution of ‘income’. The height of the bars in this histogram display the frequency of income values in thousands of dollars. The x-axis is continuous.
- Let’s create a horizontal box plot of the ages of participants in the study (age).
boxplot(depress$age, horizontal =TRUE, main="Age", xlab="Years")
This distribution shows us where the bulk of participants lie in terms of age. It looks like the mean age of participants is between 40 and 45 years old. The minumum age is between 15 and 20 years old, and the maximum age is almost 90 years old.
- To confirm our box plot is accurately displaying the data, we can quickly calculate some statistics of the ‘age’ variable.
mean(depress$age)
## [1] 44.41497
sd(depress$age, na.rm =FALSE)
## [1] 18.08544
min(depress$age)
## [1] 18
max(depress$age)
## [1] 89
The mean age of participants in the sample is 44.4 years old. The standard deviation is about 18 years. The youngest person in the sample is 18 years old, and the oldest person is 89 years old.
Bivariate
- Now we’ll make some bivariate comparisons. We can create a two-way frequency table of marital status (‘marital’) against drinker status (‘is_drinker’).
table(depress$marital, depress$is_drinker)
##
## non-drinker drinker
## Divorced 9 34
## Married 29 98
## Never Married 10 63
## Separated 4 9
## Widowed 8 30
- I can now create a proportion table of drinking habit within marital status and round to one digit.
prop.table(table(depress$marital, depress$is_drinker))
##
## non-drinker drinker
## Divorced 0.03061224 0.11564626
## Married 0.09863946 0.33333333
## Never Married 0.03401361 0.21428571
## Separated 0.01360544 0.03061224
## Widowed 0.02721088 0.10204082
print(prop.table(table(depress$marital, depress$is_drinker)), digits =1)
##
## non-drinker drinker
## Divorced 0.03 0.12
## Married 0.10 0.33
## Never Married 0.03 0.21
## Separated 0.01 0.03
## Widowed 0.03 0.10
- Now let’s see a grouped bar chart that reflects the proportions of drinking status within marital status.
ggplot(depress, aes(x=is_drinker, fill=marital)) +geom_bar()
Our grouped bar chart agrees with our above tables. According to the data, most participants in the data set who are drinkers are also married.
- A scatter plot of the variable income in thousands of dollars per year and level of depression cesd. I included a smoother line in orange, and a best fit linear model line in blue.
ggplot(depress, aes(x=income, y=cesd)) +geom_point() +
geom_smooth(se=FALSE, method="lm", color="blue") +
geom_smooth(se=FALSE, color="orange")
This is not the prettiest-looking scatter plot. However, it still shows a negative relationship between level of depression (‘cesd’) and income. That is, people with lower incomes tend to have higher levels of depression.
- Grouped box plots of level of depression cesd by drinker status is_drinker.
ggplot(depress, aes(x=is_drinker, y=cesd, fill=is_drinker)) +geom_boxplot()
From this plot, we can see drinkers have a slightly higher mean level of depression than non-drinkers.
- We can replicate the same plot as above, but overlay a violin plot and change the transparency of both violin and box plot layers.
ggplot(depress, aes(x=is_drinker, y=cesd, fill=is_drinker)) +
geom_violin(alpha=.1) +
geom_boxplot(alpha=.5, width=.2)
The violin plot shows that most participants, whether drinkers or non-drinkers, have relatively low levels of depression. Outliers are represented as dots.
- Finally, we can create side-by-side histogram and overlaid density plots of income by drinking status is_drinker.
e1 <-ggplot(depress, aes(x=income, col=is_drinker)) +geom_histogram(binwidth =12)
e2 <-ggplot(depress, aes(x=income, col=is_drinker)) +geom_density(binwidth =12)
gridExtra::grid.arrange(e1,e2, ncol=2)
From the histogram and overlaid density plot, we can see that there are more non-drinkers with low income than there are drinkers with low income.