Project #1 Answers
STAT 873
Fall 2013
Complete the following problems below. Within each part, include your R program output with code inside of it and any additional information needed to explain your answer. Note that you will need to edit your output and code in order to make it look nice after you copy and paste it into your Word document.
This problem is partially based on an example in Chapters 2 and 5 of Johnson (1998). Below is the description of the data from Johnson:
Forty-eight individuals who had applied for a job with a large firm were interviewed and rated on 15 criteria. Individuals were rated on
- The form of their letter of application (FL)
- Their appearance (APP)
- Their academic ability (AA)
- Their likability (LA)
- Their self-confidence (SC)
- Their lucidity (LC)
- Their honesty (HON)
- Their salesmanship (SMS)
- Their experience (EXP)
- Their drive (DRV)
- Their ambition (AMB)
- Their ability to grasp concepts (GSP)
- Their potential (POT)
- Their keenness to join (KJ)
- Their suitability (SUIT)
Each criterion was evaluated on an integer scale ranging from 0 to 10 with 0 being a very low and very unsatisfactory rating, and 10 being a very high rating.
The data is in the file job_applicant.csv which is available on my website. Using this data, complete the following.
1)(10 points) Examine appropriate plots of the data and interpret them in the context of the problem. In your interpretations, make sure to specifically indicate which individuals are overall good applicants.
I used 32-bit R to complete this project.
In order to represent all 15 variables on a plot at once, the most appropriate plots are star plots and parallel coordinate plots.
> set1<-read.table(file = "C:\\chris\\job_applicant.csv", header = TRUE, sep = ",")
head(set1)
Applicant FL APP AA LA SC LC HON SMS EXP DRV AMB GSP POT
1 1 6 7 2 5 8 7 8 8 3 8 9 7 5
2 2 9 10 5 8 10 9 9 10 5 9 9 8 8
3 3 7 8 3 6 9 8 9 7 4 9 9 8 6
4 4 5 6 8 5 6 5 9 2 8 4 5 8 7
5 5 6 8 8 8 4 4 9 5 8 5 5 8 8
6 6 7 7 7 6 8 7 10 5 9 6 5 8 6
KJ SUIT
1 7 10
2 8 10
3 8 10
4 6 5
5 7 7
6 6 6
win.graph(width = 11, height = 7)
stars(x = set1[,-1], draw.segments = TRUE, key.loc = c(20,10), main = "Job applicant star plot", labels = set1$Applicant)
Because larger ratings for each criterion are better, we are looking for stars with large rays extending out from the center. Stars with this quality include #39 and #40, for whom only appearance (APP) is somewhat middling in comparison to the others. Applicants 2, 7, 8, 9, 22, 23, and 24 tend to have most of their rays being large. Applicants 28 and 29 tend to have the smallest rays overall, so they may be the least desirable to hire. There are a number of other individuals with small values for some criteria but large values in other areas. For example, applicants 41 and 42 stand out as having a lot of experience (EXP), a well-formed letter of application (FL), high suitability (SUIT), and good academic ability (AA), but they lack in other areas.
library(MASS)
parcoord(x = set1, main = "Job applicant parallel coordinate plot", col = 1:nrow(set1))
> #Highlight #39 and 40 that stood out in the stars plot
color.select<-ifelse(test = set1$Applicant == 39 | set1$Applicant == 40, yes = "red", no = "black")
> #Also could use
> #color.select<-c(rep(x = "black", times = 38), "red", "red", rep(x = "black", times = 8))
lwd.select<-ifelse(test = set1$Applicant == 39 | set1$Applicant == 40, yes = 2, no = 1) #Helpful due to overlapping of some lines
parcoord(x = set1, main = "Job applicant parallel coordinate plot (#39 and #40 highlighted in red)", col = color.select, lwd = lwd.select)
library(iplots)
ipcp(vars = set1)
The second plot above shows that #39 and #40 almost always have ratings close to 10, except for APP.
One could do some brushing on the third plot. I used brushing on a few of the variables to determine if large values for a particular variable of interest tended to also coincide with large values for other variables. For example, the suitability of an applicant is likely a very important quality. When I highlight those applicants with ratings of 9 and 10, I obtain the plot below:
We see that a mix of scores can occur with some variables such as academic ability (AA), but a much smaller mix of scores occurs for some areas like lucidity (LC) and honesty (HON).
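For those without the iplots package, a rough non-interactive version of this brushing can be obtained with parcoord() by simply coloring the applicants of interest. Below is a small sketch of this idea (my own illustration, not part of the required solution) that highlights the applicants with SUIT ratings of 9 or 10.
#Sketch of a non-interactive analogue to brushing: highlight applicants with
#  SUIT ratings of 9 or 10 and de-emphasize everyone else
brush.col<-ifelse(test = set1$SUIT >= 9, yes = "red", no = "gray70")
brush.lwd<-ifelse(test = set1$SUIT >= 9, yes = 2, no = 1)
parcoord(x = set1[,-1], col = brush.col, lwd = brush.lwd,
  main = "Applicants with SUIT of 9 or 10 highlighted in red")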
Parallel coordinate plots can be a little more difficult to interpret with this type of data due to the discreteness of the ratings. In particular, one needs to realize that some observations may be hidden due to this discreteness. For example, it is difficult to know how many observations truly are at a level of 10 for honesty (HON). While I did not require students to solve this problem, I did expect students to at least identify it (discreteness problems were discussed in class).
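A quick tabulation (not part of the original solution) shows how many applicants share each honesty rating and therefore how many lines can lie exactly on top of one another:
#Count how many applicants are at each HON rating; tied ratings produce
#  overlapping lines in the parallel coordinate plot
table(set1$HON)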
One possible way to solve the problem described above is to jitter each of the ratings by a small amount. For example, add a simulated value from a normal distribution with mean 0 and standard deviation 0.1 to each rating. Below is an example:
> N<-nrow(set1)
set.seed(7811)
> set1.jit<-set1[,-1] + matrix(data = rnorm(n = N*15, mean = 0, sd = 0.1), nrow = N, ncol = 15)
parcoord(x = set1.jit, main = "Job applicant parallel coordinate plot (#39 and #40 highlighted in red)", col = color.select, lwd = lwd.select)
Due to the jittering, one needs to be careful with judging when one observation has a “higher” rating than another observation very close to it.
One group used my parcoord2() function to plot the data in its original units. This is desirable because each variable is measured on a 0 to 10 scale.
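Because parcoord2() is not shown here, below is a rough sketch of the same idea using matplot(), which keeps the common 0 to 10 scale rather than rescaling each axis to [0, 1] as parcoord() does (my own sketch, not the group's code).
#Sketch of a parallel coordinate style plot in the original units; each line
#  is one applicant, and each position on the x-axis is one criterion
matplot(x = 1:15, y = t(set1[,-1]), type = "l", lty = 1, col = color.select,
  xaxt = "n", ylim = c(0, 10), xlab = "", ylab = "Rating",
  main = "Ratings plotted on the original 0 to 10 scale")
axis(side = 1, at = 1:15, labels = names(set1)[-1], cex.axis = 0.7)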
Trellis plots can be helpful as well. Based on a plot submitted by one group, below is one possibility:
library(lattice)
var.names<-names(set1)[-1] #Variable names
> set1.long<-reshape(data = set1, idvar = "Applicant", varying = var.names, v.names = "score",
    timevar = "Criteria", times = var.names, direction = "long")
row.names(set1.long)<-NULL
set1.long[set1.long$Applicant == 1,] #Applicant #1
Applicant Criteria score
1 1 FL 6
49 1 APP 7
97 1 AA 2
145 1 LA 5
193 1 SC 8
241 1 LC 7
289 1 HON 8
337 1 SMS 8
385 1 EXP 3
433 1 DRV 8
481 1 AMB 9
529 1 GSP 7
577 1 POT 5
625 1 KJ 7
673 1 SUIT 10
> #Next, I need to create a variable that nicely displays the panel titles
set1.long$Applicant2<-as.factor(paste("Applicant #", set1.long$Applicant, sep = ""))
class(set1.long$Applicant2)
[1] "factor"
> #levels(set1.long$Applicant2) #Notice ordering is not ideal, and this ordering will be used by histogram()
set1.long$Applicant3<-factor(x = set1.long$Applicant2, levels = paste("Applicant #", 1:48, sep = "")) #Specifies one ordering to the levels of the factor
levels(set1.long$Applicant3)
[1] "Applicant #1" "Applicant #2" "Applicant #3" "Applicant #4" "Applicant #5" "Applicant #6" "Applicant #7"
[8] "Applicant #8" "Applicant #9" "Applicant #10" "Applicant #11" "Applicant #12" "Applicant #13" "Applicant #14"
[15] "Applicant #15" "Applicant #16" "Applicant #17" "Applicant #18" "Applicant #19" "Applicant #20" "Applicant #21"
[22] "Applicant #22" "Applicant #23" "Applicant #24" "Applicant #25" "Applicant #26" "Applicant #27" "Applicant #28"
[29] "Applicant #29" "Applicant #30" "Applicant #31" "Applicant #32" "Applicant #33" "Applicant #34" "Applicant #35"
[36] "Applicant #36" "Applicant #37" "Applicant #38" "Applicant #39" "Applicant #40" "Applicant #41" "Applicant #42"
[43] "Applicant #43" "Applicant #44" "Applicant #45" "Applicant #46" "Applicant #47" "Applicant #48"
win.graph(width = 11)
histogram(x = ~ score | Applicant3, data = set1.long, type = "count", layout = c(8,6),
  xlab = "Score", main = "Histograms of scores by applicant")
This plot helps to show that Applicants #39 and 40 obtain mostly 10’s.
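As a simple numerical companion to these plots (not required for the problem), the mean rating per applicant can be computed; one would expect #39 and #40 to appear near the top of this list.
#Mean rating per applicant as a rough overall summary (equal weight to all
#  15 criteria); larger means suggest overall good applicants
mean.score<-rowMeans(set1[,-1])
names(mean.score)<-set1$Applicant
round(sort(mean.score, decreasing = TRUE), 2)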
2)This part involves performing a PCA for the data using the correlation matrix.
a)(3 points) Discuss the positive and negative aspects of using the covariance matrix for a PCA rather than the correlation matrix.
In order to receive full credit for this problem, the answer needs to be in the context of the data being analyzed for the project.
Because all of the variables are measured on the same scale, it may be of interest to use the covariance matrix rather than the correlation matrix.
Below are the standard deviations and variances for the data:
> apply(X = set1[,-1], MARGIN = 2, FUN = sd)
FL APP AA LA SC LC
2.673749 1.966023 1.987550 2.805690 2.418072 3.170048
HON SMS EXP DRV AMB GSP
2.534514 3.439381 3.308529 2.947457 2.935401 3.035254
POT KJ SUIT
3.183443 2.657036 3.300279
> apply(X = set1[,-1], MARGIN = 2, FUN = var)
FL APP AA LA SC LC
7.148936 3.865248 3.950355 7.871897 5.847074 10.049202
HON SMS EXP DRV AMB GSP
6.423759 11.829344 10.946365 8.687500 8.616578 9.212766
POT KJ SUIT
10.134309 7.059840 10.891844
Due to some differences between the above values, we see that some variables will play a larger role in the PCA than others. This may or may not be of interest. For example, is an applicant's salesmanship (SMS) more important than academic ability (AA)? A PCA with the covariance matrix will treat SMS as more important due to its variance being almost three times larger than the variance of AA.
Note that Johnson does the PCA with both the covariance and correlation matrices. Personally, I feel more comfortable with using the correlation matrix in this setting.
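If one did want to see how much the choice matters for this data, the two analyses can be run side by side; a brief sketch is below (the remainder of this project uses the correlation matrix).
#Compare a covariance-based PCA to the correlation-based PCA; with the
#  covariance matrix, high-variance criteria such as SMS contribute more
pca.cov<-princomp(x = set1[,-1], cor = FALSE, scores = TRUE)
pca.cor.check<-princomp(x = set1[,-1], cor = TRUE, scores = TRUE)
summary(pca.cov)        #Proportions of variance with the covariance matrix
summary(pca.cor.check)  #Proportions of variance with the correlation matrix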
b)(5 points) Determine the number of PCs needed when using the correlation matrix.
> #Using x = set1[,-1] can be an easier way to specify the variables when there are a lot of them
pca.cor<-princomp(x = set1[,-1], cor = TRUE, scores = TRUE)
summary(pca.cor, loadings = TRUE, cutoff = 0.0)
Importance of components:
Comp.1 Comp.2 Comp.3
Standard deviation 2.7411301 1.4339809 1.20657345
Proportion of Variance 0.5009196 0.1370867 0.09705463
Cumulative Proportion 0.5009196 0.6380064 0.73506099
Comp.4 Comp.5 Comp.6
Standard deviation 1.09448513 0.85973985 0.70326316
Proportion of Variance 0.07985985 0.04927684 0.03297194
Cumulative Proportion 0.81492084 0.86419768 0.89716961
Comp.7 Comp.8 Comp.9
Standard deviation 0.59267346 0.55668844 0.50691374
Proportion of Variance 0.02341746 0.02066013 0.01713077
Cumulative Proportion 0.92058707 0.94124720 0.95837797
Comp.10 Comp.11 Comp.12
Standard deviation 0.43001206 0.39074335 0.312350893
Proportion of Variance 0.01232736 0.01017869 0.006504205
Cumulative Proportion 0.97070533 0.98088402 0.987388228
Comp.13 Comp.14 Comp.15
Standard deviation 0.298024834 0.254230665 0.189009390
Proportion of Variance 0.005921253 0.004308882 0.002381637
Cumulative Proportion 0.993309481 0.997618363 1.000000000
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
FL -0.162 0.429 -0.315 -0.094 0.114 0.621 0.171
APP -0.213 -0.035 0.023 0.262 0.870 -0.038 -0.010
AA -0.040 0.237 0.430 0.636 -0.213 0.223 0.311
LA -0.225 -0.130 -0.466 0.345 -0.129 0.112 -0.131
SC -0.290 -0.249 0.241 -0.173 0.005 0.020 0.143
LC -0.315 -0.131 0.150 -0.071 -0.207 0.175 -0.515
HON -0.158 -0.405 -0.284 0.416 -0.064 -0.304 0.144
SMS -0.324 -0.029 0.186 -0.198 0.037 -0.118 0.010
EXP -0.134 0.553 -0.083 0.068 -0.103 -0.367 -0.113
DRV -0.315 0.046 0.080 -0.156 -0.201 -0.250 0.490
AMB -0.318 -0.068 0.209 -0.199 0.163 0.113 0.201
GSP -0.331 -0.023 0.117 0.075 -0.082 0.148 -0.408
POT -0.333 0.022 0.073 0.188 -0.127 0.059 -0.016
KJ -0.259 -0.082 -0.467 -0.201 -0.112 0.075 0.247
SUIT -0.236 0.421 -0.089 -0.020 0.081 -0.414 -0.173
Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
FL -0.155 0.411 0.033 -0.121 -0.185 0.032 0.091
APP 0.009 -0.071 0.014 0.114 -0.004 -0.314 -0.087
AA 0.043 -0.065 0.290 0.184 0.174 0.038 -0.011
LA 0.308 -0.369 0.172 -0.417 0.025 -0.059 0.149
SC -0.386 0.126 0.226 -0.423 0.339 0.001 -0.380
LC -0.024 0.124 0.250 -0.025 0.002 -0.421 0.113
HON -0.344 0.424 -0.007 0.051 -0.145 0.202 0.217
SMS 0.142 -0.083 0.478 0.171 -0.633 0.299 -0.032
EXP -0.584 -0.348 -0.012 -0.082 -0.126 -0.138 0.046
DRV 0.256 0.157 -0.199 0.026 -0.062 -0.558 0.225
AMB -0.041 -0.336 -0.209 -0.179 0.181 0.378 0.524
GSP -0.106 0.082 -0.335 0.450 0.163 0.109 0.161
POT 0.149 -0.051 -0.555 -0.170 -0.286 0.131 -0.557
KJ -0.052 -0.275 0.144 0.525 0.286 0.009 -0.301
SUIT 0.382 0.353 0.138 -0.083 0.396 0.292 -0.051
Comp.15
FL -0.028
APP 0.022
AA 0.068
LA -0.298
SC -0.302
LC 0.499
HON 0.173
SMS -0.171
EXP -0.001
DRV -0.171
AMB 0.302
GSP -0.527
POT 0.238
KJ 0.203
SUIT 0.099
plot(pca.cor, type = "lines", main = "Scree plot for job applicant data")
Notice the following:
- There are four PCs with eigenvalues greater than 1
- The scree plot tends to level off after four PCs
- 73.5% of the total variability in the data is accounted for by 3 PCs; 81.5% of the total variability in the data is accounted for by 4 PCs
At least three PCs are needed due to the amount of variability being explained. Four PCs may be enough to use.
It is also important to note that the first PC accounts for 50% of the total variation! Thus, only one PC is needed to account for roughly half of the “information” in the data. Given the interpretation of PC #1 (see next part), this component is very important to examine when judging whether or not to hire an applicant.
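The eigenvalues mentioned above can be pulled directly from the princomp object; below is a short check of the "eigenvalue greater than 1" rule.
#With the correlation matrix, the squared standard deviations are the
#  eigenvalues; a PC with eigenvalue > 1 accounts for more variance than one
#  standardized original variable
eig.val<-pca.cor$sdev^2
round(eig.val, 2)
sum(eig.val > 1)  #Number of PCs with eigenvalue greater than 1
round(cumsum(eig.val)/sum(eig.val), 3)  #Cumulative proportion of total variance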
c)(8 points) Interpret the PCs chosen from b). Make sure to specifically comment on whether positive or negative scores (or scores close to 0) for a PC would likely be preferred by the firm.
To help see which of the loadings are “away from zero”, one could set the cutoff argument value in summary() to something other than 0.0. For example, below are some of the results when using a value of 0.2.
summary(pca.cor, loadings = TRUE, cutoff = 0.2)
Importance of components:
<OUTPUT EDITED>
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
FL 0.429 -0.315 0.621
APP -0.213 0.262 0.870
AA 0.237 0.430 0.636 -0.213 0.223 0.311
LA -0.225 -0.466 0.345
SC -0.290 -0.249 0.241
LC -0.315 -0.207 -0.515
HON -0.405 -0.284 0.416 -0.304
SMS -0.324
EXP 0.553 -0.367
DRV -0.315 -0.201 -0.250 0.490
AMB -0.318 0.209 0.201
GSP -0.331 -0.408
POT -0.333
KJ -0.259 -0.467 -0.201 0.247
SUIT -0.236 0.421 -0.414
<OUTPUT EDITED>
Large values for each of the fifteen original variables would likely be preferred. For example, the larger SUIT is, the more suitable an applicant would be for the firm. This type of thinking then needs to be taken into account when interpreting the PCs. Below are my interpretations:
PC #1: All of the loadings are negative. Thus, the smaller the value of PC #1, the better the applicant. This PC appears to be an overall measure of the applicant.
PC #2: This appears to be a contrast between FL, AA, EXP, SUIT and SC, HON. The interpretation here is difficult. Notice that SC and HON appear to be character issues. AA and EXP are easily measurable qualities (e.g., received all A's in school or has 10 years of prior experience). Perhaps FL is easily measurable too and reflects experience (e.g., are there misspellings? Is it organized professionally?). Excluding SUIT then, the PC could be a contrast between measurable aspects and more difficult to measure aspects. It would be nice to know more about what SUIT really is in order to better interpret the PC.
PC #3: This appears to be a contrast between AA, SC, AMB and FL, LA, KJ. The interpretation here is again difficult. People who have a lot of self-confidence (SC) and ambition (AMB) are likely to do well in school (large AA). A person who is likable (LA) may also come across well in the form of their letter of application (FL). I am not sure how KJ factors in with LA and FL.
PC #4: This appears to be a contrast between APP, AA, LA, HON and KJ. The interpretation here is again difficult. Note that 64-bit R users obtained an eigenvector that is -1 times the eigenvector given above.
With respect to PCs #2 – #4, it is not necessarily true that large positive or small negative scores would be desirable. A later part in this project will address this more.
Overall, you can see how it can be difficult to interpret PCs. A subject matter researcher could likely judge better why particular linear combinations of the original variables make sense.
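A simple visual aid for these interpretations (not part of the graded solution) is to plot the loadings themselves; bars well away from zero are the ones driving each interpretation.
#Barplots of the loadings for the first four PCs
par(mfrow = c(2, 2))
for (j in 1:4) {
  barplot(height = pca.cor$loadings[,j], main = paste("PC #", j, " loadings", sep = ""),
    las = 2, ylim = c(-1, 1))
}
par(mfrow = c(1, 1))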
d)(5 points) Examine plots of the PC scores and interpret them in the context of the problem. For example, what do you think of applicant #42?
Below are a number of 3D and 4D plots:
> #Need to change the scale component of pca.cor list
pca.cor$scale<-apply(X = set1[,-1], MARGIN = 2, FUN = sd)
score.cor<-predict(pca.cor, newdata = set1)
> #head(score.cor)
> PC3.positive<-score.cor[,3] - min(score.cor[,3]) #Bubble needs to contain all values > 0
common.limits<-c(min(score.cor[,1:2]), max(score.cor[,1:2]))
> #Different colors for positive and negative PC #3
> pos.PC3<-score.cor[,3]>0
col.symbol<-ifelse(test = score.cor[,3]>0, yes = "red", no = "blue")
symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.positive, xlab = "PC #1",
  ylab = "PC #2", main = "Principal components", inches = 0.5, xlim = common.limits,
  ylim = common.limits, panel.first = grid(col = "lightgray", lty = "dotted"), fg = col.symbol)
text(x = score.cor[,1], y = score.cor[,2])
abline(h = 0)
abline(v = 0)
> #Due to the overlapping of some points, I constructed the same plot as above but now using identify() to identify particular points.
symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.positive, xlab = "PC #1",
  ylab = "PC #2", main = "Principal components", inches = 0.5, xlim = common.limits,
  ylim = common.limits, panel.first = grid(col = "lightgray", lty = "dotted"), fg = col.symbol)
> identify(x = score.cor[,1], y = score.cor[,2])
[1] 2 7 8 22 23 24 39 40
abline(h = 0)
abline(v = 0)
library(rgl)
> plot3d(x = score.cor[,1], y = score.cor[,2], z = score.cor[,3], xlab = "PC #1", ylab = "PC #2",
    zlab = "PC #3", type = "h", xlim = common.limits, ylim = common.limits)
> plot3d(x = score.cor[,1], y = score.cor[,2], z = score.cor[,3], add = TRUE, col = "red", size = 6)
> persp3d(x = common.limits, y = common.limits, z = matrix(data = c(0,0,0,0), nrow = 2, ncol = 2),
    add = TRUE, col = "green")
grid3d(side = c("x", "y", "z"), col = "lightgray")
> score.cor2<-data.frame(Applicant = 1:N, score.cor[,1:4])
> parcoord(x = score.cor2, main = "PC parallel coordinate plot (#39 and #40 highlighted in red)", col = color.select, lwd = lwd.select)
Again, PC #1 appears to be an overall measure of the applicant. Because this is the first PC, we see the most variability among all PCs.
Applicant #42 has the largest PC #1 score and the largest PC #2 score. This indicates that his/her ratings are low overall, while also having a large contrast between FL, AA, EXP, SUIT and SC, HON (i.e., he/she has experience, a well-formed letter of application, and suitability, but rates low in areas such as self-confidence and honesty).
Applicants #39 and #40 stand out as having the smallest PC #1 score. Using this PC alone, they would appear to be the best applicants.
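These statements can be verified directly from the scores; a quick check (not in the original solution):
#Applicants with the most extreme PC #1 and PC #2 scores
set1$Applicant[which.min(score.cor[,1])]  #Best overall by PC #1 (most negative score)
set1$Applicant[which.max(score.cor[,1])]  #Worst overall by PC #1 (largest score)
set1$Applicant[which.max(score.cor[,2])]  #Largest PC #2 score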
Note that identifying the grouping of observations was not necessary for this problem (it was for the goblet data due to its stated goal).
e)(5 points) Suppose a late applicant submits his/her application after the PCA has been completed. The applicant receives 10's for all 15 original variables! Using the previous PCA results, discuss how this particular individual would compare to the other applicants.
I chose these ratings for the new applicant because this would seem to be a “perfect” applicant. I put the applicant’s ratings into a new data frame and predicted the PC scores:
new.app<-data.frame(FL = 10, APP = 10, AA = 10, LA = 10, SC = 10,
LC = 10, HON = 10, SMS = 10, EXP = 10, DRV = 10, AMB = 10, GSP = 10,
POT = 10, KJ = 10, SUIT = 10)
new.PC<-predict(pca.cor, newdata = new.app)
new.PC
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
[1,] -5.022355 1.257912 -0.2854768 0.9633 0.1075786
Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
[1,] 0.1703098 0.7593572 -0.5198497 -0.397986 0.6378478
Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
[1,] 0.1678422 0.08579758 -0.2559453 0.001761864 0.02957071
We can see that this applicant has the smallest PC #1 value. Notice that PC #2 – #4 are not necessarily the largest or smallest among the other individuals.
This set of scores indicates the types of scores we want to look for with respect to the other applicants in order to choose who to hire.
f)(10 points) If the overall goal is to find the best individuals to hire, where each of the 15 criteria is given equal weight, suggest which applicants are the best. Remember that it is most desirable to score as high as possible on the fifteen variables, so you need to take this into account when using the PCA to make your judgments.
order(score.cor[,1])
[1] 40 39 23 8 22 2 24 7 9 10 3 16 44 12 20 1 46 17
[19] 11 21 6 27 38 5 37 45 13 14 4 18 26 19 15 32 31 36
[37] 25 33 30 43 34 35 41 28 47 29 48 42
score.cor[c(39,40),1:4]
Comp.1 Comp.2 Comp.3 Comp.4
[1,] -4.231745 1.359829 -0.7904450 0.3170062
[2,] -4.448426 1.313393 -0.6482829 0.1812216
symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.positive, xlab = "PC #1",
  ylab = "PC #2", main = "Principal components", inches = 0.5, xlim = c(-6, 6), ylim = c(-6, 6),
  panel.first = grid(col = "lightgray", lty = "dotted"), fg = col.symbol)
> identify(x = score.cor[,1], y = score.cor[,2])
[1] 2 7 8 22 23 24 39 40
abline(h = 0)
abline(v = 0)
text(x = new.PC[,1], y = new.PC[, 2], labels = "YES!")
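One additional, optional way to combine the retained PCs is to measure how close each applicant's first four PC scores are to the "perfect" applicant's scores from part e); smaller distances then suggest better applicants. Below is a sketch of this idea (not required for the problem).
#Euclidean distance from each applicant's first four PC scores to the "perfect"
#  applicant's scores; smaller distances indicate applicants closer to the ideal
ideal<-matrix(data = new.PC[1,1:4], nrow = N, ncol = 4, byrow = TRUE)
dist.to.ideal<-sqrt(rowSums((score.cor[,1:4] - ideal)^2))
rank.applicants<-data.frame(Applicant = set1$Applicant, dist.to.ideal)
head(rank.applicants[order(dist.to.ideal),], n = 10)  #Ten closest applicants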