Project #1 Answers
STAT 873
Fall 2013
Complete the following problems below. Within each part, include your R program output with code inside of it and any additional information needed to explain your answer. Note that you will need to edit your output and code in order to make it look nice after you copy and paste it into your Word document.
This problem is partially based on an example in Chapters 2 and 5 of Johnson (1998). Below is the description of the data from Johnson:
Forty-eight individuals who had applied for a job with a large firm were interviewed and rated on 15 criteria. Individuals were rated on
- The form of their letter of application (FL)
- Their appearance (APP)
- Their academic ability (AA)
- Their likability (LA)
- Their self-confidence (SC)
- Their lucidity (LC)
- Their honesty (HON)
- Their salesmanship (SMS)
- Their experience (EXP)
- Their drive (DRV)
- Their ambition (AMB)
- Their ability to grasp concepts (GSP)
- Their potential (POT)
- Their keenness to join (KJ)
- Their suitability (SUIT)
Each criterion was evaluated on an integer scale ranging from 0 to 10 with 0 being a very low and very unsatisfactory rating, and 10 being a very high rating.
The data is in the file job_applicant.csv which is available on my website. Using this data, complete the following.
1)(10 points) Examine appropriate plots of the data and interpret them in the context of the problem. In your interpretations, make sure to specifically indicate which individuals are overall good applicants.
I used 32-bit R to complete this project.
In order to represent all 15 variables on a plot at once, the most appropriate plots are star plots and parallel coordinate plots.
> set1<-read.table(file = "C:\\chris\\job_applicant.csv", header = TRUE, sep = ",")
head(set1)
Applicant FL APP AA LA SC LC HON SMS EXP DRV AMB GSP POT
1 1 6 7 2 5 8 7 8 8 3 8 9 7 5
2 2 9 10 5 8 10 9 9 10 5 9 9 8 8
3 3 7 8 3 6 9 8 9 7 4 9 9 8 6
4 4 5 6 8 5 6 5 9 2 8 4 5 8 7
5 5 6 8 8 8 4 4 9 5 8 5 5 8 8
6 6 7 7 7 6 8 7 10 5 9 6 5 8 6
KJ SUIT
1 7 10
2 8 10
3 8 10
4 6 5
5 7 7
6 6 6
win.graph(width = 11, height = 7)
stars(x = set1[,-1], draw.segments = TRUE, key.loc = c(20,10), main = "Job applicant star plot", labels = set1$Applicant)
Because larger ratings for each criterion are better, we are looking for stars with large rays extending out from the center. Stars with this quality include #39 and #40, for whom only appearance (APP) is somewhat middling in comparison to the others. Applicants 2, 7, 8, 9, 22, 23, and 24 tend to have most of their rays being large. Applicants 28 and 29 tend to have the smallest rays overall, so they may be the least desirable to hire. There are a number of other individuals with small values for some criteria but large values in other areas. For example, applicants 41 and 42 stand out as having a lot of experience (EXP), a well-formed letter of application (FL), high suitability (SUIT), and good academic ability (AA), but they lack in other areas.
library(MASS)
parcoord(x = set1, main = "Job applicant parallel coordinate plot", col = 1:nrow(set1))
> #Highlight #39 and 40 that stood out in the stars plot
color.select<-ifelse(test = set1$Applicant == 39 | set1$Applicant == 40, yes = "red", no = "black")
> #Also could use
> #color.select<-c(rep(x = "black", times = 38), "red", "red", rep(x = "black", times = 8))
lwd.select<-ifelse(test = set1$Applicant == 39 | set1$Applicant == 40, yes = 2, no = 1) #Helpful due to overlapping of some lines
parcoord(x = set1, main = "Job applicant parallel coordinate plot (#39 and #40 highlighted in red)", col = color.select, lwd = lwd.select)
library(iplots)
ipcp(vars = set1)
The second plot above shows that #39 and #40 almost always have ratings close to 10, except for APP.
One could do some brushing on the third plot. I used brushing on a few of the variables to determine if large values for a particular variable of interest tended to also coincide with large values for other variables. For example, the suitability of an applicant is likely a very important quality. When I highlight those applicants with ratings of 9 and 10, I obtain the plot below:
We see that a mix of scores can occur with some variables such as academic ability (AA), but a much smaller mix of scores occurs for some areas like lucidity (LC) and honesty (HON).
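For those without the iplots package, a rough non-interactive version of this brushing can be obtained with parcoord() by simply coloring the applicants of interest. Below is a small sketch of this idea (my own illustration, not part of the required solution) that highlights the applicants with SUIT ratings of 9 or 10.
#Sketch of a non-interactive analogue to brushing: highlight applicants with
#  SUIT ratings of 9 or 10 and de-emphasize everyone else
brush.col<-ifelse(test = set1$SUIT >= 9, yes = "red", no = "gray70")
brush.lwd<-ifelse(test = set1$SUIT >= 9, yes = 2, no = 1)
parcoord(x = set1[,-1], col = brush.col, lwd = brush.lwd,
  main = "Applicants with SUIT of 9 or 10 highlighted in red")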
Parallel coordinate plots can be a little more difficult to interpret with this type of data due to the discreteness of the ratings. In particular, one needs to realize that some observations may be hidden due to this discreteness. For example, it is difficult to know how many observations truly are at a level of 10 for honesty (HON). While I did not require students to solve this problem, I did expect students to at least identify it (discreteness problems were discussed in class).
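A quick tabulation (not part of the original solution) shows how many applicants share each honesty rating and therefore how many lines can lie exactly on top of one another:
#Count how many applicants are at each HON rating; tied ratings produce
#  overlapping lines in the parallel coordinate plot
table(set1$HON)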
One possible way to solve the problem described above is to jitter each of the ratings by a small amount. For example, add a simulated value from a normal distribution with mean 0 and standard deviation 0.1 to each rating. Below is an example:
> N<-nrow(set1)
set.seed(7811)
> set1.jit<-set1[,-1] + matrix(data = rnorm(n = N*15, mean = 0, sd = 0.1), nrow = N, ncol = 15)
parcoord(x = set1.jit, main = "Job applicant parallel coordinate plot (#39 and #40 highlighted in red)", col = color.select, lwd = lwd.select)
Due to the jittering, one needs to be careful with judging when one observation has a “higher” rating than another observation very close to it.
One group used my parcoord2() function to plot the data in its original units. This is desirable because each variable is measured on a 0 to 10 scale.
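Because parcoord2() is not shown here, below is a rough sketch of the same idea using matplot(), which keeps the common 0 to 10 scale rather than rescaling each axis to [0, 1] as parcoord() does (my own sketch, not the group's code).
#Sketch of a parallel coordinate style plot in the original units; each line
#  is one applicant, and each position on the x-axis is one criterion
matplot(x = 1:15, y = t(set1[,-1]), type = "l", lty = 1, col = color.select,
  xaxt = "n", ylim = c(0, 10), xlab = "", ylab = "Rating",
  main = "Ratings plotted on the original 0 to 10 scale")
axis(side = 1, at = 1:15, labels = names(set1)[-1], cex.axis = 0.7)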
Trellis plots can be helpful as well. Based on a plot submitted by one group, below is one possibility:
library(lattice)
var.names<-names(set1)[-1] #Variable names
> set1.long<-reshape(data = set1, idvar = "Applicant", varying = var.names, v.names = "score",
    timevar = "Criteria", times = var.names, direction = "long")
row.names(set1.long)<-NULL
set1.long[set1.long$Applicant == 1,] #Applicant #1
Applicant Criteria score
1 1 FL 6
49 1 APP 7
97 1 AA 2
145 1 LA 5
193 1 SC 8
241 1 LC 7
289 1 HON 8
337 1 SMS 8
385 1 EXP 3
433 1 DRV 8
481 1 AMB 9
529 1 GSP 7
577 1 POT 5
625 1 KJ 7
673 1 SUIT 10
> #Next, I need to create a variable that nicely displays the panel titles
set1.long$Applicant2<-as.factor(paste("Applicant #", set1.long$Applicant, sep = ""))
class(set1.long$Applicant2)
[1] "factor"
> #levels(set1.long$Applicant2) #Notice ordering is not ideal, and this ordering will be used by histogram()
set1.long$Applicant3<-factor(x = set1.long$Applicant2, levels = paste("Applicant #", 1:48, sep = "")) #Specifies one ordering to the levels of the factor
levels(set1.long$Applicant3)
[1] "Applicant #1" "Applicant #2" "Applicant #3" "Applicant #4" "Applicant #5" "Applicant #6" "Applicant #7"
[8] "Applicant #8" "Applicant #9" "Applicant #10" "Applicant #11" "Applicant #12" "Applicant #13" "Applicant #14"
[15] "Applicant #15" "Applicant #16" "Applicant #17" "Applicant #18" "Applicant #19" "Applicant #20" "Applicant #21"
[22] "Applicant #22" "Applicant #23" "Applicant #24" "Applicant #25" "Applicant #26" "Applicant #27" "Applicant #28"
[29] "Applicant #29" "Applicant #30" "Applicant #31" "Applicant #32" "Applicant #33" "Applicant #34" "Applicant #35"
[36] "Applicant #36" "Applicant #37" "Applicant #38" "Applicant #39" "Applicant #40" "Applicant #41" "Applicant #42"
[43] "Applicant #43" "Applicant #44" "Applicant #45" "Applicant #46" "Applicant #47" "Applicant #48"
win.graph(width = 11)
histogram(x = ~ score | Applicant3, data = set1.long, type = "count", layout = c(8,6),
  xlab = "Score", main = "Histograms of scores by applicant")
This plot helps to show that Applicants #39 and 40 obtain mostly 10’s.
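As a simple numerical companion to these plots (not required for the problem), the mean rating per applicant can be computed; one would expect #39 and #40 to appear near the top of this list.
#Mean rating per applicant as a rough overall summary (equal weight to all
#  15 criteria); larger means suggest overall good applicants
mean.score<-rowMeans(set1[,-1])
names(mean.score)<-set1$Applicant
round(sort(mean.score, decreasing = TRUE), 2)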
2)This part involves performing a PCA for the data using the correlation matrix.
a)(3 points) Discuss the positive and negative aspects of using the covariance matrix for a PCA rather than the correlation matrix.
In order to receive full credit for this problem, the answer needs to be in the context of the data being analyzed for the project.
Because all of the variables are measured on the same scale, it may be of interest to use the covariance matrix rather than the correlation matrix.
Below are the standard deviations and variances for the data:
> apply(X = set1[,-1], MARGIN = 2, FUN = sd)
FL APP AA LA SC LC
2.673749 1.966023 1.987550 2.805690 2.418072 3.170048
HON SMS EXP DRV AMB GSP
2.534514 3.439381 3.308529 2.947457 2.935401 3.035254
POT KJ SUIT
3.183443 2.657036 3.300279
> apply(X = set1[,-1], MARGIN = 2, FUN = var)
FL APP AA LA SC LC
7.148936 3.865248 3.950355 7.871897 5.847074 10.049202
HON SMS EXP DRV AMB GSP
6.423759 11.829344 10.946365 8.687500 8.616578 9.212766
POT KJ SUIT
10.134309 7.059840 10.891844
Due to some differences between the above values, we see that some variables will play a larger role in the PCA than others. This may or may not be of interest. For example, is an applicant's salesmanship (SMS) more important than academic ability (AA)? A PCA with the covariance matrix will treat SMS as more important due to its variance being almost three times larger than the variance of AA.
Note that Johnson does the PCA with both the covariance and correlation matrices. Personally, I feel more comfortable with using the correlation matrix in this setting.
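If one did want to see how much the choice matters for this data, the two analyses can be run side by side; a brief sketch is below (the remainder of this project uses the correlation matrix).
#Compare a covariance-based PCA to the correlation-based PCA; with the
#  covariance matrix, high-variance criteria such as SMS contribute more
pca.cov<-princomp(x = set1[,-1], cor = FALSE, scores = TRUE)
pca.cor.check<-princomp(x = set1[,-1], cor = TRUE, scores = TRUE)
summary(pca.cov)        #Proportions of variance with the covariance matrix
summary(pca.cor.check)  #Proportions of variance with the correlation matrix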
b)(5 points) Determine the number of PCs needed when using the correlation matrix.
> #Using x = set1[,-1] can be an easier way to specify the variables when there are a lot of them
pca.cor<-princomp(x = set1[,-1], cor = TRUE, scores = TRUE)
summary(pca.cor, loadings = TRUE, cutoff = 0.0)
Importance of components:
Comp.1 Comp.2 Comp.3
Standard deviation 2.7411301 1.4339809 1.20657345
Proportion of Variance 0.5009196 0.1370867 0.09705463
Cumulative Proportion 0.5009196 0.6380064 0.73506099
Comp.4 Comp.5 Comp.6
Standard deviation 1.09448513 0.85973985 0.70326316
Proportion of Variance 0.07985985 0.04927684 0.03297194
Cumulative Proportion 0.81492084 0.86419768 0.89716961
Comp.7 Comp.8 Comp.9
Standard deviation 0.59267346 0.55668844 0.50691374
Proportion of Variance 0.02341746 0.02066013 0.01713077
Cumulative Proportion 0.92058707 0.94124720 0.95837797
Comp.10 Comp.11 Comp.12
Standard deviation 0.43001206 0.39074335 0.312350893
Proportion of Variance 0.01232736 0.01017869 0.006504205
Cumulative Proportion 0.97070533 0.98088402 0.987388228
Comp.13 Comp.14 Comp.15
Standard deviation 0.298024834 0.254230665 0.189009390
Proportion of Variance 0.005921253 0.004308882 0.002381637
Cumulative Proportion 0.993309481 0.997618363 1.000000000
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
FL -0.162 0.429 -0.315 -0.094 0.114 0.621 0.171
APP -0.213 -0.035 0.023 0.262 0.870 -0.038 -0.010
AA -0.040 0.237 0.430 0.636 -0.213 0.223 0.311
LA -0.225 -0.130 -0.466 0.345 -0.129 0.112 -0.131
SC -0.290 -0.249 0.241 -0.173 0.005 0.020 0.143
LC -0.315 -0.131 0.150 -0.071 -0.207 0.175 -0.515
HON -0.158 -0.405 -0.284 0.416 -0.064 -0.304 0.144
SMS -0.324 -0.029 0.186 -0.198 0.037 -0.118 0.010
EXP -0.134 0.553 -0.083 0.068 -0.103 -0.367 -0.113
DRV -0.315 0.046 0.080 -0.156 -0.201 -0.250 0.490
AMB -0.318 -0.068 0.209 -0.199 0.163 0.113 0.201
GSP -0.331 -0.023 0.117 0.075 -0.082 0.148 -0.408
POT -0.333 0.022 0.073 0.188 -0.127 0.059 -0.016
KJ -0.259 -0.082 -0.467 -0.201 -0.112 0.075 0.247
SUIT -0.236 0.421 -0.089 -0.020 0.081 -0.414 -0.173
Comp.8 Comp.9 Comp.10 Comp.11 Comp.12 Comp.13 Comp.14
FL -0.155 0.411 0.033 -0.121 -0.185 0.032 0.091
APP 0.009 -0.071 0.014 0.114 -0.004 -0.314 -0.087
AA 0.043 -0.065 0.290 0.184 0.174 0.038 -0.011
LA 0.308 -0.369 0.172 -0.417 0.025 -0.059 0.149
SC -0.386 0.126 0.226 -0.423 0.339 0.001 -0.380
LC -0.024 0.124 0.250 -0.025 0.002 -0.421 0.113
HON -0.344 0.424 -0.007 0.051 -0.145 0.202 0.217
SMS 0.142 -0.083 0.478 0.171 -0.633 0.299 -0.032
EXP -0.584 -0.348 -0.012 -0.082 -0.126 -0.138 0.046
DRV 0.256 0.157 -0.199 0.026 -0.062 -0.558 0.225
AMB -0.041 -0.336 -0.209 -0.179 0.181 0.378 0.524
GSP -0.106 0.082 -0.335 0.450 0.163 0.109 0.161
POT 0.149 -0.051 -0.555 -0.170 -0.286 0.131 -0.557
KJ -0.052 -0.275 0.144 0.525 0.286 0.009 -0.301
SUIT 0.382 0.353 0.138 -0.083 0.396 0.292 -0.051
Comp.15
FL -0.028
APP 0.022
AA 0.068
LA -0.298
SC -0.302
LC 0.499
HON 0.173
SMS -0.171
EXP -0.001
DRV -0.171
AMB 0.302
GSP -0.527
POT 0.238
KJ 0.203
SUIT 0.099
plot(pca.cor, type = "lines", main = "Scree plot for job applicant data")
Notice the following:
- There are four PCs with eigenvalues greater than 1
- The scree plot tends to level off after four PCs
- 73.5% of the total variability in the data is accounted for by 3 PCs; 81.5% of the total variability in the data is accounted for by 4 PCs
At least three PCs are needed due to the amount of variability being explained. Four PCs may be enough to use.
It is also important to note that the first PC accounts for 50% of the total variation! Thus, only one PC is needed to account for roughly half of the “information” in the data. Given the interpretation of PC #1 (see next part), this component is very important to examine when judging whether or not to hire an applicant.
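The eigenvalues mentioned above can be pulled directly from the princomp object; below is a short check of the "eigenvalue greater than 1" rule.
#With the correlation matrix, the squared standard deviations are the
#  eigenvalues; a PC with eigenvalue > 1 accounts for more variance than one
#  standardized original variable
eig.val<-pca.cor$sdev^2
round(eig.val, 2)
sum(eig.val > 1)  #Number of PCs with eigenvalue greater than 1
round(cumsum(eig.val)/sum(eig.val), 3)  #Cumulative proportion of total variance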
c)(8 points) Interpret the PCs chosen from b). Make sure to specifically comment on whether positive or negative scores (or scores close to 0) for a PC would likely be preferred by the firm.
To help see which of the loadings are “away from zero”, one could set the cutoff argument value in summary() to something other than 0.0. For example, below are some of the results when using a value of 0.2.
summary(pca.cor, loadings = TRUE, cutoff = 0.2)
Importance of components:
<OUTPUT EDITED>
Loadings:
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7
FL 0.429 -0.315 0.621
APP -0.213 0.262 0.870
AA 0.237 0.430 0.636 -0.213 0.223 0.311
LA -0.225 -0.466 0.345
SC -0.290 -0.249 0.241
LC -0.315 -0.207 -0.515
HON -0.405 -0.284 0.416 -0.304
SMS -0.324
EXP 0.553 -0.367
DRV -0.315 -0.201 -0.250 0.490
AMB -0.318 0.209 0.201
GSP -0.331 -0.408
POT -0.333
KJ -0.259 -0.467 -0.201 0.247
SUIT -0.236 0.421 -0.414
<OUTPUT EDITED>
Large values for each of the fifteen original variables would likely be preferred. For example, the larger SUIT is, the more suitable an applicant would be for the firm. This type of thinking then needs to be taken into account when interpreting the PCs. Below are my interpretations:
PC #1: All of the loadings are negative. Thus, the smaller the value of PC #1, the better the applicant. This PC appears to be an overall measure of the applicant.
PC #2: This appears to be a contrast between FL, AA, EXP, SUIT and SC, HON. The interpretation here is difficult. Notice that SC and HON appear to be character issues. AA and EXP are easily measurable qualities (e.g., received all A's in school or has 10 years of prior experience). Perhaps FL is easily measurable too and reflects experience (e.g., are there misspellings? Is it organized professionally?). Excluding SUIT then, the PC could be a contrast between measurable aspects and more difficult to measure aspects. It would be nice to know more about what SUIT really is in order to better interpret the PC.
PC #3: This appears to be a contrast between AA, SC, AMB and FL, LA, KJ. The interpretation here is again difficult. People who have a lot of self-confidence (SC) and ambition (AMB) are likely to do well in school (large AA). A person who is likable (LA) may also come across well in the form of their letter of application (FL). I am not sure how KJ factors in with LA and FL.
PC #4: This appears to be a contrast between APP, AA, LA, HON and KJ. The interpretation here is again difficult. Note that 64-bit R users obtained an eigenvector that is -1 times the eigenvector given above.
With respect to PCs #2 – #4, it is not necessarily true that large positive or small negative scores would be desirable. A later part in this project will address this more.
Overall, you can see how it can be difficult to interpret PCs. A subject matter researcher could likely judge better why particular linear combinations of the original variables make sense.
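A simple visual aid for these interpretations (not part of the graded solution) is to plot the loadings themselves; bars well away from zero are the ones driving each interpretation.
#Barplots of the loadings for the first four PCs
par(mfrow = c(2, 2))
for (j in 1:4) {
  barplot(height = pca.cor$loadings[,j], main = paste("PC #", j, " loadings", sep = ""),
    las = 2, ylim = c(-1, 1))
}
par(mfrow = c(1, 1))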
d)(5 points) Examine plots of the PC scores and interpret them in the context of the problem. For example, what do you think of applicant #42?
Below are a number of 3D and 4D plots:
> #Need to change the scale component of pca.cor list
pca.cor$scale<-apply(X = set1[,-1], MARGIN = 2, FUN = sd)
score.cor<-predict(pca.cor, newdata = set1)
> #head(score.cor)
> PC3.positive<-score.cor[,3] - min(score.cor[,3]) #Bubble needs to contain all values > 0
common.limits<-c(min(score.cor[,1:2]), max(score.cor[,1:2]))
> #Different colors for positive and negative PC #3
> pos.PC3<-score.cor[,3]>0
col.symbol<-ifelse(test = score.cor[,3]>0, yes = "red", no = "blue")
symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.positive, xlab = "PC #1",
  ylab = "PC #2", main = "Principal components", inches = 0.5, xlim = common.limits,
  ylim = common.limits, panel.first = grid(col = "lightgray", lty = "dotted"), fg = col.symbol)
text(x = score.cor[,1], y = score.cor[,2])
abline(h = 0)
abline(v = 0)
> #Due to the overlapping of some points, I constructed the same plot as above but now using identify() to identify particular points.
symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.positive, xlab = "PC #1",
  ylab = "PC #2", main = "Principal components", inches = 0.5, xlim = common.limits,
  ylim = common.limits, panel.first = grid(col = "lightgray", lty = "dotted"), fg = col.symbol)
> identify(x = score.cor[,1], y = score.cor[,2])
[1] 2 7 8 22 23 24 39 40
abline(h = 0)
abline(v = 0)
library(rgl)
> plot3d(x = score.cor[,1], y = score.cor[,2], z = score.cor[,3], xlab = "PC #1", ylab = "PC #2",
    zlab = "PC #3", type = "h", xlim = common.limits, ylim = common.limits)
> plot3d(x = score.cor[,1], y = score.cor[,2], z = score.cor[,3], add = TRUE, col = "red", size = 6)
> persp3d(x = common.limits, y = common.limits, z = matrix(data = c(0,0,0,0), nrow = 2, ncol = 2),
    add = TRUE, col = "green")
grid3d(side = c("x", "y", "z"), col = "lightgray")
> score.cor2<-data.frame(Applicant = 1:N, score.cor[,1:4])
> parcoord(x = score.cor2, main = "PC parallel coordinate plot (#39 and #40 highlighted in red)", col = color.select, lwd = lwd.select)
Again, PC #1 appears to be an overall measure of the applicant. Because this is the first PC, we see the most variability among all PCs.
Applicant #42 has the largest PC #1 score and the largest PC #2 score. This indicates that his/her ratings are low overall, while also having a large contrast between FL, AA, EXP, SUIT and SC, HON (i.e., he/she has experience, a well-formed letter of application, and suitability, but rates low in areas such as self-confidence and honesty).
Applicants #39 and #40 stand out as having the smallest PC #1 score. Using this PC alone, they would appear to be the best applicants.
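These statements can be verified directly from the scores; a quick check (not in the original solution):
#Applicants with the most extreme PC #1 and PC #2 scores
set1$Applicant[which.min(score.cor[,1])]  #Best overall by PC #1 (most negative score)
set1$Applicant[which.max(score.cor[,1])]  #Worst overall by PC #1 (largest score)
set1$Applicant[which.max(score.cor[,2])]  #Largest PC #2 score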
Note that identifying the grouping of observations was not necessary for this problem (it was for the goblet data due to its stated goal).
e)(5 points) Suppose a late applicant submits his/her application after the PCA has been completed. The applicant receives 10's for all 15 original variables! Using the previous PCA results, discuss how this particular individual would compare to the other applicants.
I chose these ratings for the new applicant because this would seem to be a “perfect” applicant. I put the applicant’s ratings into a new data frame and predicted the PC scores:
new.app<-data.frame(FL = 10, APP = 10, AA = 10, LA = 10, SC = 10,
LC = 10, HON = 10, SMS = 10, EXP = 10, DRV = 10, AMB = 10, GSP = 10,
POT = 10, KJ = 10, SUIT = 10)
new.PC<-predict(pca.cor, newdata = new.app)
new.PC
Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
[1,] -5.022355 1.257912 -0.2854768 0.9633 0.1075786
Comp.6 Comp.7 Comp.8 Comp.9 Comp.10
[1,] 0.1703098 0.7593572 -0.5198497 -0.397986 0.6378478
Comp.11 Comp.12 Comp.13 Comp.14 Comp.15
[1,] 0.1678422 0.08579758 -0.2559453 0.001761864 0.02957071
We can see that this applicant has the smallest PC #1 value. Notice that PC #2 – #4 are not necessarily the largest or smallest among the other individuals.
This set of scores indicates the types of scores we want to look for with respect to the other applicants in order to choose who to hire.
f)(10 points) If the overall goal is to find the best individuals to hire, where each of the 15 criteria is given equal weight, suggest which applicants are the best. Remember that it is most desirable to score as high as possible on the fifteen variables, so you need to take this into account when using the PCA to make your judgments.
order(score.cor[,1])
[1] 40 39 23 8 22 2 24 7 9 10 3 16 44 12 20 1 46 17
[19] 11 21 6 27 38 5 37 45 13 14 4 18 26 19 15 32 31 36
[37] 25 33 30 43 34 35 41 28 47 29 48 42
score.cor[c(39,40),1:4]
Comp.1 Comp.2 Comp.3 Comp.4
[1,] -4.231745 1.359829 -0.7904450 0.3170062
[2,] -4.448426 1.313393 -0.6482829 0.1812216
symbols(x = score.cor[,1], y = score.cor[,2], circles = PC3.positive, xlab = "PC #1",
  ylab = "PC #2", main = "Principal components", inches = 0.5, xlim = c(-6, 6), ylim = c(-6, 6),
  panel.first = grid(col = "lightgray", lty = "dotted"), fg = col.symbol)
> identify(x = score.cor[,1], y = score.cor[,2])
[1] 2 7 8 22 23 24 39 40
abline(h = 0)
abline(v = 0)
text(x = new.PC[,1], y = new.PC[, 2], labels = "YES!")
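One additional, optional way to combine the retained PCs is to measure how close each applicant's first four PC scores are to the "perfect" applicant's scores from part e); smaller distances then suggest better applicants. Below is a sketch of this idea (not required for the problem).
#Euclidean distance from each applicant's first four PC scores to the "perfect"
#  applicant's scores; smaller distances indicate applicants closer to the ideal
ideal<-matrix(data = new.PC[1,1:4], nrow = N, ncol = 4, byrow = TRUE)
dist.to.ideal<-sqrt(rowSums((score.cor[,1:4] - ideal)^2))
rank.applicants<-data.frame(Applicant = set1$Applicant, dist.to.ideal)
head(rank.applicants[order(dist.to.ideal),], n = 10)  #Ten closest applicants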