Wine Dataset
Let’s load some data.
> wine <- read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/wine/wine.data",sep=",")
> wine
Once you have read a data set into R, the next step is usually to make a plot of the data.
> pairs(wine)
The first variable is numeric but only has three distinct values (which you should be able to see from the scatter plot). So let’s ignore this one for now:
> pairs(wine[2:14])
Ok, let’s look at a scatterplot on steroids:
> library(car)
> scatterplotMatrix(wine[2:14])
Let’s focus in on 5 variables:
> scatterplotMatrix(wine[2:6])
> help( scatterplotMatrix )
Ok, you should look at the arguments for this function. From it we can see that we can change the display for the diagonals. The diagonals contain information on a single variable. Try these:
> scatterplotMatrix(wine[2:6], diagonal="boxplot")
> scatterplotMatrix(wine[2:6], diagonal="histogram")
> scatterplotMatrix(wine[2:6], diagonal="qqplot")
If you see an interesting scatterplot for two variables in the matrix scatterplot, you may want to plot that scatterplot in more detail, with the data points labelled by their group (their cultivar in this case).
For example, in the matrix scatterplot above, the cell in the third column of the fourth row down is a scatterplot of V5 (x-axis) against V4 (y-axis). If you look at this scatterplot, it appears that there may be a positive relationship between V5 and V4.
We may therefore decide to examine the relationship between V5 and V4 more closely, by plotting a scatterplot of these two variable, with the data points labelled by their group (their cultivar). To plot a scatterplot of two variables, we can use the “plot” R function. The V4 and V5 variables are stored in the columns V4 and V5 of the variable “wine”, so can be accessed by typing wine$V4 or wine$V5. Therefore, to plot the scatterplot, we type:
> plot(wine$V4, wine$V5)
> text(wine$V4, wine$V5, wine$V1, cex=0.7, pos=4, col="red")
We can see from the scatterplot of V4 versus V5 that the wines from cultivar 2 seem to have lower values of V4 compared to the wines of cultivar 1.
Let’s get some summary statistics using “sapply()”. The “sapply()” function can be used to apply some other function to each column in a data frame, eg. sapply(mydataframe,sd) will calculate the standard deviation of each column in a dataframe “mydataframe”.
> sapply(wine[2:14],mean)
> sapply(wine[2:14],sd)
It is often interesting to calculate the means and standard deviations for just the samples from a particular group, for example, for the wine samples from each cultivar. The cultivar is stored in the column “V1” of the variable “wine”.
> cultivar2wine <- wine[wine$V1=="2",]
> cultivar1wine <- wine[wine$V1=="1",]
> cultivar3wine <- wine[wine$V1=="3",]
> sapply(cultivar2wine[2:14],mean)
> sapply(cultivar1wine[2:14],mean)
> sapply(cultivar3wine[2:14],mean)
If you want to compare different variables that have different units, are very different variances, it is a good idea to first standardize the variables.[1] You can standardize variables in R using the “scale()” function.
> standardisedconcentrations <- as.data.frame(scale(wine[2:14]))
The purpose of principal component analysis is to find the best low-dimensional representation of the variation in a multivariate data set. For example, in the case of the wine data set, we have 13 chemical concentrations describing wine samples from three different cultivars. We can carry out a principal component analysis to investigate whether we can capture most of the variation between samples using a smaller number of new variables (principal components), where each of these new variables is a linear combination of all or some of the 13 chemical concentrations.
> help(prcomp)
> wine.pca <- prcomp(standardisedconcentrations)
You can get a summary of the principal component analysis results using the “summary()” function on the output of “prcomp()”:
> summary(wine.pca)
This gives us the standard deviation of each component, and the proportion of variance explained by each component.
In order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot, which we can do in R using the “screeplot()” function:
> screeplot(wine.pca, type="lines")
The most obvious change in slope in the scree plot occurs at component 4, which is the “elbow” of the scree plot. Therefore, it could be argued based on the basis of the scree plot that the first three components should be retained.
Another way of deciding how many components to retain is to use Kaiser’s criterion: that we should only retain principal components for which the variance is above 1 (when principal component analysis was applied to standardized data). We can check this by finding the variance of each of the principal components:
> (wine.pca$sdev)^2
Which components are above 1?
The values of the principal components are stored in a named element “x” of the variable returned by “prcomp()”. This contains a matrix with the principal components, where the first column in the matrix contains the first principal component, the second column the second component, and so on.
Thus, in our example, “wine.pca$x[,1]” contains the first principal component, and “wine.pca$x[,2]” contains the second principal component.
We can make a scatterplot of the first two principal components, and label the data points with the cultivar that the wine samples come from, by typing:
> plot(wine.pca$x[,1],wine.pca$x[,2])
> text(wine.pca$x[,1],wine.pca$x[,2], wine$V1, cex=0.7, pos=4, col="red")
What does this plot tell us? The scatterplot shows the first principal component on the x-axis, and the second principal component on the y-axis. We can see from the scatterplot that wine samples of cultivar 1 have much lower values of the first principal component than wine samples of cultivar 3. Therefore, the first principal component separates wine samples of cultivars 1 from those of cultivar 3.
We can also see that wine samples of cultivar 2 have much higher values of the second principal component than wine samples of cultivars 1 and 3. Therefore, the second principal component separates samples of cultivar 2 from samples of cultivars 1 and 3.
Therefore, the first two principal components are reasonably useful for distinguishing wine samples of the three different cultivars.
IRIS Dataset
Let’s look PCA with a smaller dataset but one which you have probably seen a few times already (it’s a common data set for testing classification algorithms etc.).
> data(iris)
> head( iris )
> scatterplotMatrix(iris, diagonal="histogram")
> scatterplotMatrix(iris[1:4], diagonal="histogram")
What if we scaled the data? Would this show any changes in the scatter plots?
> scatterplotMatrix(scale(iris[1:4]), diagonal="histogram")
All we have done is rescale data we haven’t changed it any fundamental way.
Compare the outputs here:
> cor( iris[1:4])
> cor( scale(iris[1:4]))
Let’s use the pairs function again but this time we will assign colours based on class / type.
> classholder[iris$Species == 'setosa'] = 1
> classholder[iris$Species == 'versicolor'] = 2
> classholder[iris$Species == 'virginica'] = 3
> pairs(iris[,-5], pch=23, bg=c('red', 'blue', 'yellow')[classholder])
Let’s apply PCA to the iris data:
> iris.pca <- princomp(scale(iris[-5]))
> summary(iris.pca)
Again, this gives us the standard deviation of each component, and the proportion of variance explained by each component. And in order to decide how many principal components should be retained, it is common to summarise the results of a principal components analysis by making a scree plot:
> screeplot(iris.pca, type="lines")
There are few different ways of visualizing the results of a PCA, e.g,:
> biplot(iris.pca)
Interpretation: the relative location of the points can be interpreted. Points that are close together correspond to observations that have similar scores on the components displayed in the plot. To the extent that these components fit the data well, the points also correspond to observations that have similar values on the variables.
The arrow is a vector and it points in the direction that is most like the variable represented by the vector. This is the direction which has the highest squared multiple correlation with the principal components. The length of the vector is proportional to the squared multiple correlation between the fitted values for the variable and the variable itself.
[1] The method will look at in R by default assumes a correlation matrix (with standardized data) but we use raw data values and choose a covariance matrix.
