Data in R (And S)

161.323 2004

Institute of Information Sciences and Technology

Massey University

R: Distances between cases, and cluster analysis

Library

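The functions that you will need for finding distances and performing cluster analysis are contained in a package called “mva”. (From R 1.9.0 onwards, the contents of “mva” were merged into the “stats” package, which is loaded by default, so no extra step is needed in those versions.) The package may already be loaded. To check, type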

> search()

If the “mva” package is not listed, type

> library("mva")

Example data set

We will use the Cars data set to illustrate.

> cars <- read.table("Cars.text", header=TRUE)

To print information about the first 4 cars, type

> cars[1:4,]

The names of the cars are stored in the first variable (Brand, a character variable). It is best to store these names as the row names of the table, because later commands use the row names as labels. We could do this with

> row.names(cars) <- cars$Brand

but it is easier to modify the original command that read the data, telling it that the first column contains the car names.

> cars <- read.table("Cars.text", header=TRUE, row.names=1)

We will continue as though the data were read in this way (so there is no Brand variable).
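
As a minimal sketch of what “row.names=1” does, the following reads a small made-up table from a text string rather than from Cars.text. The column names and values are hypothetical and are not the real Cars data.

```r
# Hypothetical three-row table; the real Cars.text has more variables.
txt <- "Brand Price Wt
Alpha 10000 2500
Beta 15000 3000
Gamma 12000 2800"

# Without row.names=1, Brand is an ordinary column.
plain <- read.table(text = txt, header = TRUE)

# With row.names=1, the first column becomes the row names instead.
named <- read.table(text = txt, header = TRUE, row.names = 1)

names(plain)      # "Brand" "Price" "Wt"
names(named)      # "Price" "Wt"
row.names(named)  # "Alpha" "Beta" "Gamma"
```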

Distances

Although data sets in R are most commonly saved and read as data frames, many of the statistical functions expect a numeric matrix rather than a data frame. For these functions, convert the data frame into a matrix before passing it in. The function “as.matrix()” does the conversion.

> cars.m <- as.matrix(cars[3:8])

Note that “cars[3:8]” is a data frame containing variables 3 to 8 of the original data frame (from Reliability to Cylinders). A matrix must be made from only numeric variables so we cannot include the variable Country.
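
To see why a non-numeric variable must be left out, here is a small sketch with a made-up data frame (the values are hypothetical). Including a character column turns the whole matrix into character data.

```r
# Hypothetical data frame with one character and two numeric variables.
df <- data.frame(Country = c("Japan", "USA"),
                 Price   = c(10000, 15000),
                 Wt      = c(2500, 3000))

# Including Country forces every entry to become a character string.
is.numeric(as.matrix(df))        # FALSE

# Selecting only the numeric variables gives a numeric matrix.
is.numeric(as.matrix(df[2:3]))   # TRUE
```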

Before finding distances, we should standardise the variables. Otherwise the distance measure will be dominated by the variables with the largest standard deviations (Wt in this example).

> cars.std.m <- scale(cars.m, center=TRUE, scale=TRUE)
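
The following sketch, on a small made-up matrix, shows what the standardisation does: each column has its mean subtracted and is then divided by its standard deviation. The matrix “m” and its values are hypothetical.

```r
# Hypothetical 3 x 2 matrix with variables on very different scales.
m <- matrix(c(1, 2, 3,
              100, 200, 600),
            ncol = 2,
            dimnames = list(c("a", "b", "c"), c("x", "y")))

std <- scale(m, center = TRUE, scale = TRUE)

# The same result by hand: subtract column means, divide by column sds.
by.hand <- sweep(sweep(m, 2, colMeans(m)), 2, apply(m, 2, sd), "/")

all.equal(as.vector(std), as.vector(by.hand))  # TRUE
apply(std, 2, sd)                              # both columns now have sd 1
```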

We can now find distances with the command

> cars.dist <- dist(cars.std.m, method="euclidean")

This distance matrix is big, so don’t try printing it!
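
If you want to look at a few of the distances without printing everything, one possible approach is to convert the result to an ordinary square matrix and pick out entries. The sketch below uses a small made-up matrix “x” (hypothetical values) so that the answer can be checked by hand.

```r
# Hypothetical data: 4 cases measured on 2 variables.
x <- matrix(c(0, 3, 0, 1,
              0, 4, 1, 1),
            ncol = 2,
            dimnames = list(c("a", "b", "c", "d"), NULL))

d <- dist(x, method = "euclidean")

# Convert to an ordinary square matrix to pick out a few entries.
d.m <- as.matrix(d)
d.m[1:2, 1:2]

# Distance between cases a and b, checked by hand: sqrt(3^2 + 4^2) = 5.
d.m["a", "b"]
sqrt(sum((x["a", ] - x["b", ])^2))
```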

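Options for the method parameter include…

euclidean    ordinary Euclidean distance

manhattan    city-block distance

binary    only relevant to 0/1 variables: the proportion of variables on which the two cases differ, among those where at least one of them is 1. Joint zeros are ignored, so this is not the same as the simple matching index given in section 5.5 of Manly.

canberra    not mentioned in section 5.4 of Manly, but can be used for proportions.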
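
A quick way to get a feel for these measures is to try them on data small enough to check by hand. In this sketch the matrices “x” and “z” are made up for illustration.

```r
# Two hypothetical cases measured on three variables.
x <- rbind(a = c(1, 0, 2),
           b = c(4, 0, 6))

dist(x, method = "euclidean")  # sqrt(3^2 + 0^2 + 4^2) = 5
dist(x, method = "manhattan")  # 3 + 0 + 4 = 7

# Hypothetical 0/1 data for the binary measure.
z <- rbind(a = c(1, 1, 0, 0),
           b = c(1, 0, 1, 0))

# The cases differ on 2 of the 3 variables where at least one is 1,
# so the distance is 2/3 (the joint zero in the last column is ignored).
dist(z, method = "binary")
```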

A distance matrix is a different type of R object from an ordinary matrix. If you have created a square matrix of distances by other means, you can convert it into a distance matrix with the command

> my.dist <- as.dist(my.square.matrix)
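
As a small sketch of this conversion, the square matrix below is made up for illustration. Note that as.dist() reads the lower triangle, so the matrix you supply should be symmetric.

```r
# Hypothetical symmetric matrix of distances between three items.
my.square.matrix <- matrix(c(0, 2, 5,
                             2, 0, 4,
                             5, 4, 0),
                           nrow = 3,
                           dimnames = list(c("a", "b", "c"),
                                           c("a", "b", "c")))

my.dist <- as.dist(my.square.matrix)

class(my.dist)  # "dist"
my.dist         # prints only the lower triangle
```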

Cluster analysis

Cluster analysis is based on a distance matrix, such as that produced by the dist() command. Performing a hierarchical cluster analysis has two stages: running the analysis and displaying the results. First, the command hclust() is used to perform the analysis.

> cluster.results <- hclust(cars.dist, method="single")

The ‘method’ parameter describes how the distance between two clusters is defined from the individual distances. Possible values include…

single    distance between nearest neighbours

complete    distance between furthest neighbours

average    average of the distances between all pairs of cases, one from each cluster

centroid    distance between the centroids of the two groups
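
To see how the choice of method changes the analysis, it can help to compare the heights at which clusters are merged. The sketch below uses four made-up cases on a single variable (hypothetical values) so that each height can be verified by hand.

```r
# Four hypothetical cases on a single variable.
x <- matrix(c(0, 1, 3, 7), ncol = 1,
            dimnames = list(c("a", "b", "c", "d"), NULL))
d <- dist(x)

# The height component records the between-cluster distance at each merge.
hclust(d, method = "single")$height    # 1 2 4
hclust(d, method = "complete")$height  # 1 3 7
hclust(d, method = "average")$height   # 1, 2.5, then 17/3 (about 5.67)
```

In all three cases a and b join first and c joins them next; only the distance at which each merge is recorded differs.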

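The results of a cluster analysis are usually displayed in a dendrogram. This is produced with the command plclust(). (Recent versions of R no longer provide plclust(); there, the plot() function applied to the hclust() result, e.g. plot(cluster.results), draws the dendrogram and accepts the same hang and labels parameters.)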

> plclust(cluster.results)

Another display option (which I prefer) is to use the parameter “hang=-1”. This extends all dendrogram branches down to 0, so that the labels line up along the bottom of the plot.

> plclust(cluster.results, hang=-1)

If you have produced your distance matrix from a data set with its row.names set, the dendrogram will be labelled with these row names. If your distance matrix does not contain row names, the dendrogram branches will be labelled with the numbers 1, 2, … instead. You can avoid this by providing a vector of names for the individuals in the labels parameter. For example, if you had read the cars data set without the “row.names=1” parameter, you could have got the car names printed in the dendrogram with the command

> plclust(cluster.results, labels=cars$Brand)

To see the countries, try

> plclust(cluster.results, labels=cars$Country)
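
The sketch below illustrates where the labels come from, using made-up data. It assumes a version of R in which the dendrogram is drawn with plot() rather than plclust(); the matrix “x” and the vector “my.names” are hypothetical.

```r
# Hypothetical data without row names.
x <- matrix(c(0, 1, 3, 7), ncol = 1)
hc <- hclust(dist(x), method = "single")
hc$labels   # NULL, so the leaves would be numbered 1, 2, ...

# Supply names at plotting time through the labels parameter.
my.names <- c("a", "b", "c", "d")
plot(hc, labels = my.names, hang = -1)

# With row names set beforehand, the labels travel with the distances.
rownames(x) <- my.names
hclust(dist(x), method = "single")$labels   # "a" "b" "c" "d"
```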
