DSCI 415 – Cluster Analysis – Assignment #5 (125 pts.)
Due 10/22/17 by Midnight

Problem 1 – Tennis Racquets (review material and some new stuff)

You are working as a statistical consultant for a tennis racquet manufacturer, helping with some statistical analysis they hope can be of use in an upcoming advertising campaign. The company has selected 31 new models of racquets produced by that company and others and measured six variables, which represent various characteristics of the racquets:

  • X1 – length of racquet (in inches)
  • X2 – static weight (in ounces) – this is how much the racquet actually weighs on a scale
  • X3 – balance (in inches) – this is a measure of whether the racquet is heavier on the head end or on the handle end; more negative values indicate a more head-heavy racquet; positive values indicate a more head-light racquet; zero indicates an even balance.
  • X4 – swingweight – this is a complicated measure of how heavy the racquet feels when it is swung.
  • X5 – headsize (in square inches) – the size of the racquet face (the strung area)
  • X6 – beamwidth (in mm) – the width of the cross-section (edge) of the racquet

The questions that the CEO of the company would like answered include:

a) Are there particular racquet(s) that are highly unusual in terms of the measured characteristics? If so, identify them. (2 pts.)

b) Are there notable associations/relationships between some of the variables? If so, describe them and present them graphically. (3 pts.)

c) Is there a way to graphically represent the raw data for the 31 racquets and draw conclusions about the data set from such a graph? (2 pts.)

d) Can we find a few indices that describe the variation in the data set using fewer dimensions than the original set of variables? If so, what are those indices? Is there a convenient interpretation of any of the indices? (4 pts.)

e) Can we graphically display the data in a low number of dimensions using such indices? What conclusions about the racquets (individual racquets or groups of racquets) can you draw from such a graph? (3 pts.)

f) It is believed that novice players prefer lighter, wider, and bigger-faced racquets that are easier to swing, more forgiving, and more powerful (although sacrificing touch and control). What useful information, e.g., for an ad campaign, could be gleaned from this data set as related to this belief? (4 pts.)

g) Can you find groups/clusters of similar racquets in terms of their characteristics? If so, can you find a group of racquets that would be good for novice players using the specifications listed in part (f)? (6 pts.)
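
As a rough illustration of the kind of computation parts (d) and (e) call for, the sketch below runs a principal components analysis on standardized variables. It uses six columns of R's built-in mtcars data purely as a stand-in (the racquet data are not reproduced here), so the variable names and any patterns in the output are assumptions of the example, not results for the racquets.

```r
# Stand-in data: six numeric columns of the built-in mtcars data set.
# These are NOT the racquet measurements; they only illustrate the steps.
X <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]

# PCA on standardized variables, since the columns have different units.
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Proportion of total variance carried by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
print(round(cumsum(var_explained), 3))

# Loadings: the weights that define the first two indices (components).
print(round(pca$rotation[, 1:2], 2))

# Scores on the first two components, labeled by observation name.
plot(pca$x[, 1:2], type = "n", xlab = "PC1", ylab = "PC2")
text(pca$x[, 1:2], labels = rownames(X), cex = 0.7)
```

The signs and relative sizes of the loadings are what suggest an interpretation of each index; observations far from the bulk of the points in the score plot are candidates for a closer look.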

Problem 2 – Boston Housing Data

The Boston Housing data set was the basis for a 1978 paper by Harrison and Rubinfeld, which discussed approaches for using housing market data to estimate the willingness to pay for clean air. The authors employed a hedonic price model, based on the premise that the price of the property is determined by structural attributes (such as size, age, condition) as well as neighborhood attributes (such as crime rate, accessibility, environmental factors). This type of approach is often used to quantify the effects of environmental factors that affect the price of a property.

We will not be modeling these data, but rather will be using MDS and cluster analysis to examine structure in these data.

Data were gathered for 506 census tracts in the Boston Standard Metropolitan Statistical Area (SMSA) in 1970, collected from a number of sources including the 1970 US Census and the Boston Metropolitan Area Planning Committee. The variables used to develop the Harrison-Rubinfeld housing value equation are listed in the table below. Treat all but the Charles River indicator variable (CHAS) as numeric for purposes of conducting MDS and cluster analysis with these data.

Variables Used in the Harrison-Rubinfeld Housing Value Equation

Variable / Type / Definition / Source
TOWN / Labeling / Name of town/suburb the census tract is located in / 1970 U.S. Census
TOWNnum / Labeling / TOWN coded numerically / (none listed)
LON / Labeling / Longitude of the center of the census tract / 1970 U.S. Census Tract maps
LAT / Labeling / Latitude of the center of the census tract / 1970 U.S. Census Tract maps
CMEDV / Financial / Median value of homes in the census tract in thousands of dollars / 1970 U.S. Census
RM / Structural / Average number of rooms in homes in the census tract / 1970 U.S. Census
AGE / Structural / % of units built prior to 1940 / 1970 U.S. Census
B / Neighborhood / Black % of population / 1970 U.S. Census
LSTAT / Neighborhood / % of population that is lower socioeconomic status / 1970 U.S. Census
CRIM / Neighborhood / Crime rate / FBI (1970)
ZN / Neighborhood / % of residential land zoned for lots greater than 25,000 sq. ft. / Metro Area Planning Commission (1972)
INDUS / Neighborhood / % of non-retail business acres (proxy for industry) / Mass. Dept. of Commerce & Development (1965)
TAX / Neighborhood / Property tax rate / Mass. Taxpayers Foundation (1970)
PTRATIO / Neighborhood / Pupil-teacher ratio / Mass. Dept. of Ed (’71-’72)
CHAS / Neighborhood / Dummy variable indicating proximity to Charles River (1 = on river, 0 = not on river) / 1970 U.S. Census Tract maps
DIS / Accessibility / Weighted distances to major employment centers in area / Schnare dissertation (Unpublished, 1973)
RAD / Accessibility / Index of accessibility to radial highways / MIT Boston Project
NOX / Air Pollution / Nitrogen oxide concentrations (pphm) / TASSIM

DATA Reference:

Harrison, D., and Rubinfeld, D. L., “Hedonic Housing Prices and the Demand for Clean Air,” Journal of Environmental Economics and Management, 5 (1978), 81-102.

For the purposes of conducting multidimensional scaling and cluster analysis of these data, you will use only the variables CMEDV through NOX in the table above. It is important to note that CHAS is a dichotomous nominal/categorical variable (i.e., binary).
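
Because one of the variables is binary, the choice of distance deserves some thought. The sketch below shows two options you might weigh: Gower's coefficient via daisy() in the cluster package, which handles mixed numeric and categorical columns, and Euclidean distance on z-scores with the indicator left out. It uses the built-in mtcars data as a stand-in, with its 0/1 column am playing the role that CHAS plays in the Boston data; nothing here is computed from the Boston data.

```r
library(cluster)  # provides daisy()

# Stand-in data: mtcars, with the 0/1 column `am` standing in for a
# binary indicator such as CHAS (these are NOT the Boston data).
dat <- mtcars[, c("mpg", "hp", "wt", "qsec", "am")]
dat$am <- factor(dat$am)  # declare the indicator as categorical

# Gower's coefficient range-scales numeric columns and uses simple
# matching for factors, so mixed variable types share one distance.
d_gower <- daisy(dat, metric = "gower")

# Alternative: drop the indicator and use Euclidean distance on z-scores.
d_euclid <- dist(scale(dat[, c("mpg", "hp", "wt", "qsec")]))

print(summary(as.vector(d_gower)))
```

Either object can be passed to cmdscale(), hclust(), or silhouette(), which makes it easy to check how sensitive your conclusions are to the choice of distance.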

a) Use metric MDS with an appropriate distance matrix for these data with .

Then construct a plot of the results labeling the points using TOWN and color coding the TOWN labels using TOWNnum. Comment on what you see in this plot. (3 pts.)
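
A minimal sketch of metric (classical) MDS followed by a labeled, color-coded plot is given below. It assumes Euclidean distance on standardized variables and a two-dimensional solution, both of which are choices made for the illustration only. The data are the built-in USArrests data, not the Boston data, and the grouping variable grp is a hypothetical stand-in for TOWNnum.

```r
# Stand-in data: the built-in USArrests data (NOT the Boston data).
X <- scale(USArrests)   # standardize so no single variable dominates
d <- dist(X)            # Euclidean distances between rows

# Classical (metric) MDS in 2 dimensions; eig = TRUE also returns the
# eigenvalues and two goodness-of-fit measures.
mds <- cmdscale(d, k = 2, eig = TRUE)

# Hypothetical grouping, used only to color the labels (a stand-in for
# a numeric code such as TOWNnum).
grp <- as.integer(cut(USArrests$UrbanPop, breaks = 3))

plot(mds$points, type = "n", xlab = "Coordinate 1", ylab = "Coordinate 2")
text(mds$points, labels = rownames(USArrests), col = grp, cex = 0.7)
print(mds$GOF)
```

Swapping the first argument of plot() and text() for a two-column matrix of longitude and latitude gives the companion map needed for a spatial comparison.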

b) Does the lower-dimensional representation of the census tract characteristics from MDS seem to align with the actual spatial orientation of these census tracts? Explain. (3 pts.)
Note: To answer this question you will need to construct a plot of the census tracts using the same labeling and coloring as in part (a); for this plot, however, the X-axis will be longitude (LON) and the Y-axis will be latitude (LAT).

c) Given your answer to part (b), should MDS have preserved the spatial orientation of the census tracts in these data? Explain your answer. (3 pts.)

d) Use hierarchical cluster analysis (hclust or agnes) with three different linkage choices (complete, average, and Ward’s). Which linkage type do you think produces the best results? Explain/justify your answer. (5 pts.)
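
One way to put the three linkages side by side is sketched below using hclust() on the built-in USArrests data (a stand-in, not the Boston data). The cophenetic correlation is offered only as one possible numerical summary of how faithfully each dendrogram reproduces the original distances; it is an assumption of this example that you would want it, and it does not replace looking at the dendrograms themselves.

```r
# Stand-in data: the built-in USArrests data (NOT the Boston data).
X <- scale(USArrests)
d <- dist(X)

# In hclust(), "ward.D2" is the option that implements Ward's criterion
# when given ordinary (unsquared) Euclidean distances.
linkages <- c(complete = "complete", average = "average", ward = "ward.D2")
fits <- lapply(linkages, function(m) hclust(d, method = m))

# Cophenetic correlation: agreement between the original distances and
# the heights at which pairs of observations merge in the dendrogram.
coph <- sapply(fits, function(h) cor(as.vector(d), as.vector(cophenetic(h))))
print(round(coph, 3))

# Dendrograms side by side for a visual comparison.
op <- par(mfrow = c(1, 3))
for (nm in names(fits)) {
  plot(fits[[nm]], main = nm, labels = FALSE, xlab = "", sub = "")
}
par(op)
```

A higher cophenetic correlation does not by itself mean more useful clusters, so any justification should also consider cluster sizes and interpretability.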

e) Use the silhouette function to choose an “optimal”/“reasonable” number of clusters. For your choice of k, construct a plot of the first two principal components using TOWN as a label and color-coding the points by the k clusters. What do you see? Does this choice of k seem like a good one? Explain. (4 pts.)
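
The sketch below shows one way the silhouette() function in the cluster package can be used to compare candidate values of k: cut a dendrogram at each k and record the average silhouette width. The data (built-in USArrests), the Ward linkage, and the range 2 to 8 are all assumptions of the example.

```r
library(cluster)  # provides silhouette()

# Stand-in data: the built-in USArrests data (NOT the Boston data).
X <- scale(USArrests)
d <- dist(X)
hc <- hclust(d, method = "ward.D2")

# Average silhouette width for each candidate number of clusters.
ks <- 2:8
avg_sil <- sapply(ks, function(k) {
  sil <- silhouette(cutree(hc, k = k), d)
  mean(sil[, "sil_width"])
})
names(avg_sil) <- ks
print(round(avg_sil, 3))

# The k with the largest average width is one candidate for "optimal".
best_k <- ks[which.max(avg_sil)]
plot(ks, avg_sil, type = "b",
     xlab = "Number of clusters k", ylab = "Average silhouette width")
abline(v = best_k, lty = 2)
```

If several values of k give nearly the same average width, the silhouette criterion alone does not settle the question, and interpretability of the clusters becomes the deciding factor.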

f) Use k-means clustering along with the function fviz_nbclust in the factoextra library to find the “optimal”/“reasonable” number of clusters k based on these variables. Provide a plot showing how you chose your value for k. (4 pts.)

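library(factoextra)

# bos.mat is assumed to hold the numeric data matrix used for clustering
fviz_nbclust(bos.mat, kmeans, k.max = 15, method = "gap_stat")

fviz_nbclust(bos.mat, kmeans, k.max = 15, method = "silhouette")  # instructor's note: poor result here

fviz_nbclust(bos.mat, kmeans, k.max = 15, method = "wss")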

g) Use longitude and latitude to plot the census tracts again, using TOWN as a label and color-coding the points by the clusters found using k-means clustering with the value of k chosen in part (f). To what extent are the clusters spatially aligned? Should they be? Explain. (3 pts.)

h) Use the clust.grps function from the notes to summarize your clusters using the data matrix in the original scale (i.e., not the scaled version you used to perform the clustering). I do not expect you to discuss them all! Instead, choose two clusters that you feel look very distinct from one another and discuss their apparent differences on these attributes. (4 pts.)
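
The clust.grps function from the notes is not reproduced here, and the sketch below does not claim to match its output. It is a base R stand-in that shows the underlying idea: cluster on the scaled variables, then summarize each cluster on the original scale. The data (built-in USArrests), the seed, and the choice of three clusters are assumptions of the example.

```r
# Stand-in data: the built-in USArrests data (NOT the Boston data).
X_raw <- USArrests

# Cluster on standardized variables; k = 3 is an arbitrary choice here.
set.seed(415)
km <- kmeans(scale(X_raw), centers = 3, nstart = 25)

# Summarize on the ORIGINAL scale so the means are in interpretable units.
cluster_means <- aggregate(X_raw, by = list(cluster = km$cluster), FUN = mean)
cluster_sizes <- table(km$cluster)

print(cluster_sizes)
print(round(cluster_means, 1))
```

Comparing two rows of cluster_means variable by variable is a direct way to describe how two clusters differ.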

i) Perform PCA using the scaled variables and construct a plot of the results, again labeling the census tracts by TOWN and color-coding the points according to the clusters returned by kmeans. Are the clusters distinct in this plot? Should they be? (4 pts.)
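
A minimal sketch of a principal component score plot colored by k-means cluster follows. As before, the built-in USArrests data, the seed, and k = 3 are stand-ins chosen for illustration. The last line computes the share of total variance captured by the first two components, which is worth knowing when judging whether separation in the plane should be expected.

```r
# Stand-in data: the built-in USArrests data (NOT the Boston data).
X <- scale(USArrests)

set.seed(415)
km <- kmeans(X, centers = 3, nstart = 25)  # k = 3 is arbitrary here

# X is already standardized, so prcomp() needs no further scaling.
pca <- prcomp(X)
scores <- pca$x[, 1:2]

# Labeled score plot, colored by k-means cluster.
plot(scores, type = "n", xlab = "PC1", ylab = "PC2")
text(scores, labels = rownames(X), col = km$cluster, cex = 0.7)

# Share of total variance captured by the first two components.
pct2 <- sum(pca$sdev[1:2]^2) / sum(pca$sdev^2)
print(round(pct2, 3))
```

Clusters found in the full variable space can overlap in this plot when the first two components leave much of the variance unexplained.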

j) Perform two-way clustering using the cim function in the mixOmics library. Discuss what interesting things you see from this plot. (4 pts.)
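
The idea behind two-way clustering can be previewed with base R's heatmap(), which clusters rows and columns simultaneously and reorders the data matrix accordingly. To be clear, this is a swapped-in substitute and not the cim function the problem asks for; the data (built-in USArrests) and the Ward linkage are assumptions of the sketch.

```r
# Stand-in data: the built-in USArrests data (NOT the Boston data).
X <- scale(USArrests)

# heatmap() draws dendrograms on both margins. scale = "none" is used
# because X has already been standardized column by column.
hm <- heatmap(X, scale = "none",
              distfun = dist,
              hclustfun = function(d) hclust(d, method = "ward.D2"),
              margins = c(6, 6))

# Row and column orderings chosen by the two dendrograms.
print(rownames(X)[hm$rowInd])
print(colnames(X)[hm$colInd])
```

Blocks of similar color in the reordered image correspond to groups of observations that share a profile across a group of variables, which is the kind of structure to look for in the cim output as well.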

Problem 3 – Satellite Images and Soil Type
One frame of Landsat MSS imagery consists of four digital images of the same scene in different spectral bands. Two of these are in the visible region (corresponding approximately to green and red regions of the visible spectrum) and two are in the (near) infra-red. Each pixel is an 8-bit binary word, with 0 corresponding to black and 255 to white. The spatial resolution of a pixel is about 80m x 80m. Each image contains such pixels.

The database is a (tiny) sub-area of a scene, consisting of pixels. Each line of data corresponds to a 3 × 3 square neighborhood of pixels completely contained within the sub-area. Each line contains the pixel values in the four spectral bands of each of the 9 pixels in the neighborhood and a nominal variable indicating the soil type of the central pixel.

Each soil type is one of the following classes: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble, and very damp grey soil.

The data are given in random order and certain lines of data have been removed, so you cannot reconstruct the original image from this data set. In each line of data, the four spectral values for the top-left pixel (TL1 – TL4) are given first, followed by the four spectral values for the top-middle pixel (TM1 – TM4) and then those for the top-right pixel (TR1 – TR4), and so on, with the pixels read out in sequence left-to-right and top-to-bottom. These data are contained in the files SatImage.JMP and SatImage.txt (comma delimited).

a) Use multidimensional scaling and/or PCA to plot these data in dimensions using color-coding and text labels to show soil type. Are the 6 distinct soil types well delineated in this plot? (4 pts.)

b) Use some form(s) of cluster analysis to form clusters. How do the clusters match the 6 soil types in these data? (6 pts.)

table(myclusters, SATimage$Type)  # here myclusters contains the clusters you created
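
To show how such a cross-tabulation can be read, the sketch below builds one for the built-in iris data, where Species plays the role of the known soil type. The purity measure at the end is an addition of this example rather than something the problem requires: it is the fraction of observations that fall in the majority class of their cluster.

```r
# Stand-in data: the built-in iris data (NOT the satellite image data).
set.seed(415)
X <- scale(iris[, 1:4])
myclusters <- kmeans(X, centers = 3, nstart = 25)$cluster

# Rows are clusters, columns are the known classes.
tab <- table(cluster = myclusters, type = iris$Species)
print(tab)

# Purity: share of observations in the majority class of their cluster.
purity <- sum(apply(tab, 1, max)) / sum(tab)
print(round(purity, 3))
```

Cluster labels are arbitrary, so a good match shows up as one dominant cell per row rather than as large counts on the diagonal.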

c) Repeat part (a) but this time color-code and label points by cluster. Are the clusters well delineated in this plot? (4 pts.)

d) Just because we know these satellite images are of 6 different soil types, it does not necessarily mean that 6 is the “optimal” number of clusters for these data. Determine the “optimal” number of clusters for a method of clustering of your choosing. Then construct a lower-dimensional representation of these data, color-coding by cluster. Discuss. (6 pts.)

e) Use the cim function in the mixOmics package to perform two-way clustering for these data. Can you visualize the clusters you found in part (d)? Explain. (4 pts.)

Problem 4 – NHL Forwards

Using the NHL forwards data from Assignment 3, perform a thorough cluster analysis of these players. (40 pts.)

Definition of Thorough:

  • Tried different distance metrics
  • Tried different linkages (if hierarchical clustering)
  • Tried different clustering methods (e.g. k-means vs. hierarchical)
  • Chose an “optimal” number of clusters somehow and explained how it was obtained.
  • Discussed some of the clusters in terms of what makes them distinct on the basis of the variables used and what players are in them. For example, is there an elite player cluster? A 4th line thugs cluster? Etc.
  • Graphics showing lower dimensional representation of the data with clustering results displayed. Two-way clustering should be part of your arsenal here.
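
As one possible starting point for the comparisons in the list above, the sketch below contrasts two distance metrics and two clustering methods by cross-tabulating the resulting partitions. It uses the built-in USArrests data rather than the NHL forwards data, and the value k = 4, the average linkage, and the seed are arbitrary choices made for the illustration.

```r
# Stand-in data: the built-in USArrests data (NOT the NHL forwards data).
X <- scale(USArrests)
k <- 4  # arbitrary number of clusters for the illustration

# Two distance metrics on the same standardized data.
d_euc <- dist(X, method = "euclidean")
d_man <- dist(X, method = "manhattan")

# Hierarchical clustering (average linkage) under each metric.
cl_euc <- cutree(hclust(d_euc, method = "average"), k = k)
cl_man <- cutree(hclust(d_man, method = "average"), k = k)

# k-means on the same data for a method comparison.
set.seed(415)
cl_km <- kmeans(X, centers = k, nstart = 25)$cluster

# Cross-tabulations show how much the partitions agree.
print(table(euclidean = cl_euc, manhattan = cl_man))
print(table(hierarchical = cl_euc, kmeans = cl_km))
```

Partitions that agree across metrics and methods are easier to defend; where they disagree, looking at which observations move between clusters is often informative.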