SW 983 LECTURE NOTES
CLUSTER ANALYSIS
Definition: Any of several procedures in multivariate analysis designed to determine whether individuals (or other units of analysis) are similar or dissimilar enough to fall into groups or clusters.
Aka: Q analysis, typology construction, classification analysis, and numerical taxonomy.
More an art than a science
The cluster variate is the set of variables representing the characteristics used to compare objects in the cluster analysis. Cluster analysis is the only multivariate technique that does not estimate the variate empirically but instead uses the variate as specified by the researcher. The focus of cluster analysis is on the comparison of objects based on the variate, not on the estimation of the variate itself.
Cluster analysis can be characterized as descriptive, atheoretical, and noninferential. It has no statistical basis for drawing inferences from a sample to a population, and it is used primarily as an exploratory technique. The solutions are not unique: cluster membership for any given number of clusters depends on many elements of the procedure, and many different solutions can be obtained by varying one or more of those elements. Moreover, cluster analysis will always create clusters regardless of the “true” existence of any structure in the data. Finally, the cluster solution is totally dependent upon the variables used as the basis for the similarity measure.
Related to, but significantly different from, two other techniques:
- Factor analysis – FA groups variables while CA groups subjects. That said, the SPSS program allows you to perform cluster analysis on either variables or cases. FA is sometimes done prior to CA, with the factor scores used to perform CA. Hair et al. cite some research suggesting this may not be appropriate (p. 491). FA generally has a more theoretical basis and provides statistical tests. CA is more ad hoc.
- Discriminant analysis – DA is similar in that the goal is to classify a set of cases into groups or categories, but in CA neither the number nor the members of the groups are known in advance. DA can be performed in conjunction with CA after the clusters are identified to derive weights for the variate or clustering variables.
Steps:
1. Look for outliers – Requires a multivariate test. Mahalanobis D² provides such a test (see p. 66 of text). Available as an output variable in the SPSS regression module.
2. Choose a distance measure – Squared Euclidean distance most commonly used.
3. Standardize the data (z scores) to avoid giving extra weight to variables with larger variance.
4. Choose a clustering method – Hierarchical agglomerative method most commonly used.
5. Determine how many groups – Could be guided by theory but is mostly a subjective judgment. Want clusters with fair representation (numbers of items) that maximize between-cluster differences and minimize within-cluster differences. The distance measure provides a more objective indication of this but is still relative.
6. Describe the clusters – May involve naming. What variables are hanging together for a particular cluster? Run mean or proportion comparisons across clusters. Ability to make sense of the clusters may influence your decision about how many clusters should exist.
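Steps 2 through 6 can be sketched in a few lines of standard-library Python. This is only an illustration under stated assumptions, not what SPSS runs: the six cases are made-up toy data, average linkage is one of several possible agglomerative rules (the notes do not name one), and the number of clusters is simply fixed at two. Outlier screening (step 1) is left out.

```python
from statistics import mean, pstdev

# Toy data (illustrative only): each row is one case measured on two
# variables with very different scales, e.g. age and income.
cases = [
    [25, 30000], [27, 32000], [26, 31000],
    [52, 80000], [55, 85000], [53, 82000],
]

def standardize(rows):
    """Step 3: convert each column to z scores so no variable dominates."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(x - m) / s for x, m, s in zip(row, mus, sds)] for row in rows]

def sq_euclidean(a, b):
    """Step 2: squared Euclidean distance between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def average_linkage(c1, c2, data):
    """Mean squared Euclidean distance over all between-cluster pairs."""
    return mean(sq_euclidean(data[i], data[j]) for i in c1 for j in c2)

def agglomerate(data, n_clusters):
    """Step 4: start with one cluster per case, then merge the closest pair
    repeatedly until only n_clusters remain."""
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        pairs = [(average_linkage(clusters[a], clusters[b], data), a, b)
                 for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        _, a, b = min(pairs)
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

z = standardize(cases)
clusters = agglomerate(z, n_clusters=2)  # step 5: k = 2 is an assumption here

# Step 6: describe the clusters by comparing raw-scale means across clusters.
profiles = [[mean(cases[i][v] for i in c) for v in range(2)] for c in clusters]
for members, profile in zip(clusters, profiles):
    print("cases", sorted(members), "means", profile)
```

Rerunning with `standardize` skipped is a quick way to see why step 3 matters: the income column would then drive nearly all of the distance.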
Cluster Analysis with SPSS
SPSS provides three modules for doing cluster analysis. The TwoStep procedure is new and the easiest to run. Also available are the Hierarchical and K-Means cluster procedures.
The TwoStep Cluster Analysis procedure is an exploratory tool designed to reveal natural groupings (or clusters) within a data set that would otherwise not be apparent. The algorithm employed by this procedure has several desirable features that differentiate it from traditional clustering techniques:
- Handling of categorical and continuous variables. By assuming variables to be independent, a joint multinomial-normal distribution can be placed on categorical and continuous variables.
- Automatic selection of number of clusters. By comparing the values of a model-choice criterion across different clustering solutions, the procedure can automatically determine the optimal number of clusters.
- Scalability. By constructing a cluster features (CF) tree that summarizes the records, the TwoStep algorithm allows you to analyze large data files.
Example. Retail and consumer product companies regularly apply clustering techniques to data that describe their customers' buying habits, gender, age, income level, etc. These companies tailor their marketing and product development strategies to each consumer group to increase sales and build brand loyalty.
Statistics. The procedure produces information criteria (AIC or BIC) by numbers of clusters in the solution, cluster frequencies for the final clustering, and descriptive statistics by cluster for the final clustering.
Plots. The procedure produces bar charts of cluster frequencies, pie charts of cluster frequencies, and variable importance charts.
Distance Measure. This selection determines how the similarity between two clusters is computed.
- Log-likelihood. The likelihood measure places a probability distribution on the variables. Continuous variables are assumed to be normally distributed, while categorical variables are assumed to be multinomial. All variables are assumed to be independent.
- Euclidean. The Euclidean measure is the "straight line" distance between two clusters. It can be used only when all of the variables are continuous.
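The two measures can be illustrated for continuous variables with a short sketch. The Euclidean version is taken here as the distance between cluster centers. The log-likelihood version is written as the drop in maximized log-likelihood when two clusters are merged, assuming independent normal variables; this is a generic formulation chosen for illustration and is not necessarily the exact quantity SPSS computes. The data and function names are hypothetical.

```python
import math
from statistics import mean, pvariance

# Toy clusters (illustrative only): rows are cases, columns are
# continuous variables.
cluster_a = [[1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
cluster_b = [[6.0, 7.0], [7.0, 9.0], [8.0, 8.0]]

def centroid(cluster):
    """Mean of each variable within the cluster."""
    return [mean(col) for col in zip(*cluster)]

def euclidean_between(c1, c2):
    """Straight-line distance between the two cluster centers."""
    return math.dist(centroid(c1), centroid(c2))

def normal_loglik(cluster):
    """Maximized log-likelihood under independent normal variables.
    Assumes every variable has nonzero variance within the cluster."""
    n = len(cluster)
    return sum(-0.5 * n * (math.log(2 * math.pi * pvariance(col)) + 1)
               for col in zip(*cluster))

def loglik_distance(c1, c2):
    """Drop in log-likelihood when the two clusters are merged into one."""
    return normal_loglik(c1) + normal_loglik(c2) - normal_loglik(c1 + c2)

print("Euclidean:", euclidean_between(cluster_a, cluster_b))
print("Log-likelihood:", loglik_distance(cluster_a, cluster_b))
```

Under this formulation, merging two clusters can never raise the total log-likelihood, so the distance is non-negative and grows as the clusters become less alike.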
Number of Clusters. This selection allows you to specify how the number of clusters is to be determined.
- Determine automatically. The procedure will automatically determine the "best" number of clusters, using the criterion specified in the Clustering Criterion group. Optionally, enter a positive integer specifying the maximum number of clusters that the procedure should consider.
- Specify fixed. Allows you to fix the number of clusters in the solution. Enter a positive integer.
Count of Continuous Variables. This group provides a summary of the continuous variable standardization specifications made in the Options dialog box.
Clustering Criterion. This selection determines how the automatic clustering algorithm determines the number of clusters. Either the Bayesian Information Criterion (BIC) or the Akaike Information Criterion (AIC) can be specified.
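To make the idea of comparing a model-choice criterion across solutions concrete, here is a minimal sketch using the textbook definitions AIC = -2 ln L + 2p and BIC = -2 ln L + p ln n. The data, the three candidate partitions, and the per-cluster normal likelihood are all assumptions made for illustration; SPSS builds its own candidate solutions and its internal formulas may differ in detail.

```python
import math
from statistics import pvariance

# Toy one-variable data (illustrative only) with two separated groups.
data = [1.0, 1.2, 0.8, 1.1, 0.9, 5.0, 5.3, 4.7, 5.1, 4.9]

# Hypothetical candidate solutions: the same cases split into
# 1, 2, or 3 clusters.
solutions = {
    1: [data],
    2: [data[:5], data[5:]],
    3: [data[:5], data[5:8], data[8:]],
}

def normal_loglik(values):
    """Maximized normal log-likelihood of one cluster on one variable."""
    n = len(values)
    return -0.5 * n * (math.log(2 * math.pi * pvariance(values)) + 1)

def information_criteria(clusters):
    """Return (AIC, BIC) for one clustering solution."""
    n = sum(len(c) for c in clusters)
    loglik = sum(normal_loglik(c) for c in clusters)
    n_params = 2 * len(clusters)  # one mean and one variance per cluster
    aic = -2 * loglik + 2 * n_params
    bic = -2 * loglik + n_params * math.log(n)
    return aic, bic

criteria = {k: information_criteria(c) for k, c in solutions.items()}
best_by_aic = min(criteria, key=lambda k: criteria[k][0])
best_by_bic = min(criteria, key=lambda k: criteria[k][1])

for k, (aic, bic) in criteria.items():
    print(f"{k} clusters: AIC={aic:.2f}  BIC={bic:.2f}")
print("smallest AIC at k =", best_by_aic, "| smallest BIC at k =", best_by_bic)
```

Smaller values are better for both criteria. Adding a cluster always improves the fit term, so the penalty on the number of parameters is what keeps the "best" solution from being the one with the most clusters.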
If you have a small number of cases, and want to choose between several methods for cluster formation, variable transformation, and measuring the dissimilarity between clusters, try the Hierarchical Cluster Analysis procedure. The Hierarchical Cluster Analysis procedure also allows you to cluster variables instead of cases.
The K-Means Cluster Analysis procedure is limited to continuous variables, but it can be used to analyze large data sets and allows you to save the distance from the cluster center for each object. Unlike the Hierarchical Cluster Analysis procedure, which results in a series of solutions corresponding to different numbers of clusters, the K-Means procedure produces only one solution for the number of clusters requested by the user.
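The general k-means idea can be sketched as follows: fix k, assign each case to its nearest center, recompute the centers, and repeat until assignments stop changing. This is a bare-bones illustration rather than the SPSS implementation; the toy points and the choice to seed the centers with the first k cases are assumptions.

```python
import math
from statistics import mean

# Toy data (illustrative only): two continuous variables on similar scales.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]

def k_means(points, k, max_iter=100):
    """Return one solution for the requested k: labels, centers, and each
    case's distance from its own cluster center."""
    centers = list(points[:k])  # assumption: first k cases seed the centers
    labels = None
    for _ in range(max_iter):
        # Assign every case to the nearest current center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stable, so the centers are final
        labels = new_labels
        # Move each center to the mean of the cases assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(mean(col) for col in zip(*members))
    distances = [math.dist(p, centers[lab]) for p, lab in zip(points, labels)]
    return labels, centers, distances

labels, centers, distances = k_means(points, k=2)
print(labels)
print(centers)
print([round(d, 3) for d in distances])
```

A different k means a separate run, which is the practical contrast with the hierarchical procedure's single pass yielding a whole series of solutions.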