Determining the number of groups while clustering categorical data

Cláudia Silvestre, ESCS, Portugal

Margarida Cardoso, BRU-UNIDE, ISCTE-IUL, Portugal

Mário Figueiredo, IT, IST, Portugal

Abstract: Cluster analysis for categorical data has been an active area of research. A well-known problem is determining the number of clusters, which is unknown and must be inferred from the data.

In order to estimate the number of clusters one often resorts to information criteria such as the Bayesian Information Criterion (BIC), the Akaike Information Criterion (AIC), Minimum Message Length (MML) and the Integrated Completed Likelihood (ICL); see, e.g., Biernacki et al. (2000). In the present work we adopt the approach developed by Figueiredo and Jain (2002) for clustering continuous data. They use an MML criterion (Wallace and Boulton, 1968) to select the number of clusters and an EM algorithm to estimate the model parameters. This EM variant seamlessly integrates model estimation and selection in a single algorithm. For clustering categorical data we assume that the data are generated by a finite mixture of multinomial distributions and implement a new EM algorithm, following a previous version (Silvestre et al., 2008).
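The idea of folding model selection into EM can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain batch EM rather than the component-wise EM of Figueiredo and Jain (2002), it keeps only the MML-style weight update in which a component whose expected mass falls below half its number of free parameters is removed, and the function name `fit_mixture`, its parameters and the smoothing constant `1e-6` are all assumptions made for illustration.

```python
import math
import random


def fit_mixture(data, n_levels, k_max=6, n_iter=200, seed=0):
    """Illustrative EM for a mixture of independent categorical variables.

    data     : list of tuples of category indices (one tuple per observation)
    n_levels : n_levels[j] is the number of categories of variable j
    k_max    : initial (maximum) number of components
    Returns (weights, theta) for the surviving components.
    """
    rng = random.Random(seed)
    # Free parameters per component: (categories - 1) for each variable.
    n_params = sum(c - 1 for c in n_levels)

    # Initialisation: equal weights, randomly perturbed category probabilities.
    weights = [1.0 / k_max] * k_max
    theta = []
    for _ in range(k_max):
        comp = []
        for c in n_levels:
            raw = [rng.random() + 0.5 for _ in range(c)]
            total = sum(raw)
            comp.append([r / total for r in raw])
        theta.append(comp)

    for _ in range(n_iter):
        # E-step: posterior probability of each component for each observation.
        resp = []
        for x in data:
            p = [w * math.prod(t[j][v] for j, v in enumerate(x))
                 for w, t in zip(weights, theta)]
            total = sum(p)
            resp.append([pk / total for pk in p])

        # M-step for the weights with the MML-style penalty:
        # a component with expected mass <= n_params / 2 gets weight zero.
        mass = [sum(r[k] for r in resp) for k in range(len(weights))]
        shrunk = [max(0.0, m - n_params / 2.0) for m in mass]
        keep = [k for k, s in enumerate(shrunk) if s > 0.0]
        if not keep:  # guard for this sketch: never drop every component
            best = max(range(len(mass)), key=mass.__getitem__)
            keep, shrunk[best] = [best], mass[best]
        norm = sum(shrunk[k] for k in keep)

        # M-step for the category probabilities of the surviving components.
        new_theta = []
        for k in keep:
            comp = []
            for j, c in enumerate(n_levels):
                counts = [1e-6] * c  # small smoothing constant (assumption)
                for x, r in zip(data, resp):
                    counts[x[j]] += r[k]
                total = sum(counts)
                comp.append([v / total for v in counts])
            new_theta.append(comp)

        weights = [shrunk[k] / norm for k in keep]
        theta = new_theta

    return weights, theta
```

Calling `fit_mixture(data, n_levels)` with a deliberately large `k_max` returns only the components that survive the penalised weight update, so `len(weights)` serves as the estimated number of clusters in this sketch. A criterion such as BIC, by contrast, is typically evaluated by fitting a separate model for each candidate number of clusters.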

Results obtained with synthetic data sets are encouraging. The main advantage of the proposed approach over the above-mentioned information criteria appears to be its speed of execution, which is especially relevant when dealing with large data sets.

Keywords

cluster analysis, model selection, EM algorithm, categorical variables

References

Biernacki, C., Celeux, G. and Govaert, G. (2000). Assessing a Mixture Model for Clustering with the Integrated Completed Likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence 22, 719–725.

Figueiredo, M. A. T. and Jain, A. K. (2002). Unsupervised Learning of Finite Mixture Models. IEEE Transactions on Pattern Analysis and Machine Intelligence 24, 381–396.

Silvestre, C., Figueiredo, M., and Cardoso, M. (2008). Clustering with Finite Mixture Models and Categorical Variables. Proceedings of the International Conference on Computational Statistics – COMPSTAT’2008, Physica-Verlag, Porto, Portugal.

Wallace, C. and Boulton, D. (1968). An information measure for classification. The Computer Journal 11, 195–209.