In statistics, cluster analysis is the algorithmic grouping of objects into homogeneous groups based on numerical measurements. Model-based clustering bases this on a statistical model for the data, usually a mixture model. This has several advantages, including a principled statistical basis for clustering, and ways to choose the number of clusters, to choose the best clustering model, to assess the uncertainty of the clustering, and to identify outliers that do not belong to any group.
Suppose that for each of $n$ observations we have measurements on $d$ variables, denoted by $y_i = (y_{i,1}, \ldots, y_{i,d})$ for observation $i$. Model-based clustering expresses the probability density of $y_i$ as a finite mixture of $G$ component densities:

$$p(y_i) = \sum_{g=1}^{G} \tau_g f_g(y_i \mid \theta_g),$$

where $f_g$ is a probability density function with parameter vector $\theta_g$, and $\tau_g$ is the probability that an observation belongs to the $g$th component, with $\sum_{g=1}^{G} \tau_g = 1$.
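As a concrete numerical illustration of the mixture density above, the following sketch (with hypothetical parameter values chosen purely for the example) evaluates $p(y_i)$ as a weighted sum of $G = 2$ multivariate normal component densities:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Hypothetical two-component (G = 2) mixture in d = 2 dimensions.
tau = np.array([0.6, 0.4])                           # mixture probabilities tau_g, sum to 1
mus = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]   # component means mu_g
sigmas = [np.eye(2), 2.0 * np.eye(2)]                # component covariance matrices Sigma_g

def mixture_density(y):
    """p(y) = sum_g tau_g * f_g(y | mu_g, Sigma_g)."""
    return sum(t * multivariate_normal.pdf(y, mean=m, cov=s)
               for t, m, s in zip(tau, mus, sigmas))

p = mixture_density(np.array([1.0, 1.0]))
```

Here each `multivariate_normal.pdf` term plays the role of a component density $f_g$, and the entries of `tau` play the role of the mixture probabilities $\tau_g$.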
The most common model for continuous data takes $f_g$ to be a multivariate normal distribution with mean vector $\mu_g$ and covariance matrix $\Sigma_g$, so that $\theta_g = (\mu_g, \Sigma_g)$; this defines a Gaussian mixture model. The parameters $\tau_g$ and $\theta_g$ for $g = 1, \ldots, G$ are typically estimated by maximum likelihood using the expectation-maximization (EM) algorithm.
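As an illustrative sketch of EM estimation (using scikit-learn's general-purpose `GaussianMixture` on simulated data, not the software discussed later in this article), the following fits $\tau_g$, $\mu_g$ and $\Sigma_g$ and reports both hard cluster assignments and posterior membership probabilities, the latter quantifying clustering uncertainty:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Simulate two well-separated Gaussian clusters in d = 2 dimensions.
y = np.vstack([rng.normal(0.0, 1.0, size=(100, 2)),
               rng.normal(5.0, 1.0, size=(100, 2))])

# Fit a G = 2 Gaussian mixture by maximum likelihood via the EM algorithm.
gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0).fit(y)

labels = gmm.predict(y)       # hard cluster assignments
probs = gmm.predict_proba(y)  # posterior membership probabilities (clustering uncertainty)
```

`gmm.weights_`, `gmm.means_` and `gmm.covariances_` then hold the fitted $\tau_g$, $\mu_g$ and $\Sigma_g$.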
Bayesian inference is also often used for inference about finite mixture models. The Bayesian approach also allows for the case where the number of components, $G$, is infinite, using a Dirichlet process prior, yielding a Dirichlet process mixture model for clustering.
An advantage of model-based clustering is that it provides statistically principled ways to choose the number of clusters. Each different choice of the number of groups $G$ corresponds to a different mixture model, and $G$ can then be chosen using model selection criteria such as the Bayesian information criterion (BIC).
For data of high dimension $d$, a full covariance matrix for each mixture component involves many parameters, so more parsimonious models based on the eigenvalue decomposition of the covariance matrices have been proposed:

$$\Sigma_g = \lambda_g D_g A_g D_g^{\mathsf T},$$

where $D_g$ is the matrix of eigenvectors of $\Sigma_g$, $A_g = \operatorname{diag}\{A_{1,g}, \ldots, A_{d,g}\}$ is a diagonal matrix whose entries are proportional to the eigenvalues of $\Sigma_g$ in descending order, and $\lambda_g$ is the associated constant of proportionality. Then $\lambda_g$ governs the volume of the $g$th cluster, $A_g$ its shape, and $D_g$ its orientation.
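The decomposition can be computed directly from a covariance matrix. The following sketch (with an arbitrary example matrix, and normalizing so that $\det A_g = 1$, a common convention) recovers $\lambda_g$, $A_g$ and $D_g$:

```python
import numpy as np

# Hypothetical component covariance matrix Sigma_g (symmetric positive definite).
sigma_g = np.array([[4.0, 1.0],
                    [1.0, 2.0]])
d = sigma_g.shape[0]

# Eigendecomposition: columns of D_g are eigenvectors, w the eigenvalues.
w, D_g = np.linalg.eigh(sigma_g)
w, D_g = w[::-1], D_g[:, ::-1]      # sort eigenvalues in descending order

# Volume factor lambda_g: constant of proportionality, chosen so det(A_g) = 1.
lambda_g = np.prod(w) ** (1.0 / d)
A_g = np.diag(w / lambda_g)         # diagonal shape matrix with det(A_g) = 1

# Reconstruction: Sigma_g = lambda_g * D_g A_g D_g^T
recon = lambda_g * D_g @ A_g @ D_g.T
```

Constraining $\lambda_g$, $A_g$ or $D_g$ to be shared across components is what produces the parsimonious model family described next.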
Each of the volume, shape and orientation of the clusters can be constrained to be equal across clusters (E) or allowed to vary (V); the orientation can also be the identity (I), i.e. spherical with identical eigenvalues. This yields 14 possible clustering models, shown in this table:
| Model | Description | # Covariance parameters |
|---|---|---|
| EII | Spherical, equal volume | 1 |
| VII | Spherical, varying volume | 9 |
| EEI | Diagonal, equal volume & shape | 4 |
| VEI | Diagonal, equal shape | 12 |
| EVI | Diagonal, equal volume, varying shape | 28 |
| VVI | Diagonal, varying volume & shape | 36 |
| EEE | Equal volume, shape & orientation | 10 |
| VEE | Equal shape & orientation | 18 |
| EVE | Equal volume & orientation | 34 |
| VVE | Equal orientation | 42 |
| EEV | Equal volume & shape | 58 |
| VEV | Equal shape | 66 |
| EVV | Equal volume | 82 |
| VVV | Varying volume, shape & orientation | 90 |
It can be seen that many of these models are far more parsimonious than the unconstrained VVV model, which has 90 covariance parameters when $G = 9$ and $d = 4$, as in the table above.
Several of these models correspond to well-known heuristic clustering methods. For example, k-means clustering is equivalent to estimation of the EII clustering model using the classification EM algorithm. The Bayesian information criterion (BIC) can be used to choose the best clustering model as well as the number of clusters. It can also be used as the basis for a method to choose the variables in the clustering model, eliminating variables that are not useful for clustering.
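BIC-based selection of both the covariance model and the number of clusters can be sketched as follows (using scikit-learn on simulated data; it offers only four covariance structures rather than the 14 models above, and defines BIC so that lower is better, the opposite sign convention to some statistics software):

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Simulated data with two well-separated clusters in d = 3 dimensions.
y = np.vstack([rng.normal(0.0, 1.0, size=(150, 3)),
               rng.normal(6.0, 1.0, size=(150, 3))])

# Grid search over the number of clusters G and the covariance constraint,
# keeping the fit with the best (lowest) BIC.
best_bic, best_model = np.inf, None
for g in range(1, 6):
    for cov in ("spherical", "diag", "tied", "full"):
        gmm = GaussianMixture(n_components=g, covariance_type=cov,
                              random_state=0).fit(y)
        bic = gmm.bic(y)
        if bic < best_bic:
            best_bic, best_model = bic, gmm
```

The winning `covariance_type` here is a rough analogue of choosing among the E/V/I constraints, and the winning `n_components` is the chosen $G$.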
Different Gaussian model-based clustering methods have been developed with an eye to handling high-dimensional data. These include the pgmm method, which is based on the mixture of factor analyzers model, and the HDclassif method, based on the idea of subspace clustering.
The mixture-of-experts framework extends model-based clustering to include covariates.
We illustrate the method with a dataset consisting of three measurements (glucose, insulin, sspg) on 145 subjects for the purpose of diagnosing diabetes and the type of diabetes present. The subjects were clinically classified into three groups: normal, chemical diabetes and overt diabetes, but we use this information only for evaluating clustering methods, not for classifying subjects.
The BIC plot shows the BIC values for each combination of the number of clusters, $G$, and the clustering model.
The classification plot shows the classification of the subjects by model-based clustering. The classification was quite accurate, with a 12% error rate as defined by the clinical classification. Other well-known clustering methods performed worse, with higher error rates: single-linkage clustering with 46%, average-linkage clustering with 30%, complete-linkage clustering also with 30%, and k-means clustering with 28%.
An outlier in clustering is a data point that does not belong to any of the clusters. One way of modeling outliers in model-based clustering is to include an additional mixture component that is very dispersed, for example with a uniform distribution. Another approach is to replace the multivariate normal densities with multivariate $t$-distributions, whose heavier tails make the clustering more robust to outliers.
Sometimes one or more clusters deviate strongly from the Gaussian assumption. If a Gaussian mixture is fitted to such data, a strongly non-Gaussian cluster will often be represented by several mixture components rather than a single one. In that case, cluster merging can be used to find a better clustering. A different approach is to use mixtures of complex component densities to represent non-Gaussian clusters.
Clustering multivariate categorical data is most often done using the latent class model. This assumes that the data arise from a finite mixture model, where within each cluster the variables are independent.
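A minimal sketch of this local-independence structure for binary data (with hypothetical, not estimated, parameters): within class $g$ each variable $j$ is an independent Bernoulli with success probability $\theta_{g,j}$, so the mixture density is a weighted product of Bernoulli terms:

```python
import numpy as np

# Hypothetical latent class model: G = 2 classes, d = 3 binary variables.
tau = np.array([0.5, 0.5])             # class probabilities
theta = np.array([[0.9, 0.8, 0.7],     # P(y_j = 1 | class 1)
                  [0.2, 0.1, 0.3]])    # P(y_j = 1 | class 2)

def lca_density(y):
    """p(y) = sum_g tau_g * prod_j theta_gj^y_j * (1 - theta_gj)^(1 - y_j)."""
    y = np.asarray(y)
    comp = np.prod(theta**y * (1 - theta)**(1 - y), axis=1)
    return float(tau @ comp)

def posterior(y):
    """Posterior class membership probabilities for y, by Bayes' rule."""
    y = np.asarray(y)
    comp = tau * np.prod(theta**y * (1 - theta)**(1 - y), axis=1)
    return comp / comp.sum()
```

The posterior membership probabilities are what a latent class analysis uses to assign observations to clusters.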
These arise when variables are of different types, such as continuous, categorical or ordinal data. A latent class model for mixed data assumes local independence between the variables. The location model relaxes the local independence assumption. The clustMD approach assumes that the observed variables are manifestations of underlying continuous Gaussian latent variables.
The simplest model-based clustering approach for multivariate count data is based on finite mixtures with locally independent Poisson distributions, similar to the latent class model. More realistic approaches allow for dependence and overdispersion in the counts. These include methods based on the multivariate Poisson distribution, the multivariate Poisson-log normal distribution, the integer-valued autoregressive (INAR) model and the Gaussian Cox model.
These consist of sequences of categorical values from a finite set of possibilities, such as life course trajectories. Model-based clustering approaches include group-based trajectory and growth mixture models, and a distance-based mixture model.
These arise when individuals rank objects in order of preference. The data are then ordered lists of objects, arising in voting, education, marketing and other areas. Model-based clustering methods for rank data include mixtures of Plackett-Luce models, mixtures of Benter models, and mixtures of Mallows models.
These consist of the presence, absence or strength of connections between individuals or nodes, and are widespread in the social sciences and biology. The stochastic blockmodel carries out model-based clustering of the nodes in a network by assuming that there is a latent clustering and that connections are formed independently given the clustering. The latent position cluster model assumes that each node occupies a position in an unobserved latent space, that these positions arise from a mixture of Gaussian distributions, and that presence or absence of a connection is associated with distance in the latent space.
Much of the model-based clustering software is in the form of publicly and freely available R packages. Many of these are listed in the CRAN Task View on Cluster Analysis and Finite Mixture Models. The most widely used such package is mclust, which clusters continuous data and has been downloaded over 8 million times.
Other packages cluster categorical data using the latent class model, and the clustMD package clusters mixed data, including continuous, binary, ordinal and nominal variables. Further packages carry out model-based clustering for a range of component distributions and data types, and several implement model-based clustering with covariates.
Model-based clustering was first proposed in 1950 by Paul Lazarsfeld for clustering multivariate discrete data, in the form of the latent class model.
In 1959, Lazarsfeld gave a lecture on latent structure analysis at the University of California-Berkeley, where John H. Wolfe was an M.A. student. This led Wolfe to think about how to do the same thing for continuous data, and in 1965 he did so, proposing the Gaussian mixture model for clustering. He also produced the first software for estimating it, called NORMIX. Day (1969), working independently, was the first to publish a journal article on the approach. However, Wolfe deserves credit as the inventor of model-based clustering for continuous data.
Murtagh and Raftery (1984) developed a model-based clustering method based on the eigenvalue decomposition of the component covariance matrices. McLachlan and Basford (1988) was the first book on the approach, advancing methodology and sparking interest. Banfield and Raftery (1993) coined the term "model-based clustering", introduced the family of parsimonious models, described an information criterion for choosing the number of clusters, proposed the uniform model for outliers, and introduced the mclust software. Celeux and Govaert (1995) showed how to perform maximum likelihood estimation for the models. Thus, by 1995 the core components of the methodology were in place, laying the groundwork for extensive development since then.
Free download: https://math.univ-cotedazur.fr/~cbouveyr/MBCbook/