In statistics, a latent class model (LCM) is a model for clustering multivariate discrete data. It assumes that the data arise from a mixture of discrete distributions, within each of which the variables are independent. It is called a latent class model because the class to which each data point belongs is unobserved, or latent.
Latent class analysis (LCA) is a subset of structural equation modeling, used to find groups or subtypes of cases in multivariate categorical data. These subtypes are called "latent classes".[1] [2]
Confronted with a situation as follows, a researcher might choose to use LCA to understand the data: Imagine that symptoms a-d have been measured in a range of patients with diseases X, Y, and Z, and that disease X is associated with the presence of symptoms a, b, and c, disease Y with symptoms b, c, d, and disease Z with symptoms a, c and d.
The LCA will attempt to detect the presence of latent classes (the disease entities), creating patterns of association in the symptoms. As in factor analysis, the LCA can also be used to classify case according to their maximum likelihood class membership.[1] [3]
Because the criterion for solving the LCA is to achieve latent classes within which there is no longer any association of one symptom with another (because the class is the disease which causes their association), and the set of diseases a patient has (or class a case is a member of) causes the symptom association, the symptoms will be "conditionally independent", i.e., conditional on class membership, they are no longer related.[1]
Within each latent class, the observed variables are statistically independent. This is an important aspect. Usually the observed variables are statistically dependent. By introducing the latent variable, independence is restored in the sense that within classes variables are independent (local independence). We then say that the association between the observed variables is explained by the classes of the latent variable (McCutcheon, 1987).
In one form, the latent class model is written as
p | |
i1,i2,\ldots,iN |
≈
T | |
\sum | |
t |
pt
N | |
\prod | |
n |
n | |
p | |
in, t |
,
T
pt
n | |
p | |
in,t |
For a two-way latent class model, the form is
pij ≈
T | |
\sum | |
t |
ptpitpjt.
The probability model used in LCA is closely related to the Naive Bayes classifier. The main difference is that in LCA, the class membership of an individual is a latent variable, whereas in Naive Bayes classifiers the class membership is an observed label.
There are a number of methods with distinct names and uses that share a common relationship. Cluster analysis is, like LCA, used to discover taxon-like groups of cases in data. Multivariate mixture estimation (MME) is applicable to continuous data, and assumes that such data arise from a mixture of distributions: imagine a set of heights arising from a mixture of men and women. If a multivariate mixture estimation is constrained so that measures must be uncorrelated within each distribution, it is termed latent profile analysis. Modified to handle discrete data, this constrained analysis is known as LCA. Discrete latent trait models further constrain the classes to form from segments of a single dimension: essentially allocating members to classes on that dimension: an example would be assigning cases to social classes on a dimension of ability or merit.
As a practical instance, the variables could be multiple choice items of a political questionnaire. The data in this case consists of a N-way contingency table with answers to the items for a number of respondents. In this example, the latent variable refers to political opinion and the latent classes to political groups. Given group membership, the conditional probabilities specify the chance certain answers are chosen.
LCA may be used in many fields, such as: collaborative filtering,[4] Behavior Genetics[5] and Evaluation of diagnostic tests.[6]