In statistics and machine learning, the hierarchical Dirichlet process (HDP) is a nonparametric Bayesian approach to clustering grouped data.[1][2] It uses a Dirichlet process for each group of data, with the Dirichlet processes for all groups sharing a base distribution which is itself drawn from a Dirichlet process. This method allows groups to share statistical strength via sharing of clusters across groups. The base distribution being drawn from a Dirichlet process is important, because draws from a Dirichlet process are atomic probability measures, and the atoms will appear in all group-level Dirichlet processes. Since each atom corresponds to a cluster, clusters are shared across all groups. It was developed by Yee Whye Teh, Michael I. Jordan, Matthew J. Beal and David Blei and published in the Journal of the American Statistical Association in 2006, as a formalization and generalization of the infinite hidden Markov model published in 2002.
This model description is sourced from the original paper. The HDP is a model for grouped data. What this means is that the data items come in multiple distinct groups. For example, in a topic model words are organized into documents, with each document formed by a bag (group) of words (data items). Indexing groups by $j=1,\dots,J$, suppose each group consists of data items $x_{j1},\dots,x_{jn}$.
The HDP is parameterized by a base distribution $H$ that governs the a priori distribution over data items, and a number of concentration parameters that govern the a priori number of clusters and the amount of sharing across groups. The $j$-th group is associated with a random probability measure $G_j$ whose distribution is given by a Dirichlet process:
\begin{align}
G_j \mid G_0 &\sim \operatorname{DP}(\alpha_j, G_0)
\end{align}
where $\alpha_j$ is the concentration parameter associated with the group, and $G_0$ is the base distribution shared across all groups. In turn, the common base distribution is itself Dirichlet process distributed:
\begin{align}
G_0 &\sim \operatorname{DP}(\alpha_0, H)
\end{align}
with concentration parameter $\alpha_0$ and base distribution $H$. Finally, to relate the Dirichlet processes back to the observed data, each data item $x_{ji}$ is associated with a latent parameter $\theta_{ji}$:
\begin{align}
\theta_{ji} \mid G_j &\sim G_j \\
x_{ji} \mid \theta_{ji} &\sim F(\theta_{ji})
\end{align}
The first line states that each parameter has a prior distribution given by $G_j$, while the second line states that each data item has a distribution $F(\theta_{ji})$ parameterized by its associated latent parameter. The resulting model is called an HDP mixture model.
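As an illustration of this generative process, the following is a minimal Python sketch using a truncated stick-breaking approximation. The Gaussian choices for the base distribution $H$ and the observation model $F$, the truncation level K, the concentration values, and all function names are illustrative assumptions, not part of the HDP definition.

import numpy as np

rng = np.random.default_rng(0)

def stick_breaking(alpha, K):
    # Truncated stick-breaking weights of a Dirichlet process (K atoms kept).
    betas = rng.beta(1.0, alpha, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    weights = betas * remaining
    return weights / weights.sum()  # renormalize to absorb the truncated tail

# Global measure G0 ~ DP(alpha0, H), with H = N(0, 10^2) as an assumed base distribution.
alpha0, K = 1.0, 50
pi0 = stick_breaking(alpha0, K)             # global atom masses pi_{0k}
theta_star = rng.normal(0.0, 10.0, size=K)  # shared atoms theta*_k drawn from H

# Group measures Gj ~ DP(alphaj, G0): because G0 is atomic, every Gj reuses the same
# atoms theta*_k and only the masses change. For the truncated G0 the group masses
# follow pi_j ~ Dirichlet(alphaj * pi0).
def sample_group(alphaj, n):
    pij = rng.dirichlet(alphaj * pi0)       # group-specific masses pi_{jk}
    z = rng.choice(K, size=n, p=pij)        # latent cluster of each data item
    x = rng.normal(theta_star[z], 1.0)      # x_{ji} ~ F(theta_{ji}), F = N(theta, 1)
    return x, z, pij

x1, z1, pi1 = sample_group(alphaj=1.0, n=200)  # group 1
x2, z2, pi2 = sample_group(alphaj=1.0, n=200)  # group 2: same atoms, different masses

Both sampled groups reuse the same atoms theta*_k and differ only in their masses, which is how clusters come to be shared across groups in the atomic representation discussed next.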
To understand how the HDP implements a clustering model, and how clusters become shared across groups, recall that draws from a Dirichlet process are atomic probability measures with probability one. This means that the common base distribution $G_0$ has a form which can be written as:
\begin{align}
G_0 &= \sum_{k=1}^{\infty} \pi_{0k} \delta_{\theta^*_k}
\end{align}
where there are an infinite number of atoms, $\theta^*_k$, $k=1,2,\dots$, assuming that the overall base distribution $H$ has infinite support. Each atom is associated with a mass $\pi_{0k}$, and the masses sum to one since $G_0$ is a probability measure. Since $G_0$ is itself the base distribution for the group-specific Dirichlet processes, each $G_j$ has atoms given by the atoms of $G_0$, and can itself be written in the form:
\begin{align}
G_j &= \sum_{k=1}^{\infty} \pi_{jk} \delta_{\theta^*_k}
\end{align}
Thus the set of atoms is shared across all groups, with each group having its own group-specific atom masses. Relating this representation back to the observed data, we see that each data item is described by a mixture model:
\begin{align}
x_{ji} \mid G_j &\sim \sum_{k=1}^{\infty} \pi_{jk} F(\theta^*_k)
\end{align}
where the atoms $\theta^*_k$ play the role of the mixture component parameters, while the masses $\pi_{jk}$ play the role of the mixing proportions. In conclusion, each group of data is modeled using a mixture model, with mixture components shared across all groups but mixing proportions being group-specific.
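Continuing the sketch above, the group-specific mixture density implied by this representation can be evaluated directly from the shared atoms and a group's masses. SciPy's normal density stands in for the assumed unit-variance Gaussian $F$, and the truncation to K atoms again approximates the infinite sum.

import numpy as np
from scipy.stats import norm

def group_mixture_density(x, pij, theta_star):
    # p(x | Gj) = sum_k pi_{jk} F(x; theta*_k), truncated to the K retained atoms,
    # with F assumed to be a unit-variance Gaussian as in the sketch above.
    x = np.atleast_1d(x)
    return np.sum(pij[:, None] * norm.pdf(x[None, :], loc=theta_star[:, None], scale=1.0), axis=0)

# e.g. group_mixture_density(np.linspace(-20, 20, 5), pi1, theta_star) evaluates the
# density of group 1 on a small grid; swapping pi1 for pi2 changes only the mixing
# proportions, never the shared component parameters theta_star.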
The HDP mixture model is a natural nonparametric generalization of Latent Dirichlet allocation, where the number of topics can be unbounded and learnt from data. Here each group is a document consisting of a bag of words, each cluster is a topic, and each document is a mixture of topics. The HDP is also a core component of the infinite hidden Markov model,[3] which is a nonparametric generalization of the hidden Markov model allowing the number of states to be unbounded and learnt from data.[4]
The HDP can be generalized in a number of directions. The Dirichlet processes can be replaced by Pitman-Yor processes and Gamma processes, resulting in the Hierarchical Pitman-Yor process and the Hierarchical Gamma process. The hierarchy can be deeper, with multiple levels of groups arranged in a hierarchy. Such an arrangement has been exploited in the sequence memoizer, a Bayesian nonparametric model for sequences which has a multi-level hierarchy of Pitman-Yor processes. In addition, the Bayesian Multi-Domain Learning (BMDL) model derives domain-dependent latent representations of overdispersed count data based on hierarchical negative binomial factorization for accurate cancer subtyping, even if the number of samples for a specific cancer type is small.[5]