Variational Bayesian methods are a family of techniques for approximating intractable integrals arising in Bayesian inference and machine learning. They are typically used in complex statistical models consisting of observed variables (usually termed "data") as well as unknown parameters and latent variables, with various sorts of relationships among the three types of random variables, as might be described by a graphical model. As typical in Bayesian inference, the parameters and latent variables are grouped together as "unobserved variables". Variational Bayesian methods are primarily used for two purposes:
1. To provide an analytical approximation to the posterior probability of the unobserved variables, in order to do statistical inference over these variables.
2. To derive a lower bound for the marginal likelihood (sometimes called the evidence) of the observed data, i.e. the marginal probability of the data given the model, with marginalization performed over the unobserved variables. This is typically used for model selection, the general idea being that a higher marginal likelihood for a given model indicates a better fit of the data by that model.
In the former purpose (that of approximating a posterior probability), variational Bayes is an alternative to Monte Carlo sampling methods—particularly, Markov chain Monte Carlo methods such as Gibbs sampling—for taking a fully Bayesian approach to statistical inference over complex distributions that are difficult to evaluate directly or sample from. In particular, whereas Monte Carlo techniques provide a numerical approximation to the exact posterior using a set of samples, variational Bayes provides a locally optimal, exact analytical solution to an approximation of the posterior.
Variational Bayes can be seen as an extension of the expectation–maximization (EM) algorithm from maximum likelihood (ML) or maximum a posteriori (MAP) estimation of the single most probable value of each parameter to fully Bayesian estimation which computes (an approximation to) the entire posterior distribution of the parameters and latent variables. As in EM, it finds a set of optimal parameter values, and it has the same alternating structure as does EM, based on a set of interlocked (mutually dependent) equations that cannot be solved analytically.
For many applications, variational Bayes produces solutions of comparable accuracy to Gibbs sampling at greater speed. However, deriving the set of equations used to update the parameters iteratively often requires a large amount of work compared with deriving the comparable Gibbs sampling equations. This is the case even for many models that are conceptually quite simple, as is demonstrated below in the case of a basic non-hierarchical model with only two parameters and no latent variables.
In variational inference, the posterior distribution over a set of unobserved variables Z = \{Z_1, \ldots, Z_n\} given some data X is approximated by a so-called variational distribution Q(Z):
P(Z \mid X) \approx Q(Z).
The distribution Q(Z) is restricted to belong to a family of distributions of simpler form than P(Z \mid X) (e.g. a family of Gaussian distributions), selected with the intention of making Q(Z) similar to the true posterior P(Z \mid X).
The similarity (or dissimilarity) is measured in terms of a dissimilarity function d(Q; P), and hence inference is performed by selecting the distribution Q(Z) that minimizes d(Q; P).
The most common type of variational Bayes uses the Kullback–Leibler divergence (KL-divergence) of Q from P as the choice of dissimilarity function. This choice makes this minimization tractable. The KL-divergence is defined as
D_{KL}(Q \parallel P) \triangleq \sum_Z Q(Z) \log \frac{Q(Z)}{P(Z \mid X)}.
Note that Q and P are reversed from what one might expect. This use of reversed KL-divergence is conceptually similar to the expectation–maximization algorithm. (Using the KL-divergence in the other way produces the expectation propagation algorithm.)
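For concreteness, the following short Python snippet (an illustration not drawn from the article, assuming NumPy; the distributions p and q are arbitrary stand-ins for a posterior and its approximation) contrasts the "reverse" divergence D_KL(Q ∥ P) minimized by variational Bayes with the "forward" divergence D_KL(P ∥ Q) minimized by expectation propagation.

```python
# Contrast reverse KL (variational Bayes) with forward KL (expectation propagation)
# for two small discrete distributions.
import numpy as np

def kl_divergence(q, p):
    """Kullback-Leibler divergence sum_z q(z) * log(q(z) / p(z))."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * np.log(q / p)))

p = np.array([0.7, 0.2, 0.1])   # stand-in for the (intractable) posterior P(Z | X)
q = np.array([0.6, 0.3, 0.1])   # stand-in for the variational approximation Q(Z)

print("reverse KL, D(Q || P):", kl_divergence(q, p))  # minimized by variational Bayes
print("forward KL, D(P || Q):", kl_divergence(p, q))  # minimized by expectation propagation
```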
Variational techniques are typically used to form an approximation for:
P(Z \mid X) = \frac{P(X \mid Z) P(Z)}{P(X)} = \frac{P(X \mid Z) P(Z)}{\int_Z P(X, Z')\, dZ'}
The marginalization over Z needed to compute P(X) in the denominator is typically intractable, for example because the search space of Z is combinatorially large. Therefore, we seek an approximation, using Q(Z) \approx P(Z \mid X).
See main article: Evidence lower bound.
Given that
P(Z \mid X) = \frac{P(X, Z)}{P(X)},
the KL-divergence above can also be written as
D_{KL}(Q \parallel P) = \sum_Z Q(Z) \left[ \log \frac{Q(Z)}{P(Z, X)} + \log P(X) \right] = \sum_Z Q(Z) \left[ \log Q(Z) - \log P(Z, X) \right] + \sum_Z Q(Z) \left[ \log P(X) \right]
Because P(X) is a constant with respect to Z, and because \sum_Z Q(Z) = 1 (since Q(Z) is a distribution), we have
D_{KL}(Q \parallel P) = \sum_Z Q(Z) \left[ \log Q(Z) - \log P(Z, X) \right] + \log P(X)
which, according to the definition of expected value (for a discrete random variable), can be written as follows
D_{KL}(Q \parallel P) = \operatorname{E}_{Q} \left[ \log Q(Z) - \log P(Z, X) \right] + \log P(X)
which can be rearranged to become
\log P(X) = D_{KL}(Q \parallel P) - \operatorname{E}_{Q} \left[ \log Q(Z) - \log P(Z, X) \right] = D_{KL}(Q \parallel P) + \mathcal{L}(Q)
As the log-evidence \log P(X) is fixed with respect to Q, maximizing the final term \mathcal{L}(Q) minimizes the KL divergence of Q from P. By appropriate choice of Q, \mathcal{L}(Q) becomes tractable to compute and to maximize. Hence we have both an analytical approximation Q for the posterior P(Z \mid X) and a lower bound \mathcal{L}(Q) for the log-evidence \log P(X) (since the KL-divergence is non-negative).
The lower bound \mathcal{L}(Q) is known as the (negative) variational free energy, in analogy with thermodynamic free energy, because it can also be expressed as a negative energy \operatorname{E}_{Q}[\log P(Z, X)] plus the entropy of Q. The term \mathcal{L}(Q) is also known as the evidence lower bound, abbreviated ELBO, to emphasize that it is a lower bound on the log-evidence of the data.
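The identity above can be checked numerically on a toy discrete model. The following sketch (illustrative values only, assuming NumPy) verifies that \log P(X) equals D_{KL}(Q \parallel P) plus \mathcal{L}(Q) = \operatorname{E}_{Q}[\log P(Z, X) - \log Q(Z)], and that the latter is indeed a lower bound on the log-evidence.

```python
# Numeric check of log P(X) = D_KL(Q || P) + L(Q) for a toy discrete model
# with joint P(Z, X) evaluated at a single observed value of X.
import numpy as np

# Hypothetical joint probabilities P(Z = z, X = x_obs) for three values of Z.
p_joint = np.array([0.10, 0.06, 0.04])
p_x = p_joint.sum()                      # evidence P(X = x_obs)
p_posterior = p_joint / p_x              # exact posterior P(Z | X = x_obs)

q = np.array([0.5, 0.3, 0.2])            # some variational distribution Q(Z)

kl = np.sum(q * np.log(q / p_posterior))           # D_KL(Q || P)
elbo = np.sum(q * (np.log(p_joint) - np.log(q)))   # L(Q) = E_Q[log P(Z,X) - log Q(Z)]

print(np.log(p_x), kl + elbo)            # the two numbers agree
print(elbo <= np.log(p_x))               # the ELBO is indeed a lower bound -> True
```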
By the generalized Pythagorean theorem of Bregman divergence, of which KL-divergence is a special case, it can be shown that:[1]
D_{KL}(Q \parallel P) \geq D_{KL}(Q \parallel Q^{*}) + D_{KL}(Q^{*} \parallel P), \quad \forall Q^{*} \in \mathcal{C}
where \mathcal{C} is a convex set and the equality holds if
Q = Q^{*} \triangleq \arg\min_{Q \in \mathcal{C}} D_{KL}(Q \parallel P).
In this case, the global minimizer Q^{*}(Z) = q^{*}(Z_1 \mid Z_2)\, q^{*}(Z_2) = q^{*}(Z_2 \mid Z_1)\, q^{*}(Z_1), with Z = \{Z_1, Z_2\}, can be found as follows:
q^{*}(Z_2) = \frac{P(X)}{\zeta(X)} \frac{P(Z_2 \mid X)}{\exp\left( D_{KL}\left( q^{*}(Z_1 \mid Z_2) \parallel P(Z_1 \mid Z_2, X) \right) \right)} = \frac{1}{\zeta(X)} \exp \operatorname{E}_{q^{*}(Z_1 \mid Z_2)} \left( \log \frac{P(Z, X)}{q^{*}(Z_1 \mid Z_2)} \right),
in which the normalizing constant is:
\zeta(X) = P(X) \int_{Z_2} \frac{P(Z_2 \mid X)}{\exp\left( D_{KL}\left( q^{*}(Z_1 \mid Z_2) \parallel P(Z_1 \mid Z_2, X) \right) \right)} = \int_{Z_2} \exp \operatorname{E}_{q^{*}(Z_1 \mid Z_2)} \left( \log \frac{P(Z, X)}{q^{*}(Z_1 \mid Z_2)} \right).
The term \zeta(X) is often called the evidence lower bound (ELBO) in practice, since P(X) \geq \zeta(X) = \exp(\mathcal{L}(Q^{*})), as shown above.
By interchanging the roles of Z_1 and Z_2, we can iteratively compute the approximations q^{*}(Z_1) and q^{*}(Z_2) of the true model's marginals P(Z_1 \mid X) and P(Z_2 \mid X), respectively. Although this iterative scheme is guaranteed to converge monotonically, the converged Q^{*} is only a local minimizer of D_{KL}(Q \parallel P).
If the constrained space \mathcal{C} is confined to independent distributions, i.e. q^{*}(Z_1 \mid Z_2) = q^{*}(Z_1), the above iterative scheme becomes the so-called mean field approximation Q^{*}(Z) = q^{*}(Z_1)\, q^{*}(Z_2), as shown below.
The variational distribution Q(Z) is usually assumed to factorize over some partition of the latent variables, i.e. for some partition of the latent variables Z into Z_1, \ldots, Z_M,
Q(Z) = \prod_{i=1}^{M} q_i(Z_i \mid X)
It can be shown using the calculus of variations (hence the name "variational Bayes") that the "best" distribution q_j^{*} for each of the factors q_j (in terms of the distribution minimizing the KL divergence, as described above) satisfies:
q_j^{*}(Z_j \mid X) = \frac{\exp\left( \operatorname{E}_{i \neq j} [\ln p(Z, X)] \right)}{\int \exp\left( \operatorname{E}_{i \neq j} [\ln p(Z, X)] \right) dZ_j}
where \operatorname{E}_{i \neq j} [\ln p(Z, X)] is the expectation of the logarithm of the joint probability of the data and latent variables, taken with respect to q^{*} over all variables not in the partition Z_j.
In practice, we usually work in terms of logarithms, i.e.:
\ln q_j^{*}(Z_j \mid X) = \operatorname{E}_{i \neq j} [\ln p(Z, X)] + \text{constant}
The constant in the above expression is related to the normalizing constant (the denominator in the expression above for q_j^{*}) and is usually reinstated by inspection, as the rest of the expression can usually be recognized as a known type of distribution (e.g. Gaussian, gamma, etc.).
Using the properties of expectations, the expression \operatorname{E}_{i \neq j} [\ln p(Z, X)] can usually be simplified into a function of the fixed hyperparameters of the prior distributions and of expectations (and sometimes higher moments, such as the variance) of the latent variables not in the current partition, i.e. variables not included in Z_j. This creates circular dependencies between the parameters of the distribution over one partition and the expectations of variables in the other partitions, which naturally suggests an iterative, EM-like algorithm.
In other words, for each of the partitions of variables, by simplifying the expression for the distribution over the partition's variables and examining the distribution's functional dependency on the variables in question, the family of the distribution can usually be determined (which in turn determines the value of the constant). The formula for the distribution's parameters will be expressed in terms of the prior distributions' hyperparameters (which are known constants), but also in terms of expectations of functions of variables in other partitions. Usually these expectations can be simplified into functions of expectations of the variables themselves (i.e. the means); sometimes expectations of squared variables (which can be related to the variance of the variables), or expectations of higher powers (i.e. higher moments) also appear. In most cases, the other variables' distributions will be from known families, and the formulas for the relevant expectations can be looked up. However, those formulas depend on those distributions' parameters, which depend in turn on the expectations about other variables. The result is that the formulas for the parameters of each variable's distributions can be expressed as a series of equations with mutual, nonlinear dependencies among the variables. Usually, it is not possible to solve this system of equations directly. However, as described above, the dependencies suggest a simple iterative algorithm, which in most cases is guaranteed to converge. An example will make this process clearer.
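The overall structure of this iteration can be sketched schematically as follows (a sketch only, with hypothetical function and argument names; each concrete model supplies its own update rules and expectation computations, as in the examples worked out below).

```python
# Structural sketch of the coordinate update scheme described above: each factor's
# parameters are a function of the current expected statistics of the other factors,
# and the factors are revisited in turn until the updates stop changing.

def coordinate_ascent_vb(init_expectations, update_rules, compute_expectations,
                         tol=1e-8, max_iter=1000):
    """Generic fixed-point iteration over the interlocked update equations.

    init_expectations    -- dict of initial guesses for the required expectations
    update_rules         -- dict mapping each factor name to a function that returns
                            that factor's parameters given the current expectations
    compute_expectations -- function returning the expectations implied by the
                            current parameters of all factors
    """
    expectations = dict(init_expectations)
    params = {}
    for _ in range(max_iter):
        for name, rule in update_rules.items():
            params[name] = rule(expectations)      # update one factor at a time
        new_expectations = compute_expectations(params)
        delta = max(abs(new_expectations[k] - expectations[k]) for k in new_expectations)
        expectations = new_expectations
        if delta < tol:                            # converged to a local optimum
            break
    return params
```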
The following theorem is referred to as a duality formula for variational inference.[3] It explains some important properties of the variational distributions used in variational Bayes methods.
Consider two probability spaces (\Theta, \mathcal{F}, P) and (\Theta, \mathcal{F}, Q) with Q \ll P. Assume that there is a common dominating probability measure \lambda such that P \ll \lambda and Q \ll \lambda. Let h denote any real-valued random variable on (\Theta, \mathcal{F}, P) that satisfies h \in L_1(P). Then the following equality holds:
\log E_P[\exp h] = \sup_{Q} \left\{ E_Q[h] - D_{KL}(Q \parallel P) \right\}.
Further, the supremum on the right-hand side is attained if and only if it holds
\frac{q(\theta)}{p(\theta)} = \frac{\exp h(\theta)}{E_P[\exp h]},
almost surely with respect to the probability measure Q, where p(\theta) = dP/d\lambda and q(\theta) = dQ/d\lambda denote the Radon–Nikodym derivatives of the probability measures P and Q with respect to \lambda, respectively.
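The duality formula can be illustrated numerically on a finite state space, where the maximizing Q is the exponentially tilted version of P (a sketch assuming NumPy; the particular P and h used here are arbitrary).

```python
# Finite-state numeric check of log E_P[exp h] = sup_Q { E_Q[h] - D_KL(Q || P) },
# with the supremum attained at q(theta) proportional to p(theta) * exp(h(theta)).
import numpy as np

rng = np.random.default_rng(0)
p = rng.dirichlet(np.ones(4))            # base probability measure P on 4 states
h = rng.normal(size=4)                   # a real-valued random variable h(theta)

lhs = np.log(np.sum(p * np.exp(h)))      # log E_P[exp h]

q_opt = p * np.exp(h)
q_opt /= q_opt.sum()                     # the maximizing distribution (tilted P)
rhs = np.sum(q_opt * h) - np.sum(q_opt * np.log(q_opt / p))

print(lhs, rhs)                          # the two values coincide

# Any other Q gives a value of E_Q[h] - D_KL(Q || P) no larger than the left side:
q_other = rng.dirichlet(np.ones(4))
other = np.sum(q_other * h) - np.sum(q_other * np.log(q_other / p))
print(other <= lhs)                      # True
```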
Consider a simple non-hierarchical Bayesian model consisting of a set of i.i.d. observations from a Gaussian distribution, with unknown mean and variance.[5] In the following, we work through this model in great detail to illustrate the workings of the variational Bayes method.
For mathematical convenience, in the following example we work in terms of the precision — i.e. the reciprocal of the variance (or in a multivariate Gaussian, the inverse of the covariance matrix) — rather than the variance itself. (From a theoretical standpoint, precision and variance are equivalent since there is a one-to-one correspondence between the two.)
We place conjugate prior distributions on the unknown mean \mu and precision \tau, i.e. the mean follows a Gaussian distribution and the precision follows a gamma distribution. In other words:
\begin{align} \tau &\sim \operatorname{Gamma}(a_0, b_0) \\ \mu \mid \tau &\sim \mathcal{N}(\mu_0, (\lambda_0 \tau)^{-1}) \\ \{x_1, \ldots, x_N\} &\sim \mathcal{N}(\mu, \tau^{-1}) \\ N &= \text{number of data points} \end{align}
The hyperparameters \mu_0, \lambda_0, a_0 and b_0 are fixed, given values. They can be set to small positive numbers to give broad prior distributions, indicating ignorance about the prior values of \mu and \tau.
We are given N data points X = \{x_1, \ldots, x_N\}, and our goal is to infer the posterior distribution q(\mu, \tau) = p(\mu, \tau \mid x_1, \ldots, x_N) of the parameters \mu and \tau.
The joint probability of all variables can be rewritten as
p(X,\mu,\tau)=p(X\mid\mu,\tau)p(\mu\mid\tau)p(\tau)
where the individual factors are
\begin{align} p(X \mid \mu, \tau) &= \prod_{n=1}^{N} \mathcal{N}(x_n \mid \mu, \tau^{-1}) \\ p(\mu \mid \tau) &= \mathcal{N}\left( \mu \mid \mu_0, (\lambda_0 \tau)^{-1} \right) \\ p(\tau) &= \operatorname{Gamma}(\tau \mid a_0, b_0) \end{align}
where
\begin{align} \mathcal{N}(x \mid \mu, \sigma^2) &= \frac{1}{\sqrt{2\pi\sigma^2}} e^{-(x-\mu)^2 / (2\sigma^2)} \\ \operatorname{Gamma}(\tau \mid a, b) &= \frac{1}{\Gamma(a)} b^a \tau^{a-1} e^{-b\tau} \end{align}
Assume that q(\mu, \tau) = q(\mu)\, q(\tau), i.e. that the posterior distribution factorizes into independent factors for \mu and \tau. This type of assumption underlies the variational Bayesian method.
Then
\begin{align} \ln q_\mu^{*}(\mu) &= \operatorname{E}_{\tau}\left[ \ln p(X \mid \mu, \tau) + \ln p(\mu \mid \tau) + \ln p(\tau) \right] + C \\ &= \operatorname{E}_{\tau}\left[ \ln p(X \mid \mu, \tau) \right] + \operatorname{E}_{\tau}\left[ \ln p(\mu \mid \tau) \right] + \operatorname{E}_{\tau}\left[ \ln p(\tau) \right] + C \\ &= \operatorname{E}_{\tau}\left[ \ln \prod_{n=1}^{N} \mathcal{N}\left( x_n \mid \mu, \tau^{-1} \right) \right] + \operatorname{E}_{\tau}\left[ \ln \mathcal{N}\left( \mu \mid \mu_0, (\lambda_0 \tau)^{-1} \right) \right] + C_2 \\ &= \operatorname{E}_{\tau}\left[ \ln \prod_{n=1}^{N} \sqrt{\frac{\tau}{2\pi}}\, e^{-\frac{(x_n-\mu)^2 \tau}{2}} \right] + \operatorname{E}_{\tau}\left[ \ln \sqrt{\frac{\lambda_0 \tau}{2\pi}}\, e^{-\frac{\lambda_0 \tau (\mu-\mu_0)^2}{2}} \right] + C_2 \\ &= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ \sum_{n=1}^{N} (x_n-\mu)^2 + \lambda_0 (\mu-\mu_0)^2 \right\} + C_3 \end{align}
In the above derivation, C, C_2 and C_3 refer to values that are constant with respect to \mu. Note that the term \operatorname{E}_{\tau}[\ln p(\tau)] is not a function of \mu and will have the same value regardless of the value of \mu; hence it is absorbed into the constant term in the third line, as are all other terms that do not involve \mu.
The last line is simply a quadratic polynomial in \mu. Since this is the logarithm of q_\mu^{*}(\mu), we can see that q_\mu^{*}(\mu) itself is a Gaussian distribution.
With a certain amount of tedious math (expanding the squares inside the braces, separating out and grouping the terms involving \mu and \mu^2, and completing the square over \mu), we can derive the parameters of the Gaussian distribution:
\begin{align} \ln q_\mu^{*}(\mu) &= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ \sum_{n=1}^{N} (x_n-\mu)^2 + \lambda_0 (\mu-\mu_0)^2 \right\} + C_3 \\ &= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\mu^2 - 2\left( \lambda_0\mu_0 + \sum_{n=1}^{N} x_n \right)\mu + \left( \sum_{n=1}^{N} x_n^2 \right) + \lambda_0\mu_0^2 \right\} + C_3 \\ &= -\frac{\operatorname{E}_{\tau}[\tau]}{2} \left\{ (\lambda_0+N)\mu^2 - 2\left( \lambda_0\mu_0 + \sum_{n=1}^{N} x_n \right)\mu \right\} + C_4 \\ &= -\frac{1}{2} (\lambda_0+N) \operatorname{E}_{\tau}[\tau] \left( \mu - \frac{\lambda_0\mu_0 + \sum_{n=1}^{N} x_n}{\lambda_0+N} \right)^2 + C_5 \end{align}
Note that all of the above steps can be shortened by using the formula for the sum of two quadratics.
In other words:
\begin{align} q_\mu^{*}(\mu) &\sim \mathcal{N}(\mu \mid \mu_N, \lambda_N^{-1}) \\ \mu_N &= \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0 + N} \\ \lambda_N &= (\lambda_0 + N) \operatorname{E}_{\tau}[\tau] \\ \bar{x} &= \frac{1}{N} \sum_{n=1}^{N} x_n \end{align}
The derivation of q_\tau^{*}(\tau) is similar:
\begin{align} \ln q_\tau^{*}(\tau) &= \operatorname{E}_{\mu}\left[ \ln p(X \mid \mu, \tau) + \ln p(\mu \mid \tau) \right] + \ln p(\tau) + \text{constant} \\ &= (a_0-1)\ln\tau - b_0\tau + \frac{1}{2}\ln\tau + \frac{N}{2}\ln\tau - \frac{\tau}{2} \operatorname{E}_{\mu}\left[ \sum_{n=1}^{N} (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2 \right] + \text{constant} \end{align}
Exponentiating both sides, we can see that q_\tau^{*}(\tau) is a gamma distribution. Specifically:
\begin{align} q_\tau^{*}(\tau) &\sim \operatorname{Gamma}(\tau \mid a_N, b_N) \\ a_N &= a_0 + \frac{N+1}{2} \\ b_N &= b_0 + \frac{1}{2} \operatorname{E}_{\mu}\left[ \sum_{n=1}^{N} (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2 \right] \end{align}
Let us recap the conclusions from the previous sections:
\begin{align} q_\mu^{*}(\mu) &\sim \mathcal{N}(\mu \mid \mu_N, \lambda_N^{-1}) \\ \mu_N &= \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0 + N} \\ \lambda_N &= (\lambda_0 + N) \operatorname{E}_{\tau}[\tau] \\ \bar{x} &= \frac{1}{N} \sum_{n=1}^{N} x_n \end{align}
and
\begin{align} q_\tau^{*}(\tau) &\sim \operatorname{Gamma}(\tau \mid a_N, b_N) \\ a_N &= a_0 + \frac{N+1}{2} \\ b_N &= b_0 + \frac{1}{2} \operatorname{E}_{\mu}\left[ \sum_{n=1}^{N} (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2 \right] \end{align}
In each case, the parameters for the distribution over one of the variables depend on expectations taken with respect to the other variable. We can expand the expectations, using the standard formulas for the expectations of moments of the Gaussian and gamma distributions:
\begin{align} \operatorname{E}[\tau \mid a_N, b_N] &= \frac{a_N}{b_N} \\ \operatorname{E}\left[ \mu \mid \mu_N, \lambda_N^{-1} \right] &= \mu_N \\ \operatorname{E}\left[ X^2 \right] &= \operatorname{Var}(X) + (\operatorname{E}[X])^2 \\ \operatorname{E}\left[ \mu^2 \mid \mu_N, \lambda_N^{-1} \right] &= \lambda_N^{-1} + \mu_N^2 \end{align}
Applying these formulas to the above equations is trivial in most cases, but the equation for b_N takes more work:
\begin{align} b_N &= b_0 + \frac{1}{2} \operatorname{E}_{\mu}\left[ \sum_{n=1}^{N} (x_n-\mu)^2 + \lambda_0(\mu-\mu_0)^2 \right] \\ &= b_0 + \frac{1}{2} \operatorname{E}_{\mu}\left[ (\lambda_0+N)\mu^2 - 2\left( \lambda_0\mu_0 + \sum_{n=1}^{N} x_n \right)\mu + \left( \sum_{n=1}^{N} x_n^2 \right) + \lambda_0\mu_0^2 \right] \\ &= b_0 + \frac{1}{2} \left[ (\lambda_0+N)\operatorname{E}_{\mu}[\mu^2] - 2\left( \lambda_0\mu_0 + \sum_{n=1}^{N} x_n \right)\operatorname{E}_{\mu}[\mu] + \left( \sum_{n=1}^{N} x_n^2 \right) + \lambda_0\mu_0^2 \right] \\ &= b_0 + \frac{1}{2} \left[ (\lambda_0+N)\left( \lambda_N^{-1} + \mu_N^2 \right) - 2\left( \lambda_0\mu_0 + \sum_{n=1}^{N} x_n \right)\mu_N + \left( \sum_{n=1}^{N} x_n^2 \right) + \lambda_0\mu_0^2 \right] \end{align}
We can then write the parameter equations as follows, without any expectations:
\begin{align} \mu_N &= \frac{\lambda_0\mu_0 + N\bar{x}}{\lambda_0 + N} \\ \lambda_N &= (\lambda_0 + N)\frac{a_N}{b_N} \\ \bar{x} &= \frac{1}{N} \sum_{n=1}^{N} x_n \\ a_N &= a_0 + \frac{N+1}{2} \\ b_N &= b_0 + \frac{1}{2} \left[ (\lambda_0+N)\left( \lambda_N^{-1} + \mu_N^2 \right) - 2\left( \lambda_0\mu_0 + N\bar{x} \right)\mu_N + \left( \sum_{n=1}^{N} x_n^2 \right) + \lambda_0\mu_0^2 \right] \end{align}
Note that there are circular dependencies among the formulas for \lambda_N and b_N. This naturally suggests an EM-like algorithm (a code sketch follows the list):
1. Compute \sum_{n=1}^{N} x_n and \sum_{n=1}^{N} x_n^2. Use these values to compute \mu_N and a_N.
2. Initialize \lambda_N to some arbitrary value.
3. Use the current value of \lambda_N, along with the known values of the other parameters, to compute b_N.
4. Use the current value of b_N, along with the known values of the other parameters, to compute \lambda_N.
5. Repeat the last two steps until convergence (i.e. until neither value has changed by more than some small amount).
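A minimal Python sketch of this iteration (assuming NumPy; variable names follow the notation above) might look as follows.

```python
# Fixed-point iteration for the variational posterior of a univariate Gaussian
# with unknown mean (mu) and precision (tau), as derived above.
import numpy as np

def variational_gaussian(x, mu_0=0.0, lambda_0=1.0, a_0=1.0, b_0=1.0,
                         tol=1e-10, max_iter=1000):
    """Return the variational posterior parameters (mu_N, lambda_N, a_N, b_N)."""
    x = np.asarray(x, dtype=float)
    N = len(x)
    sum_x = x.sum()
    sum_x2 = np.sum(x ** 2)
    x_bar = sum_x / N

    # mu_N and a_N do not depend on the iteration.
    mu_N = (lambda_0 * mu_0 + N * x_bar) / (lambda_0 + N)
    a_N = a_0 + (N + 1) / 2.0

    lambda_N = 1.0                       # arbitrary initialization
    for _ in range(max_iter):
        # b_N from the current lambda_N ...
        b_N = b_0 + 0.5 * ((lambda_0 + N) * (1.0 / lambda_N + mu_N ** 2)
                           - 2.0 * (lambda_0 * mu_0 + sum_x) * mu_N
                           + sum_x2 + lambda_0 * mu_0 ** 2)
        # ... and lambda_N from the new b_N (using E[tau] = a_N / b_N).
        lambda_N_new = (lambda_0 + N) * a_N / b_N
        if abs(lambda_N_new - lambda_N) < tol:
            lambda_N = lambda_N_new
            break
        lambda_N = lambda_N_new
    return mu_N, lambda_N, a_N, b_N

# Example: data drawn from a Gaussian with mean 2 and standard deviation 0.5;
# mu_N approximates the posterior mean of mu, and a_N / b_N approximates E[tau].
data = np.random.default_rng(1).normal(2.0, 0.5, size=200)
mu_N, lambda_N, a_N, b_N = variational_gaussian(data)
print(mu_N, a_N / b_N)
```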
We then have values for the hyperparameters of the approximating distributions of the posterior parameters, which we can use to compute any properties we want of the posterior — e.g. its mean and variance, a 95% highest-density region (the smallest interval that includes 95% of the total probability), etc.
It can be shown that this algorithm is guaranteed to converge to a local maximum.
Note also that the posterior distributions have the same form as the corresponding prior distributions. We did not assume this; the only assumption we made was that the distributions factorize, and the form of the distributions followed naturally. It turns out (see below) that the fact that the posterior distributions have the same form as the prior distributions is not a coincidence, but a general result whenever the prior distributions are members of the exponential family, which is the case for most of the standard distributions.
The above example shows the method by which the variational-Bayesian approximation to a posterior probability density in a given Bayesian network is derived:
1. Describe the network with a graphical model, identifying the observed variables (data) X and the unobserved variables (parameters \boldsymbol\Theta and latent variables Z) together with their conditional probability distributions. Variational Bayes then constructs an approximation to the posterior probability p(Z, \boldsymbol\Theta \mid X), which by construction is a factorized distribution, i.e. a product of independent distributions over disjoint subsets of the unobserved variables.
2. Partition the unobserved variables into two or more subsets Z_1, \ldots, Z_M, over which the independent factors will be derived. Typically the first split separates the parameters from the latent variables; often this alone is enough to produce a tractable result.
3. For a given partition Z_j, write down the formula for the best approximating distribution q_j^{*}(Z_j \mid X) using the basic equation \ln q_j^{*}(Z_j \mid X) = \operatorname{E}_{i \neq j}[\ln p(Z, X)] + \text{constant}.
4. Fill in the formula for the joint probability distribution using the graphical model. Any component conditional distributions that do not involve any of the variables in Z_j can be ignored; they will be folded into the constant term.
5. Simplify the formula and apply the expectation operator. Ideally this simplifies into expectations of basic functions of variables not in Z_j (e.g. first or second raw moments, expectation of a logarithm, etc.), which are constants with respect to the variables in the current partition.
6. Recognize the functional form of the result with respect to the variables in Z_j as the (unnormalized) density of a known distribution, reinstate the normalization constant, and read off equations for the distribution's parameters in terms of expectations of variables in other partitions.
7. Repeat for all partitions; the result is a set of mutually linked equations, which is solved with an EM-like iteration that updates each parameter in turn until convergence.
Due to all of the mathematical manipulations involved, it is easy to lose track of the big picture. The important things are:
Variational Bayes (VB) is often compared with expectation–maximization (EM). The actual numerical procedure is quite similar, in that both are alternating iterative procedures that successively converge on optimum parameter values. The initial steps to derive the respective procedures are also vaguely similar, both starting out with formulas for probability densities and both involving significant amounts of mathematical manipulations.
However, there are a number of differences. Most important is what is being computed: EM computes a point estimate of the parameters (possibly alongside a distribution over the latent variables), whereas variational Bayes computes an approximation to the full posterior distribution over all unobserved variables, parameters included.
Imagine a Bayesian Gaussian mixture model described as follows:[2]
\begin{align} \pi &\sim \operatorname{SymDir}(K, \alpha_0) \\ \Lambda_{i=1 \ldots K} &\sim \mathcal{W}(W_0, \nu_0) \\ \mu_{i=1 \ldots K} &\sim \mathcal{N}(\mu_0, (\beta_0 \Lambda_i)^{-1}) \\ z_{[i=1 \ldots N]} &\sim \operatorname{Mult}(1, \pi) \\ x_{i=1 \ldots N} &\sim \mathcal{N}(\mu_{z_i}, \Lambda_{z_i}^{-1}) \end{align}
Note: \operatorname{SymDir}() is the symmetric Dirichlet distribution of dimension K with hyperparameter \alpha_0 for each component; it is the conjugate prior of the categorical (multinomial) distribution. \mathcal{W}() is the Wishart distribution, the conjugate prior of the precision matrix (inverse covariance matrix) of a multivariate Gaussian. \operatorname{Mult}(1, \pi) is a multinomial distribution over a single observation (equivalently, a categorical distribution); the state space is a "one-of-K" representation, i.e. a K-dimensional vector in which one element is 1 and the rest are 0. \mathcal{N}() is, in this case, the multivariate Gaussian distribution.
The interpretation of the above variables is as follows:
X = \{x_1, \ldots, x_N\} is the set of N data points, each of which is a D-dimensional vector distributed according to a multivariate Gaussian distribution. Z = \{z_1, \ldots, z_N\} is a set of latent variables, one per data point, specifying which mixture component the corresponding data point belongs to, using the "one-of-K" vector representation with components z_{nk} for k = 1 \ldots K. \pi is the vector of mixing proportions for the K mixture components. \mu_{i=1 \ldots K} and \Lambda_{i=1 \ldots K} specify the parameters (mean and precision) associated with each mixture component.
The joint probability of all variables can be rewritten as
p(X, Z, \pi, \mu, \Lambda) = p(X \mid Z, \mu, \Lambda)\, p(Z \mid \pi)\, p(\pi)\, p(\mu \mid \Lambda)\, p(\Lambda)
where the individual factors are
\begin{align} p(X \mid Z, \mu, \Lambda) &= \prod_{n=1}^{N} \prod_{k=1}^{K} \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1})^{z_{nk}} \\ p(Z \mid \pi) &= \prod_{n=1}^{N} \prod_{k=1}^{K} \pi_k^{z_{nk}} \\ p(\pi) &= \frac{\Gamma(K\alpha_0)}{\Gamma(\alpha_0)^K} \prod_{k=1}^{K} \pi_k^{\alpha_0 - 1} \\ p(\mu \mid \Lambda) &= \prod_{k=1}^{K} \mathcal{N}(\mu_k \mid \mu_0, (\beta_0 \Lambda_k)^{-1}) \\ p(\Lambda) &= \prod_{k=1}^{K} \mathcal{W}(\Lambda_k \mid W_0, \nu_0) \end{align}
where
\begin{align} \mathcal{N}(x \mid \mu, \Sigma) &= \frac{1}{(2\pi)^{D/2}} \frac{1}{|\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x-\mu)^{\rm T} \Sigma^{-1} (x-\mu) \right\} \\ \mathcal{W}(\Lambda \mid W, \nu) &= B(W, \nu)\, |\Lambda|^{(\nu-D-1)/2} \exp\left( -\frac{1}{2} \operatorname{Tr}(W^{-1}\Lambda) \right) \\ B(W, \nu) &= |W|^{-\nu/2} \left\{ 2^{\nu D/2} \pi^{D(D-1)/4} \prod_{i=1}^{D} \Gamma\left( \frac{\nu+1-i}{2} \right) \right\}^{-1} \\ D &= \text{dimensionality of each data point} \end{align}
Assume that q(Z, \pi, \mu, \Lambda) = q(Z)\, q(\pi, \mu, \Lambda), i.e. that the variational distribution factorizes into a factor over the latent variables and a factor over the parameters.
Then[2]
\begin{align} \ln q^{*}(Z) &= \operatorname{E}_{\pi,\mu,\Lambda}[\ln p(X, Z, \pi, \mu, \Lambda)] + \text{constant} \\ &= \operatorname{E}_{\pi}[\ln p(Z \mid \pi)] + \operatorname{E}_{\mu,\Lambda}[\ln p(X \mid Z, \mu, \Lambda)] + \text{constant} \\ &= \sum_{n=1}^{N} \sum_{k=1}^{K} z_{nk} \ln \rho_{nk} + \text{constant} \end{align}
where we have defined
\ln \rho_{nk} = \operatorname{E}[\ln \pi_k] + \frac{1}{2} \operatorname{E}[\ln |\Lambda_k|] - \frac{D}{2} \ln(2\pi) - \frac{1}{2} \operatorname{E}_{\mu_k, \Lambda_k}\left[ (x_n - \mu_k)^{\rm T} \Lambda_k (x_n - \mu_k) \right]
Exponentiating both sides of the formula for \ln q^{*}(Z) yields
q^{*}(Z) \propto \prod_{n=1}^{N} \prod_{k=1}^{K} \rho_{nk}^{z_{nk}}
Requiring that this be normalized amounts to normalizing the \rho_{nk} over k, yielding
q^{*}(Z) = \prod_{n=1}^{N} \prod_{k=1}^{K} r_{nk}^{z_{nk}}
where
r_{nk} = \frac{\rho_{nk}}{\sum_{j=1}^{K} \rho_{nj}}
In other words, q^{*}(Z) factors over the individual z_n, each of which is distributed as a single-observation multinomial (categorical) distribution with parameters r_{nk} for k = 1 \ldots K.
Furthermore, we note that
\operatorname{E}[z_{nk}] = r_{nk}
which is a standard result for categorical distributions.
Now, considering the factor q(\pi, \mu, \Lambda), note that it automatically factors into q(\pi) \prod_{k=1}^{K} q(\mu_k, \Lambda_k) due to the structure of the graphical model defining our Gaussian mixture model, as specified above.
Then,
\begin{align} \ln q^{*}(\pi) &= \ln p(\pi) + \operatorname{E}_{Z}[\ln p(Z \mid \pi)] + \text{constant} \\ &= (\alpha_0 - 1) \sum_{k=1}^{K} \ln \pi_k + \sum_{n=1}^{N} \sum_{k=1}^{K} r_{nk} \ln \pi_k + \text{constant} \end{align}
Taking the exponential of both sides, we recognize q^{*}(\pi) as a Dirichlet distribution
q^{*}(\pi) \sim \operatorname{Dir}(\alpha)
where
\alpha_k = \alpha_0 + N_k, \qquad N_k = \sum_{n=1}^{N} r_{nk}
Finally,
\ln q^{*}(\mu_k, \Lambda_k) = \ln p(\mu_k, \Lambda_k) + \sum_{n=1}^{N} \operatorname{E}[z_{nk}] \ln \mathcal{N}(x_n \mid \mu_k, \Lambda_k^{-1}) + \text{constant}
Grouping and reading off terms involving \mu_k and \Lambda_k, the result is a Gaussian-Wishart distribution given by
q^{*}(\mu_k, \Lambda_k) = \mathcal{N}(\mu_k \mid m_k, (\beta_k \Lambda_k)^{-1})\, \mathcal{W}(\Lambda_k \mid W_k, \nu_k)
given the definitions
\begin{align} \beta_k &= \beta_0 + N_k \\ m_k &= \frac{1}{\beta_k} (\beta_0 \mu_0 + N_k \bar{x}_k) \\ W_k^{-1} &= W_0^{-1} + N_k S_k + \frac{\beta_0 N_k}{\beta_0 + N_k} (\bar{x}_k - \mu_0)(\bar{x}_k - \mu_0)^{\rm T} \\ \nu_k &= \nu_0 + N_k \\ N_k &= \sum_{n=1}^{N} r_{nk} \\ \bar{x}_k &= \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} x_n \\ S_k &= \frac{1}{N_k} \sum_{n=1}^{N} r_{nk} (x_n - \bar{x}_k)(x_n - \bar{x}_k)^{\rm T} \end{align}
Finally, notice that these functions require the values of r_{nk}, which make use of \rho_{nk}, which is defined in turn in terms of \operatorname{E}[\ln \pi_k], \operatorname{E}[\ln |\Lambda_k|] and \operatorname{E}_{\mu_k, \Lambda_k}\left[ (x_n - \mu_k)^{\rm T} \Lambda_k (x_n - \mu_k) \right]. Once we have determined the distributions over which these expectations are taken, we can derive them as:
\begin{align} \operatorname{E}_{\mu_k, \Lambda_k}\left[ (x_n - \mu_k)^{\rm T} \Lambda_k (x_n - \mu_k) \right] &= D\beta_k^{-1} + \nu_k (x_n - m_k)^{\rm T} W_k (x_n - m_k) \\ \ln \widetilde{\Lambda}_k &\equiv \operatorname{E}[\ln |\Lambda_k|] = \sum_{i=1}^{D} \psi\left( \frac{\nu_k + 1 - i}{2} \right) + D \ln 2 + \ln |W_k| \\ \ln \widetilde{\pi}_k &\equiv \operatorname{E}\left[\ln \pi_k\right] = \psi(\alpha_k) - \psi\left( \sum_{i=1}^{K} \alpha_i \right) \end{align}
These results lead to
r_{nk} \propto \widetilde{\pi}_k\, \widetilde{\Lambda}_k^{1/2} \exp\left\{ -\frac{D}{2\beta_k} - \frac{\nu_k}{2} (x_n - m_k)^{\rm T} W_k (x_n - m_k) \right\}
These can be converted from proportional to absolute values by normalizing over k so that they sum to 1.
Note that:
1. The update equations for the parameters \beta_k, m_k, W_k and \nu_k of the variables \mu_k and \Lambda_k depend on the statistics N_k, \bar{x}_k and S_k, and these statistics in turn depend on r_{nk}.
2. The update equations for the parameters \alpha_{1 \ldots K} of the variable \pi depend on the statistic N_k, which depends in turn on r_{nk}.
3. The update equation for r_{nk} itself depends directly on \beta_k, m_k, W_k and \nu_k, and indirectly on \alpha_{1 \ldots K}, W_k and \nu_k through \widetilde{\pi}_k and \widetilde{\Lambda}_k.
This suggests an iterative procedure that alternates between two steps (a code sketch is given below):
1. An E-step that computes the value of r_{nk} using the current values of all the other parameters.
2. An M-step that uses the new value of r_{nk} to compute new values of all the other parameters.
Note that these steps correspond closely with the standard EM algorithm for deriving a maximum likelihood or maximum a posteriori (MAP) solution for the parameters of a Gaussian mixture model. The responsibilities r_{nk} in the E step correspond closely to the posterior probabilities of the latent variables given the data, i.e. p(Z \mid X), and the computation of the statistics N_k, \bar{x}_k and S_k corresponds closely to the computation of the corresponding "soft-count" statistics over the data in EM.
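A condensed Python sketch of the two steps (assuming NumPy and SciPy for the digamma function; the function names e_step and m_step are illustrative, and the data matrix X is N x D) might look as follows.

```python
# One round of the variational E-step and M-step for the Bayesian Gaussian
# mixture model, with notation following the text.
import numpy as np
from scipy.special import digamma

def e_step(X, alpha, beta, m, W, nu):
    """Compute responsibilities r[n, k] from the current variational parameters."""
    N, D = X.shape
    K = len(alpha)
    log_pi_tilde = digamma(alpha) - digamma(alpha.sum())                 # E[ln pi_k]
    log_lambda_tilde = np.array([                                        # E[ln |Lambda_k|]
        digamma(0.5 * (nu[k] + 1 - np.arange(1, D + 1))).sum()
        + D * np.log(2.0) + np.linalg.slogdet(W[k])[1]
        for k in range(K)])
    log_rho = np.empty((N, K))
    for k in range(K):
        diff = X - m[k]                                                  # (N, D)
        quad = nu[k] * np.einsum('ni,ij,nj->n', diff, W[k], diff)        # Mahalanobis part
        expected_quad = D / beta[k] + quad                               # E[(x-mu)^T Lambda (x-mu)]
        log_rho[:, k] = (log_pi_tilde[k] + 0.5 * log_lambda_tilde[k]
                         - 0.5 * D * np.log(2 * np.pi) - 0.5 * expected_quad)
    log_rho -= log_rho.max(axis=1, keepdims=True)                        # normalize in log space
    r = np.exp(log_rho)
    return r / r.sum(axis=1, keepdims=True)

def m_step(X, r, alpha0, beta0, m0, W0_inv, nu0):
    """Update (alpha, beta, m, W, nu) from the responsibilities."""
    N, D = X.shape
    K = r.shape[1]
    Nk = r.sum(axis=0) + 1e-10                                           # soft counts N_k
    x_bar = (r.T @ X) / Nk[:, None]                                      # x-bar_k
    alpha = alpha0 + Nk
    beta = beta0 + Nk
    nu = nu0 + Nk
    m = (beta0 * m0 + Nk[:, None] * x_bar) / beta[:, None]
    W = np.empty((K, D, D))
    for k in range(K):
        diff = X - x_bar[k]
        Sk = (r[:, k, None, None] * np.einsum('ni,nj->nij', diff, diff)).sum(axis=0) / Nk[k]
        d0 = (x_bar[k] - m0)[:, None]
        Wk_inv = W0_inv + Nk[k] * Sk + (beta0 * Nk[k] / (beta0 + Nk[k])) * (d0 @ d0.T)
        W[k] = np.linalg.inv(Wk_inv)
    return alpha, beta, m, W, nu
```

Starting from random responsibilities (or a k-means assignment), the two functions are applied alternately until the responsibilities stop changing; the converged parameters define the variational posterior factors derived above.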
Note that in the previous example, once the distribution over unobserved variables was assumed to factorize into distributions over the "parameters" and distributions over the "latent data", the derived "best" distribution for each variable was in the same family as the corresponding prior distribution over the variable. This is a general result that holds true for all prior distributions derived from the exponential family.