Evidence lower bound explained
In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.
The ELBO is useful because it guarantees a worst case for the log-likelihood of a distribution (e.g. $p(X)$) that models a set of data. The actual log-likelihood may be higher (indicating an even better fit to the distribution) because the ELBO includes a Kullback-Leibler divergence (KL divergence) term, which lowers the ELBO when an internal part of the model is inaccurate even though the model fits well overall. Improving the ELBO score therefore indicates an improvement in the likelihood of the model $p(X)$, in the fit of a component internal to the model, or in both. This makes the ELBO a good training objective (its negative can serve as a loss function), e.g., for training a deep neural network to improve both the overall model and the internal component. (The internal component is $q_\phi(\cdot \mid x)$, defined in detail later in this article.)
Definition
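Let $X$ and $Z$ be random variables, jointly distributed with distribution $p_\theta$. For example, $p_\theta(X)$ is the marginal distribution of $X$, and $p_\theta(Z \mid X)$ is the conditional distribution of $Z$ given $X$. Then, for a sample $x$ of $X$, and any distribution $q_\phi$, the ELBO is defined as
$$L(\phi, \theta; x) := \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$
The ELBO can equivalently be written as
$$\begin{aligned} L(\phi, \theta; x) &= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln p_\theta(x, z)\right] + H[q_\phi(z \mid x)] \\ &= \ln p_\theta(x) - D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)). \end{aligned}$$
[2]
In the first line, $H[q_\phi(z \mid x)]$ is the entropy of $q_\phi$, which relates the ELBO to the Helmholtz free energy.[3] In the second line, $\ln p_\theta(x)$ is called the evidence for $x$, and $D_{KL}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$ is the Kullback-Leibler divergence between $q_\phi$ and $p_\theta$. Since the Kullback-Leibler divergence is non-negative, $L(\phi, \theta; x)$ forms a lower bound on the evidence (the ELBO inequality):
$$\ln p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$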
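The equivalence of these expressions can be checked numerically on a small discrete model. The following is a minimal sketch, assuming a latent variable with three values and a single fixed observation; the numbers in `prior`, `likelihood` and `q` are made up for illustration and are not taken from any reference.

```python
import math

# Hypothetical toy model: Z takes 3 values, and x is one fixed observation.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p_theta(x | z) evaluated at the observed x
q = [0.2, 0.3, 0.5]              # an arbitrary q_phi(z | x)

joint = [pz * px_z for pz, px_z in zip(prior, likelihood)]  # p_theta(x, z)
evidence = sum(joint)                                       # p_theta(x)
posterior = [j / evidence for j in joint]                   # p_theta(z | x)

def expect(values):
    """Expectation of `values` under q."""
    return sum(qz * v for qz, v in zip(q, values))

entropy_q = -expect([math.log(qz) for qz in q])
kl_q_post = expect([math.log(qz / pz) for qz, pz in zip(q, posterior)])

# The three expressions for the ELBO given in this section.
elbo_definition = expect([math.log(j / qz) for j, qz in zip(joint, q)])
elbo_entropy = expect([math.log(j) for j in joint]) + entropy_q
elbo_evidence = math.log(evidence) - kl_q_post

print(elbo_definition, elbo_entropy, elbo_evidence, math.log(evidence))
```

The first three printed values agree up to rounding, and none of them exceeds the last one, the evidence $\ln p_\theta(x)$.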
Motivation
Variational Bayesian inference
Suppose we have an observable random variable $X$, and we want to find its true distribution $p^*$. This would allow us to generate data by sampling, and estimate probabilities of future events. In general, it is impossible to find $p^*$ exactly, forcing us to search for a good approximation.
That is, we define a sufficiently large parametric family $\{p_\theta\}_{\theta \in \Theta}$ of distributions, then solve for $\min_\theta L(p_\theta, p^*)$ for some loss function $L$. One possible way to solve this is by considering a small variation from $p_\theta$ to $p_{\theta + \delta\theta}$, and solving $L(p_\theta, p^*) - L(p_{\theta + \delta\theta}, p^*) = 0$. This is a problem in the calculus of variations, thus it is called the variational method.
Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider implicitly parametrized probability distributions:
- First, define a simple distribution $p(z)$ over a latent random variable $Z$. Usually a normal distribution or a uniform distribution suffices.
- Next, define a family of complicated functions $f_\theta$ (such as a deep neural network) parametrized by $\theta$.
- Finally, define a way to convert any $f_\theta(z)$ into a simple distribution over the observable random variable $X$. For example, let $f_\theta(z) = (f_1(z), f_2(z))$ have two outputs, then we can define the corresponding distribution over $X$ to be the normal distribution $\mathcal{N}(f_1(z), e^{f_2(z)})$.
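This defines a family of joint distributions $p_\theta$ over $(X, Z)$. It is very easy to sample $(x, z) \sim p_\theta$: simply sample $z \sim p$, then compute $f_\theta(z)$, and finally sample $x \sim p_\theta(\cdot \mid z)$ using $f_\theta(z)$.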
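The three sampling steps can be sketched in a few lines. This is only an illustration under stated assumptions: `f_theta` is a hypothetical two-output function standing in for a deep neural network, the prior is taken to be a standard normal, the second output is read as a log-variance, and the parameter values in `theta` are arbitrary.

```python
import math
import random

def f_theta(z, theta):
    """Hypothetical stand-in for a neural network: returns (f1(z), f2(z))."""
    w1, b1, w2, b2 = theta
    return math.tanh(w1 * z + b1), w2 * z + b2

def sample_joint(theta, rng):
    """Draw (x, z) from p_theta by sampling z first, then x given z."""
    z = rng.gauss(0.0, 1.0)                      # z ~ p(z), assumed standard normal
    f1, f2 = f_theta(z, theta)                   # compute f_theta(z)
    x = rng.gauss(f1, math.sqrt(math.exp(f2)))   # x ~ N(f1, e^{f2}), with e^{f2} as variance
    return x, z

rng = random.Random(0)
theta = (1.5, 0.0, 0.1, -2.0)   # illustrative parameter values
samples = [sample_joint(theta, rng) for _ in range(5)]
for x, z in samples:
    print(f"z = {z:+.3f}, x = {x:+.3f}")
```

Sampling is cheap because each step only needs a draw from a simple distribution and one evaluation of `f_theta`.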
In other words, we have a generative model for both the observable and the latent. Now, we consider a distribution $p_\theta$ good if it is a close approximation of $p^*$:
$$p_\theta(X) \approx p^*(X)$$
Since the distribution on the right side is over $X$ only, the distribution on the left side must marginalize the latent variable $Z$ away.
In general, it's impossible to perform the integral $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, forcing us to perform another approximation.
Since $p_\theta(x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}$ (Bayes' Rule), it suffices to find a good approximation of $p_\theta(z \mid x)$. So define another distribution family $q_\phi(z \mid x)$ and use it to approximate $p_\theta(z \mid x)$. This is a discriminative model for the latent.
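The entire situation is summarized in the following table:
| $X$: observable | $X, Z$ | $Z$: latent |
|---|---|---|
| $p^*(x) \approx p_\theta(x) \approx \frac{p_\theta(x \mid z)\, p(z)}{q_\phi(z \mid x)}$, approximable | | $p(z)$, easy |
| | $p_\theta(x \mid z)\, p(z)$, easy | |
| $p_\theta(z \mid x) \approx q_\phi(z \mid x)$, approximable | | $p_\theta(x \mid z)$, easy |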
In Bayesian language, $X$ is the observed evidence, and $Z$ is the latent/unobserved. The distribution $p$ over $Z$ is the prior distribution over $Z$, $p_\theta(x \mid z)$ is the likelihood function, and $p_\theta(z \mid x)$ is the posterior distribution over $Z$.
Given an observation $x$, we can infer what $z$ likely gave rise to $x$ by computing $p_\theta(z \mid x)$. The usual Bayesian method is to estimate the integral $p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz$, then compute by Bayes' rule $p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$. This is expensive to perform in general, but if we can simply find a good approximation $q_\phi(z \mid x) \approx p_\theta(z \mid x)$ for most $x, z$, then we can infer $z$ from $x$ cheaply. Thus, the search for a good $q_\phi$ is also called amortized inference.
All in all, we have arrived at a problem of variational Bayesian inference.
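Amortized inference can be illustrated with a model whose posterior is known in closed form. The sketch below assumes a prior $z \sim \mathcal{N}(0, 1)$ and a likelihood $x \mid z \sim \mathcal{N}(z, 1)$, for which the exact posterior is $\mathcal{N}(x/2, 1/2)$. The inference model `amortized_q` and both parameter settings are hypothetical and chosen only for illustration.

```python
import math

def exact_posterior(x):
    """Posterior of z for prior N(0, 1) and likelihood x | z ~ N(z, 1)."""
    return x / 2.0, 0.5          # (mean, variance)

def amortized_q(x, phi):
    """Hypothetical inference model q_phi(z | x) = N(a * x + b, s2)."""
    a, b, s2 = phi
    return a * x + b, s2

def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for scalar Gaussians, v = variance."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def posterior_gap(x, phi):
    """KL from the approximate posterior to the exact one at observation x."""
    return kl_gauss(*amortized_q(x, phi), *exact_posterior(x))

good_phi = (0.5, 0.0, 0.5)   # reproduces the exact posterior for every x
poor_phi = (0.4, 0.1, 1.0)   # a deliberately mismatched setting

for x in (-2.0, 0.0, 3.0):
    print(x, posterior_gap(x, good_phi), posterior_gap(x, poor_phi))
```

A single parameter vector `phi` serves every observation, so once it is fitted, inferring $z$ from a new $x$ costs one function evaluation instead of a fresh integral.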
Deriving the ELBO
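A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}(p^*(x) \,\|\, p_\theta(x))$$
where $H(p^*) = -\mathbb{E}_{x \sim p^*}[\ln p^*(x)]$ is the entropy of the true distribution. So if we can maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we can minimize $D_{KL}(p^*(x) \,\|\, p_\theta(x))$, and consequently find an accurate approximation $p_\theta \approx p^*$.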
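The identity follows in one step by adding and subtracting $\ln p^*(x)$ inside the expectation. As a short worked derivation, using only the definitions of entropy and KL-divergence:

$$\begin{aligned} \mathbb{E}_{x \sim p^*}[\ln p_\theta(x)] &= \mathbb{E}_{x \sim p^*}[\ln p^*(x)] - \mathbb{E}_{x \sim p^*}\left[\ln \frac{p^*(x)}{p_\theta(x)}\right] \\ &= -H(p^*) - D_{KL}(p^*(x) \,\|\, p_\theta(x)). \end{aligned}$$

Since $H(p^*)$ does not depend on $\theta$, the expected log-likelihood and the KL-divergence differ only by a constant and a sign.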
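To maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we simply sample many $x_i \sim p^*(x)$ and replace the expectation with a sample average, i.e. use a Monte Carlo estimate
$$N \max_\theta \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] \approx \max_\theta \sum_i \ln p_\theta(x_i)$$
where $N$ is the number of samples drawn from the true distribution. Because the sum runs over a finite sample, maximizing it exactly can overfit to that sample.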
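In order to maximize $\sum_i \ln p_\theta(x_i)$, it's necessary to find $\ln p_\theta(x)$:
$$\ln p_\theta(x) = \ln \int p_\theta(x \mid z)\, p(z)\, dz$$
This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:
$$\int p_\theta(x \mid z)\, p(z)\, dz = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
where $q_\phi(z \mid x)$ is a sampling distribution over $z$ that we use to perform the Monte Carlo integration.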
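So we see that if we sample $z \sim q_\phi(\cdot \mid x)$, then $\frac{p_\theta(x, z)}{q_\phi(z \mid x)}$ is an unbiased estimator of $p_\theta(x)$. Unfortunately, this does not give us an unbiased estimator of $\ln p_\theta(x)$, because $\ln$ is nonlinear. Indeed, we have by Jensen's inequality,
$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$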
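Both effects, the unbiased estimate of $p_\theta(x)$ and the downward bias after taking logarithms, can be observed in a simulation. The following sketch assumes a prior $z \sim \mathcal{N}(0, 1)$ and a likelihood $x \mid z \sim \mathcal{N}(z, 1)$, so that the exact marginal is $x \sim \mathcal{N}(0, 2)$. The observation `x`, the proposal parameters and the sample size are arbitrary illustrative choices.

```python
import math
import random

def log_normal_pdf(v, mean, var):
    """Log-density of N(mean, var) at v."""
    return -0.5 * (math.log(2 * math.pi * var) + (v - mean) ** 2 / var)

x = 1.0                      # the fixed observation
q_mean, q_var = 0.3, 1.0     # a deliberately imperfect proposal q(z | x)

def log_weight(z):
    """ln[ p(x, z) / q(z | x) ] for prior N(0, 1) and likelihood N(z, 1)."""
    log_joint = log_normal_pdf(z, 0.0, 1.0) + log_normal_pdf(x, z, 1.0)
    return log_joint - log_normal_pdf(z, q_mean, q_var)

rng = random.Random(0)
log_w = [log_weight(rng.gauss(q_mean, math.sqrt(q_var))) for _ in range(20000)]

estimate_px = sum(math.exp(lw) for lw in log_w) / len(log_w)  # average of the ratio
mean_log_w = sum(log_w) / len(log_w)                          # average of its logarithm

true_px = math.exp(log_normal_pdf(x, 0.0, 2.0))   # exact marginal: x ~ N(0, 2)
print("p(x) estimate:", estimate_px, "exact:", true_px)
print("mean log-weight:", mean_log_w, "exact ln p(x):", math.log(true_px))
```

The average of the ratios lands close to the exact $p_\theta(x)$, while the average of their logarithms settles visibly below $\ln p_\theta(x)$.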
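In fact, all the obvious estimators of $\ln p_\theta(x)$ are biased downwards, because no matter how many samples $z_i \sim q_\phi(\cdot \mid x)$ we take, we have by Jensen's inequality:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N}\sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right)\right] \le \ln \mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\frac{1}{N}\sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right] = \ln p_\theta(x)$$
Subtracting the right side, we see that the problem comes down to a biased estimator of zero:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N}\sum_i \frac{p_\theta(z_i \mid x)}{q_\phi(z_i \mid x)}\right)\right] \le 0$$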
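For a discrete latent variable the expectation of the $N$-sample estimator can be computed exactly by enumerating all tuples $(z_1, \dots, z_N)$. The sketch below does this for a hypothetical three-valued latent; the tables `joint` and `q` are illustrative numbers only.

```python
import itertools
import math

# Hypothetical discrete toy model: a 3-valued latent and one fixed observation x.
joint = [0.05, 0.12, 0.14]   # p_theta(x, z) for z = 0, 1, 2
q = [0.6, 0.3, 0.1]          # a poorly matched q_phi(z | x)
log_px = math.log(sum(joint))

def multi_sample_bound(n):
    """Exact E[ ln( (1/n) * sum_i p(x, z_i) / q(z_i | x) ) ] with z_1..z_n i.i.d. from q."""
    total = 0.0
    for zs in itertools.product(range(len(q)), repeat=n):
        prob = math.prod(q[z] for z in zs)                 # probability of this tuple
        avg_weight = sum(joint[z] / q[z] for z in zs) / n  # averaged importance weight
        total += prob * math.log(avg_weight)
    return total

bounds = [multi_sample_bound(n) for n in (1, 2, 4, 8)]
print(bounds, log_px)
```

Every value stays below $\ln p_\theta(x)$, as Jensen's inequality requires. In this toy example the bound also tightens as $N$ grows, which is the property that the importance-weighted autoencoder exploits.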
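At this point, we could branch off towards the development of an importance-weighted autoencoder, but we will instead continue with the simplest case with $N = 1$:
$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right]$$
The tightness of the inequality has a closed form:
$$\ln p_\theta(x) - \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = D_{KL}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)) \ge 0$$
We have thus obtained the ELBO function:
$$L(\phi, \theta; x) := \ln p_\theta(x) - D_{KL}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x))$$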
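The closed form for the gap can be verified in two steps. As a short worked derivation, write $p_\theta(x, z) = p_\theta(z \mid x)\, p_\theta(x)$ and note that $\ln p_\theta(x)$ does not depend on $z$:

$$\begin{aligned} \ln p_\theta(x) - \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] &= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln p_\theta(x) - \ln \frac{p_\theta(z \mid x)\, p_\theta(x)}{q_\phi(z \mid x)}\right] \\ &= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{q_\phi(z \mid x)}{p_\theta(z \mid x)}\right] \\ &= D_{KL}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)). \end{aligned}$$

The gap vanishes exactly when $q_\phi(\cdot \mid x)$ equals the posterior $p_\theta(\cdot \mid x)$.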
Maximizing the ELBO
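For fixed $x$, the optimization $\max_{\theta, \phi} L(\phi, \theta; x)$ simultaneously attempts to maximize $\ln p_\theta(x)$ and minimize $D_{KL}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x))$. If the parametrizations of $p_\theta$ and $q_\phi$ are flexible enough, we would obtain some $\hat\phi, \hat\theta$, such that we have simultaneously
$$\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \quad q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x)$$
Since
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}(p^*(x) \,\|\, p_\theta(x))$$
we have, taking the expectation over $x \sim p^*$,
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_{\hat\theta}(x)] \approx \max_\theta \left(-H(p^*) - D_{KL}(p^*(x) \,\|\, p_\theta(x))\right)$$
and so
$$\hat\theta \approx \arg\min_\theta D_{KL}(p^*(x) \,\|\, p_\theta(x))$$
In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model $p_{\hat\theta} \approx p^*$ and an accurate discriminative model $q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x)$.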
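The effect of maximizing over $\phi$ alone can be seen in a small experiment. The sketch below is a simplification rather than a full training procedure: it holds $\theta$ fixed, so that $p_\theta(x, z)$ is just a hypothetical table of three numbers, parametrizes $q_\phi(\cdot \mid x)$ by softmax logits, and runs plain gradient ascent on the ELBO. The step size and iteration count are arbitrary choices.

```python
import math

# Hypothetical toy model: theta is held fixed, so p_theta(x, z) is a table.
joint = [0.05, 0.12, 0.14]                    # p_theta(x, z) for z = 0, 1, 2
log_px = math.log(sum(joint))                 # evidence ln p_theta(x)
posterior = [j / sum(joint) for j in joint]   # p_theta(z | x)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def elbo(q):
    return sum(qz * math.log(j / qz) for qz, j in zip(q, joint))

phi = [0.0, 0.0, 0.0]                         # logits of q_phi(z | x), initially uniform
for step in range(500):
    q = softmax(phi)
    value = elbo(q)
    # Exact gradient of the ELBO with respect to each softmax logit.
    grad = [qz * (math.log(j / qz) - value) for qz, j in zip(q, joint)]
    phi = [p + 0.5 * g for p, g in zip(phi, grad)]

q = softmax(phi)
print("ELBO:", elbo(q), "ln p(x):", log_px)
print("q:", q, "posterior:", posterior)
```

With $\theta$ fixed, the ELBO climbs until it meets $\ln p_\theta(x)$, and at that point `q` coincides with the exact posterior. Joint optimization over $\theta$ and $\phi$ additionally raises $\ln p_\theta(x)$ itself.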
Main forms
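The ELBO has many possible expressions, each with some different emphasis.
$$\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \int q_\phi(z \mid x) \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\, dz$$
This form shows that if we sample $z \sim q_\phi(\cdot \mid x)$, then $\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}$ is an unbiased estimator of the ELBO.
$$\ln p_\theta(x) - D_{KL}(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x))$$
This form shows that the ELBO is a lower bound on the evidence $\ln p_\theta(x)$, and that maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL-divergence from $p_\theta(\cdot \mid x)$ to $q_\phi(\cdot \mid x)$.
$$\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\ln p_\theta(x \mid z)] - D_{KL}(q_\phi(\cdot \mid x) \,\|\, p)$$
This form shows that maximizing the ELBO simultaneously attempts to keep $q_\phi(\cdot \mid x)$ close to $p$ and concentrate $q_\phi(\cdot \mid x)$ on those $z$ that maximize $\ln p_\theta(x \mid z)$. That is, the approximate posterior $q_\phi(\cdot \mid x)$ balances between staying close to the prior $p$ and moving towards the maximum likelihood $\arg\max_z \ln p_\theta(x \mid z)$.
$$H(q_\phi(\cdot \mid x)) + \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\ln p_\theta(z \mid x)] + \ln p_\theta(x)$$
This form shows that maximizing the ELBO simultaneously attempts to keep the entropy of $q_\phi(\cdot \mid x)$ high, and concentrate $q_\phi(\cdot \mid x)$ on those $z$ that maximize $\ln p_\theta(z \mid x)$. That is, the approximate posterior $q_\phi(\cdot \mid x)$ balances between being a uniform distribution and moving towards the maximum a posteriori $\arg\max_z \ln p_\theta(z \mid x)$.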
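The prior-based and entropy-based forms both follow from the defining expectation by factoring the joint distribution in the two possible ways. As a short worked derivation, using $p_\theta(x, z) = p_\theta(x \mid z)\, p(z)$ for the first and $p_\theta(x, z) = p_\theta(z \mid x)\, p_\theta(x)$ for the second:

$$\begin{aligned} \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] &= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\ln p_\theta(x \mid z)] + \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p(z)}{q_\phi(z \mid x)}\right] \\ &= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\ln p_\theta(x \mid z)] - D_{KL}(q_\phi(\cdot \mid x) \,\|\, p), \\ \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] &= \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[-\ln q_\phi(z \mid x)] + \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\ln p_\theta(z \mid x)] + \ln p_\theta(x) \\ &= H(q_\phi(\cdot \mid x)) + \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}[\ln p_\theta(z \mid x)] + \ln p_\theta(x). \end{aligned}$$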
Data-processing inequality
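Suppose we take $N$ independent samples from $p^*$, and collect them in the dataset $D = \{x_1, \dots, x_N\}$, then we have the empirical distribution $q_D(x) = \frac{1}{N}\sum_i \delta_{x_i}$.
Fitting $p_\theta(x)$ to $q_D(x)$ can be done, as usual, by maximizing the loglikelihood $\ln p_\theta(D)$:
$$D_{KL}(q_D(x) \,\|\, p_\theta(x)) = -\frac{1}{N}\sum_i \ln p_\theta(x_i) - H(q_D) = -\frac{1}{N}\ln p_\theta(D) - H(q_D)$$
Now, by the ELBO inequality, we can bound $\ln p_\theta(D)$, and thus
$$D_{KL}(q_D(x) \,\|\, p_\theta(x)) \le -\frac{1}{N} L(\phi, \theta; D) - H(q_D)$$
The right-hand-side simplifies to a KL-divergence, and so we get:
$$D_{KL}(q_D(x) \,\|\, p_\theta(x)) \le -\frac{1}{N}\sum_i L(\phi, \theta; x_i) - H(q_D) = D_{KL}(q_{D,\phi}(x, z) \,\|\, p_\theta(x, z))$$
where $q_{D,\phi}(x, z) = q_D(x)\, q_\phi(z \mid x)$. This result can be interpreted as a special case of the data processing inequality.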
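The inequality between the marginal and the joint KL-divergence can be checked on a fully discrete example. The sketch below assumes binary $X$ and $Z$ and takes the augmented data distribution to be $q_D(x)\, q_\phi(z \mid x)$. All probability tables are hypothetical, and the average over data points is written as a sum weighted by `q_data`, which is the same quantity when `q_data` holds the empirical frequencies.

```python
import math

# Hypothetical toy setup: X takes 2 values, Z takes 2 values.
q_data = [0.7, 0.3]          # empirical distribution q_D(x)
prior = [0.6, 0.4]           # p(z)
lik = [[0.8, 0.2],           # p_theta(x | z): rows are z, columns are x
       [0.3, 0.7]]
q_enc = [[0.9, 0.1],         # q_phi(z | x): rows are x, columns are z
         [0.2, 0.8]]

def kl(ps, qs):
    """KL divergence between two discrete distributions given as lists."""
    return sum(p * math.log(p / q) for p, q in zip(ps, qs) if p > 0)

# Model joint p_theta(x, z), indexed [x][z], and its marginal p_theta(x).
p_joint = [[prior[z] * lik[z][x] for z in range(2)] for x in range(2)]
p_marg = [sum(row) for row in p_joint]

# Augmented data distribution q_D(x) * q_phi(z | x), indexed [x][z].
q_joint = [[q_data[x] * q_enc[x][z] for z in range(2)] for x in range(2)]

def elbo(x):
    return sum(q_enc[x][z] * math.log(p_joint[x][z] / q_enc[x][z]) for z in range(2))

entropy_data = -sum(p * math.log(p) for p in q_data)
kl_marginal = kl(q_data, p_marg)
kl_joint = kl([v for row in q_joint for v in row], [v for row in p_joint for v in row])
elbo_bound = -sum(q_data[x] * elbo(x) for x in range(2)) - entropy_data

print(kl_marginal, kl_joint, elbo_bound)
```

The last two printed values agree, confirming that the averaged negative ELBO minus the data entropy is a KL-divergence on the joint space, and the first value is no larger than either.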
In this interpretation, maximizing $L(\phi, \theta; D) = \sum_i L(\phi, \theta; x_i)$ is minimizing $D_{KL}(q_{D,\phi}(x, z) \,\|\, p_\theta(x, z))$, which upper-bounds the real quantity of interest $D_{KL}(q_D(x) \,\|\, p_\theta(x))$ via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.[4]
References
- Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
- Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "Chapter 19". Deep Learning. Adaptive Computation and Machine Learning. Cambridge, Mass.: The MIT Press. ISBN 978-0-262-03561-3.
- Hinton, Geoffrey E.; Zemel, Richard (1993). "Autoencoders, Minimum Description Length and Helmholtz Free Energy". Advances in Neural Information Processing Systems. 6. Morgan-Kaufmann.
- Kingma, Diederik P.; Welling, Max (2019-11-27). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4). Section 2.7. arXiv:1906.02691. doi:10.1561/2200000056. ISSN 1935-8237. S2CID 174802445.