Evidence lower bound explained

In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound^[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.

The ELBO is useful because it provides a guarantee on the worst case for the log-likelihood of some distribution (e.g. $p(X)$) which models a set of data. The actual log-likelihood may be higher (indicating an even better fit to the distribution) because the ELBO includes a Kullback-Leibler divergence (KL divergence) term, which decreases the ELBO when an internal part of the model is inaccurate, even if the model fits well overall. Thus improving the ELBO score indicates either improving the likelihood of the model $p(X)$ or the fit of a component internal to the model, or both, and the ELBO score makes a good loss function, e.g., for training a deep neural network to improve both the model overall and the internal component. (The internal component is $q_\phi(\cdot|x)$, defined in detail later in this article.)

Definition

Let $X$ and $Z$ be random variables, jointly distributed with distribution $p_\theta$. For example, $p_\theta(X)$ is the marginal distribution of $X$, and $p_\theta(Z \mid X)$ is the conditional distribution of $Z$ given $X$. Then, for a sample $x \sim p_{\text{data}}$, and any distribution $q_\phi$, the ELBO is defined as

$L(\phi, \theta; x) := \mathbb E_{z\sim q_\phi(\cdot | x)} \left[\ln\frac{p_\theta(x, z)}{q_\phi(z|x)} \right] .$

The ELBO can equivalently be written as^[2]

$\begin{aligned} L(\phi, \theta; x) &= \mathbb E_{z\sim q_\phi(\cdot | x)}\left[\ln p_\theta(x, z) \right] + H[q_\phi(z|x)] \\ &= \ln p_\theta(x) - D_{KL}(q_\phi(z|x) \| p_\theta(z|x)) . \end{aligned}$

In the first line, $H[q_\phi(z|x)]$ is the entropy of $q_\phi$, which relates the ELBO to the Helmholtz free energy.^[3] In the second line, $\ln p_\theta(x)$ is called the evidence for $x$, and $D_{KL}(q_\phi(z|x) \| p_\theta(z|x))$ is the Kullback-Leibler divergence between $q_\phi(\cdot|x)$ and the posterior $p_\theta(\cdot|x)$. Since the Kullback-Leibler divergence is non-negative, $L(\phi, \theta; x)$ forms a lower bound on the evidence (ELBO inequality)

$\ln p_\theta(x) \ge \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z\vert x)} \right].$
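As a quick numerical check of the two equivalent forms and of the ELBO inequality, the following sketch evaluates them on a small discrete model. The three-state latent, the probability tables, and the helper names `elbo` and `kl` are assumptions made purely for illustration; they are not part of the definition above.

```python
import math

# Hypothetical toy model: latent z in {0, 1, 2} and one fixed observation x.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p_theta(x | z), evaluated at the fixed x
q = [0.2, 0.3, 0.5]              # q_phi(z | x), chosen arbitrarily

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p_theta(x, z)
evidence = sum(joint)                                   # p_theta(x)
posterior = [j / evidence for j in joint]               # p_theta(z | x)


def elbo(q_dist, joint_dist):
    """E_{z ~ q}[ln p_theta(x, z) - ln q(z | x)] for a discrete latent."""
    return sum(qz * (math.log(j) - math.log(qz))
               for qz, j in zip(q_dist, joint_dist))


def kl(a, b):
    """D_KL(a || b) for discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


log_evidence = math.log(evidence)
bound = elbo(q, joint)

# First line of the equivalent form: expected log-joint plus entropy of q.
expected_log_joint = sum(qz * math.log(j) for qz, j in zip(q, joint))
entropy_q = -sum(qz * math.log(qz) for qz in q)

# Second line: evidence minus the KL divergence from q to the exact posterior.
gap = kl(q, posterior)

print(f"ln p(x)        = {log_evidence:.6f}")
print(f"ELBO           = {bound:.6f}")
print(f"ln p(x) - KL   = {log_evidence - gap:.6f}")
print(f"E[ln p] + H[q] = {expected_log_joint + entropy_q:.6f}")
```

Replacing `q` with `posterior` makes the KL term vanish, so in this toy model the bound is then attained with equality.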

Motivation

Variational Bayesian inference

Suppose we have an observable random variable $X$, and we want to find its true distribution $p^*$. This would allow us to generate data by sampling, and estimate probabilities of future events. In general, it is impossible to find $p^*$ exactly, forcing us to search for a good approximation.

That is, we define a sufficiently large parametric family $\{p_\theta\}_{\theta\in\Theta}$ of distributions, then solve for $\min_\theta L(p_\theta, p^*)$ for some loss function $L$. One possible way to solve this is by considering a small variation from $p_\theta$ to $p_{\theta+\delta}$, and solving $L(p_\theta, p^*) - L(p_{\theta+\delta}, p^*) = 0$. This is a problem in the calculus of variations, thus it is called the variational method.

Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider implicitly parametrized probability distributions:

First, define a simple distribution $p(z)$ over a latent random variable $Z$. Usually a normal distribution or a uniform distribution suffices.

Next, define a family of complicated functions $f_\theta$ (such as a deep neural network) parametrized by $\theta$.

Finally, define a way to convert any $f_\theta(z)$ into a distribution (in general simple too, but unrelated to $p(z)$) over the observable random variable $X$. For example, let $f_\theta(z) = (f_1(z), f_2(z))$ have two outputs, then we can define the corresponding distribution over $X$ to be the normal distribution $\mathcal N(f_1(z), e^{f_2(z)})$.

This defines a family of joint distributions $p_\theta$ over $(X, Z)$. It is very easy to sample $(x, z) \sim p_\theta$: simply sample $z \sim p$, then compute $f_\theta(z)$, and finally sample $x \sim p_\theta(\cdot|z)$ using $f_\theta(z)$.
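The sampling procedure just described (draw the latent, push it through the function, draw the observable) can be sketched as follows. Here `f_theta` is a hypothetical affine map standing in for a deep neural network, the prior is assumed to be a standard normal, and the parameter values in `theta` are arbitrary; none of these choices come from the text.

```python
import math
import random


def f_theta(z, theta):
    """Hypothetical stand-in for a neural network: returns (f_1(z), f_2(z))."""
    w1, b1, w2, b2 = theta
    return w1 * z + b1, w2 * z + b2


def sample_joint(theta, rng):
    """Draw one pair (x, z) from the joint distribution p_theta."""
    z = rng.gauss(0.0, 1.0)                      # z ~ p(z), assumed N(0, 1)
    f1, f2 = f_theta(z, theta)                   # compute f_theta(z)
    x = rng.gauss(f1, math.sqrt(math.exp(f2)))   # x ~ N(f_1(z), e^{f_2(z)})
    return x, z


rng = random.Random(0)
theta = (2.0, 1.0, 0.0, -2.0)   # assumed parameter values, for illustration only
samples = [sample_joint(theta, rng) for _ in range(5)]
for x, z in samples:
    print(f"z = {z:+.3f}  ->  x = {x:+.3f}")
```

Note that `random.gauss` takes a standard deviation, so the variance $e^{f_2(z)}$ is passed through a square root.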

In other words, we have a generative model for both the observable and the latent.

Now, we consider a distribution $p_\theta$ good if it is a close approximation of $p^*$:

$p_\theta(X) \approx p^*(X)$

Since the distribution on the right side is over $X$ only, the distribution on the left side must marginalize the latent variable $Z$ away. In general, the integral $p_\theta(x) = \int p_\theta(x|z) p(z) dz$ cannot be computed exactly, forcing us to perform another approximation.

Since $p_\theta(x) = \frac{p_\theta(x|z) p(z)}{p_\theta(z|x)}$ (Bayes' rule), it suffices to find a good approximation of $p_\theta(z|x)$. So define another distribution family $q_\phi(z|x)$ and use it to approximate $p_\theta(z|x)$. This is a discriminative model for the latent.

The entire situation is summarized in the following table:

| $X$: observable | $X, Z$ | $Z$: latent |
| --- | --- | --- |
| $p^*(x) \approx p_\theta(x) \approx \frac{p_\theta(x \mid z) p(z)}{q_\phi(z \mid x)}$, approximable | | $p(z)$, easy |
| | $p_\theta(x \mid z) p(z)$, easy | |
| $p_\theta(z \mid x) \approx q_\phi(z \mid x)$, approximable | | $p_\theta(x \mid z)$, easy |

In Bayesian language, $X$ is the observed evidence, and $Z$ is the latent/unobserved. The distribution $p$ over $Z$ is the prior distribution over $Z$, $p_\theta(x|z)$ is the likelihood function, and $p_\theta(z|x)$ is the posterior distribution over $Z$.

Given an observation $x$, we can infer what $z$ likely gave rise to $x$ by computing $p_\theta(z|x)$. The usual Bayesian method is to estimate the integral $p_\theta(x) = \int p_\theta(x|z) p(z) dz$, then compute by Bayes' rule $p_\theta(z|x) = \frac{p_\theta(x|z) p(z)}{p_\theta(x)}$. This is expensive to perform in general, but if we can simply find a good approximation $q_\phi(z|x) \approx p_\theta(z|x)$ for most $x, z$, then we can infer $z$ from $x$ cheaply. Thus, the search for a good $q_\phi$ is also called amortized inference.

All in all, we have found a problem of variational Bayesian inference.

Deriving the ELBO

A basic result in variational inference is that minimizing the Kullback-Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:

$\mathbb E_{x\sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}(p^*(x) \| p_\theta(x))$

where $H(p^*) = -\mathbb E_{x\sim p^*}[\ln p^*(x)]$ is the entropy of the true distribution. So if we can maximize $\mathbb E_{x\sim p^*(x)}[\ln p_\theta(x)]$, we can minimize $D_{KL}(p^*(x) \| p_\theta(x))$, and consequently find an accurate approximation $p_\theta \approx p^*$.
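The identity between expected log-likelihood, entropy and KL-divergence can be verified numerically on a discrete example. The two three-outcome distributions below are arbitrary assumptions chosen only to make the arithmetic concrete.

```python
import math

# Hypothetical true distribution p* and model p_theta over three outcomes.
p_star = [0.5, 0.3, 0.2]
p_theta = [0.4, 0.4, 0.2]

# Left-hand side: E_{x ~ p*}[ln p_theta(x)].
expected_loglik = sum(ps * math.log(pt) for ps, pt in zip(p_star, p_theta))

# Right-hand side terms: entropy of p* and D_KL(p* || p_theta).
entropy = -sum(ps * math.log(ps) for ps in p_star)
kl = sum(ps * math.log(ps / pt) for ps, pt in zip(p_star, p_theta))

print(f"E[ln p_theta(x)] = {expected_loglik:.6f}")
print(f"-H(p*) - KL      = {-entropy - kl:.6f}")
```

Because $H(p^*)$ does not depend on $\theta$, any change to `p_theta` that raises the first printed value lowers the KL-divergence by the same amount.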

To maximize $\mathbb E_{x\sim p^*(x)}[\ln p_\theta(x)]$, we simply sample many $x_i \sim p^*(x)$, i.e. use a Monte Carlo estimate

$N \max_\theta \mathbb E_{x\sim p^*(x)}[\ln p_\theta(x)] \approx \max_\theta \sum_i \ln p_\theta(x_i)$

where $N$ is the number of samples drawn from the true distribution. Because a finite sample stands in for $p^*$, this approximation can be seen as overfitting.

In order to maximize $\sum_i \ln p_\theta(x_i)$, it's necessary to find $\ln p_\theta(x)$:

$\ln p_\theta(x) = \ln \int p_\theta(x|z) p(z) dz$

This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:

$\int p_\theta(x|z) p(z) dz = \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$

where $q_\phi(z|x)$ is a sampling distribution over $z$ that we use to perform the Monte Carlo integration.

So we see that if we sample $z \sim q_\phi(\cdot|x)$, then $\frac{p_\theta(x, z)}{q_\phi(z|x)}$ is an unbiased estimator of $p_\theta(x)$. Unfortunately, this does not give us an unbiased estimator of $\ln p_\theta(x)$, because $\ln$ is nonlinear. Indeed, we have by Jensen's inequality,

$\ln p_\theta(x) = \ln \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \geq \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$
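A small simulation makes the contrast between the two estimators visible. The sketch below assumes a conjugate Gaussian model, $p(z) = \mathcal N(0, 1)$ and $p_\theta(x|z) = \mathcal N(z, 1)$, so that the evidence $p_\theta(x) = \mathcal N(x; 0, 2)$ is known exactly; the observation `x` and the proposal parameters `q_mean`, `q_var` are arbitrary illustrative choices.

```python
import math
import random


def normal_logpdf(v, mean, var):
    """Log-density of N(mean, var) at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


x = 1.5                    # fixed observation (assumed)
q_mean, q_var = 0.5, 1.0   # hypothetical sampling distribution q_phi(. | x)

rng = random.Random(0)
log_weights = []
for _ in range(50_000):
    z = rng.gauss(q_mean, math.sqrt(q_var))                        # z ~ q_phi(. | x)
    log_joint = normal_logpdf(z, 0.0, 1.0) + normal_logpdf(x, z, 1.0)
    log_weights.append(log_joint - normal_logpdf(z, q_mean, q_var))

# Average of the ratio p_theta(x, z) / q_phi(z | x): unbiased for p_theta(x).
mean_weight = sum(math.exp(lw) for lw in log_weights) / len(log_weights)
# Average of the log-ratio: estimates a quantity below ln p_theta(x).
mean_log_weight = sum(log_weights) / len(log_weights)

true_evidence = math.exp(normal_logpdf(x, 0.0, 2.0))
print(f"p(x) exact          = {true_evidence:.5f}")
print(f"mean of weights     = {mean_weight:.5f}")
print(f"ln p(x) exact       = {math.log(true_evidence):.5f}")
print(f"mean of log-weights = {mean_log_weight:.5f}")
```

In this setup the average weight should land close to the exact evidence, while the average log-weight sits visibly below the exact log-evidence, as Jensen's inequality predicts.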

In fact, all the obvious estimators of $\ln p_\theta(x)$ are biased downwards, because no matter how many samples of $z_i \sim q_\phi(\cdot|x)$ we take, we have by Jensen's inequality:

$\mathbb E_{z_i \sim q_\phi(\cdot|x)}\left[\ln \left(\frac 1N \sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i|x)}\right)\right] \leq \ln \mathbb E_{z_i \sim q_\phi(\cdot|x)}\left[\frac 1N \sum_i \frac{p_\theta(x, z_i)}{q_\phi(z_i|x)}\right] = \ln p_\theta(x)$

Subtracting the right side, we see that the problem comes down to a biased estimator of zero:

$\mathbb E_{z_i \sim q_\phi(\cdot|x)}\left[\ln \left(\frac 1N \sum_i \frac{p_\theta(z_i|x)}{q_\phi(z_i|x)}\right)\right] \leq 0$
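The size of this downward bias for different $N$ can be estimated by simulation. The sketch below again assumes a conjugate Gaussian model ($p(z) = \mathcal N(0, 1)$, $p_\theta(x|z) = \mathcal N(z, 1)$, hence $p_\theta(x) = \mathcal N(x; 0, 2)$) with an arbitrary proposal; the constants `X`, `Q_MEAN`, `Q_VAR` and the helper `log_avg_weight` are illustrative assumptions.

```python
import math
import random

X = 1.5                    # fixed observation (assumed)
Q_MEAN, Q_VAR = 0.5, 1.0   # hypothetical proposal q_phi(. | x)


def log_normal(v, mean, var):
    """Log-density of N(mean, var) at v."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


def log_avg_weight(n, rng):
    """ln of the n-sample average of p_theta(x, z_i) / q_phi(z_i | x)."""
    total = 0.0
    for _ in range(n):
        z = rng.gauss(Q_MEAN, math.sqrt(Q_VAR))
        log_w = (log_normal(z, 0.0, 1.0) + log_normal(X, z, 1.0)
                 - log_normal(z, Q_MEAN, Q_VAR))
        total += math.exp(log_w)
    return math.log(total / n)


rng = random.Random(0)
repeats = 4000                       # outer Monte Carlo repetitions
log_evidence = log_normal(X, 0.0, 2.0)
bounds = {}
for n in (1, 5, 50):
    bounds[n] = sum(log_avg_weight(n, rng) for _ in range(repeats)) / repeats
    print(f"N = {n:2d}: E[ln average weight] ~ {bounds[n]:.4f}")
print(f"ln p(x) = {log_evidence:.4f}")
```

Up to Monte Carlo noise, every estimate stays below $\ln p_\theta(x)$ and the gap shrinks as $N$ grows, which is the effect the importance-weighted approach exploits.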

At this point, we could branch off towards the development of an importance-weighted autoencoder, but we will instead continue with the simplest case with $N = 1$:

$\ln p_\theta(x) = \ln \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \geq \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]$

The tightness of the inequality has a closed form:

$\ln p_\theta(x) - \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = D_{KL}(q_\phi(\cdot | x) \| p_\theta(\cdot | x)) \geq 0$
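One way to see where this closed form comes from is to factor the joint as $p_\theta(x, z) = p_\theta(z|x)\, p_\theta(x)$ inside the expectation. The short derivation below is a sketch that uses only this factorization and the definition of the KL-divergence:

```latex
\begin{aligned}
\mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z|x)}\right]
&= \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln p_\theta(x) + \ln\frac{p_\theta(z|x)}{q_\phi(z|x)}\right] \\
&= \ln p_\theta(x) - \mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{q_\phi(z|x)}{p_\theta(z|x)}\right] \\
&= \ln p_\theta(x) - D_{KL}(q_\phi(\cdot|x) \,\|\, p_\theta(\cdot|x)).
\end{aligned}
```

The first step works because $\ln p_\theta(x)$ does not depend on $z$, so it passes through the expectation unchanged.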

We have thus obtained the ELBO function:

$L(\phi, \theta; x) := \ln p_\theta(x) - D_{KL}(q_\phi(\cdot | x) \| p_\theta(\cdot | x))$

Maximizing the ELBO

For fixed $x$, the optimization $\max_{\theta, \phi} L(\phi, \theta; x)$ simultaneously attempts to maximize $\ln p_\theta(x)$ and minimize $D_{KL}(q_\phi(\cdot|x) \| p_\theta(\cdot|x))$. If the parametrizations for $p_\theta$ and $q_\phi$ are flexible enough, we would obtain some $\hat\phi, \hat\theta$, such that we have simultaneously

$\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \quad q_{\hat\phi}(\cdot | x) \approx p_{\hat\theta}(\cdot | x)$

Since

$\mathbb E_{x\sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{KL}(p^*(x) \| p_\theta(x))$

we have, in expectation over $x \sim p^*$,

$\mathbb E_{x\sim p^*(x)}[\ln p_{\hat\theta}(x)] \approx \max_\theta \left(-H(p^*) - D_{KL}(p^*(x) \| p_\theta(x))\right)$

and so

$\hat\theta \approx \arg\min_\theta D_{KL}(p^*(x) \| p_\theta(x))$

In other words, maximizing the ELBO would simultaneously allow us to obtain an accurate generative model $p_{\hat\theta} \approx p^*$ and an accurate discriminative model $q_{\hat\phi}(\cdot|x) \approx p_{\hat\theta}(\cdot|x)$.
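To illustrate the joint maximization for a single fixed $x$, the sketch below runs plain gradient ascent on a model small enough that the ELBO and its gradients have closed forms. The model is an assumption chosen for tractability: $p(z) = \mathcal N(0, 1)$, $p_\theta(x|z) = \mathcal N(\theta z, 1)$ with scalar $\theta$, and $q_\phi(\cdot|x) = \mathcal N(m, e^\rho)$ with $\phi = (m, \rho)$. The step size, iteration count and starting point are arbitrary. In practice the gradients would usually be estimated from samples rather than written by hand.

```python
import math

x = 1.5  # single fixed observation (assumed)


def elbo(m, rho, theta):
    """Closed-form ELBO for the assumed Gaussian model and Gaussian q."""
    s2 = math.exp(rho)
    # E_q[ln p_theta(x | z)] with z ~ N(m, s2) and x | z ~ N(theta * z, 1).
    expected_loglik = (-0.5 * math.log(2.0 * math.pi)
                       - 0.5 * ((x - theta * m) ** 2 + theta ** 2 * s2))
    # D_KL(N(m, s2) || N(0, 1)).
    kl_to_prior = 0.5 * (s2 + m ** 2 - 1.0 - rho)
    return expected_loglik - kl_to_prior


def gradient(m, rho, theta):
    """Partial derivatives of the ELBO with respect to (m, rho, theta)."""
    s2 = math.exp(rho)
    d_m = theta * (x - theta * m) - m
    d_rho = 0.5 * (1.0 - (theta ** 2 + 1.0) * s2)
    d_theta = m * (x - theta * m) - theta * s2
    return d_m, d_rho, d_theta


m, rho, theta = 0.0, 0.0, 1.0   # arbitrary starting point
step = 0.05
for _ in range(5000):
    d_m, d_rho, d_theta = gradient(m, rho, theta)
    m, rho, theta = m + step * d_m, rho + step * d_rho, theta + step * d_theta

# Exact quantities, available here only because the model is conjugate.
evidence_var = theta ** 2 + 1.0   # p_theta(x) = N(0, theta^2 + 1)
log_evidence = -0.5 * (math.log(2.0 * math.pi * evidence_var) + x ** 2 / evidence_var)
post_mean = theta * x / evidence_var   # mean of p_theta(z | x)
post_var = 1.0 / evidence_var          # variance of p_theta(z | x)

print(f"ELBO = {elbo(m, rho, theta):.6f}, ln p(x) = {log_evidence:.6f}")
print(f"q mean = {m:.4f} vs posterior mean = {post_mean:.4f}")
print(f"q var  = {math.exp(rho):.4f} vs posterior var  = {post_var:.4f}")
```

At convergence the ELBO coincides with $\ln p_{\hat\theta}(x)$ and $q_{\hat\phi}$ matches the exact posterior, which is possible here only because the Gaussian family for $q_\phi$ contains that posterior.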

Main forms

The ELBO has many possible expressions, each with some different emphasis.

$\mathbb E_{z\sim q_\phi(\cdot|x)}\left[\ln\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \int q_\phi(z|x)\ln\frac{p_\theta(x, z)}{q_\phi(z|x)}\,dz$

This form shows that if we sample $z \sim q_\phi(\cdot|x)$, then $\ln\frac{p_\theta(x, z)}{q_\phi(z|x)}$ is an unbiased estimator of the ELBO.

$\ln p_\theta(x) - D_{KL}(q_\phi(\cdot | x) \;\|\; p_\theta(\cdot | x))$

This form shows that the ELBO is a lower bound on the evidence $\ln p_\theta(x)$, and that maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL-divergence from $p_\theta(\cdot|x)$ to $q_\phi(\cdot|x)$.

$\mathbb E_{z\sim q_\phi(\cdot|x)}[\ln p_\theta(x|z)] - D_{KL}(q_\phi(\cdot | x) \;\|\; p)$

This form shows that maximizing the ELBO simultaneously attempts to keep $q_\phi(\cdot|x)$ close to $p$ and concentrate $q_\phi(\cdot|x)$ on those $z$ that maximize $\ln p_\theta(x|z)$. That is, the approximate posterior $q_\phi(\cdot|x)$ balances between staying close to the prior $p$ and moving towards the maximum likelihood $\arg\max_z \ln p_\theta(x|z)$.
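The three expressions can be compared numerically, together with the sampling-based estimator mentioned for the first form. The discrete three-state model below, including its probability tables and the helper `kl`, is a hypothetical example introduced only for this check.

```python
import math
import random

# Hypothetical toy model: latent z in {0, 1, 2} and one fixed observation x.
prior = [0.5, 0.3, 0.2]          # p(z)
likelihood = [0.10, 0.40, 0.70]  # p_theta(x | z), evaluated at the fixed x
q = [0.2, 0.3, 0.5]              # q_phi(z | x), chosen arbitrarily

joint = [pz * px for pz, px in zip(prior, likelihood)]  # p_theta(x, z)
evidence = sum(joint)                                   # p_theta(x)
posterior = [j / evidence for j in joint]               # p_theta(z | x)


def kl(a, b):
    """D_KL(a || b) for discrete distributions with full support."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))


# Form 1: expectation of the log-ratio under q.
form_expectation = sum(qz * math.log(j / qz) for qz, j in zip(q, joint))
# Form 2: evidence minus KL from q to the posterior.
form_evidence = math.log(evidence) - kl(q, posterior)
# Form 3: expected log-likelihood minus KL from q to the prior.
form_reconstruction = (sum(qz * math.log(lk) for qz, lk in zip(q, likelihood))
                       - kl(q, prior))

# Sampling-based estimate of form 1: average ln p(x, z)/q(z | x) over z ~ q.
rng = random.Random(0)
draws = rng.choices(range(3), weights=q, k=20_000)
mc_estimate = sum(math.log(joint[z] / q[z]) for z in draws) / len(draws)

print(f"form 1 = {form_expectation:.6f}")
print(f"form 2 = {form_evidence:.6f}")
print(f"form 3 = {form_reconstruction:.6f}")
print(f"Monte Carlo estimate of form 1 = {mc_estimate:.6f}")
```

The three closed-form values agree to floating-point precision, and the sample average approaches them as the number of draws grows.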

Data-processing inequality

Suppose we take $N$ independent samples from $p^*$, and collect them in the dataset $D = \{x_1, ..., x_N\}$, then we have the empirical distribution $q_D(x) = \frac 1N \sum_i \delta_{x_i}$.

Fitting $p_\theta(x)$ to $q_D(x)$ can be done, as usual, by maximizing the log-likelihood $\ln p_\theta(D)$:

$D_{KL}(q_D(x) \| p_\theta(x)) = -\frac 1N \sum_i \ln p_\theta(x_i) - H(q_D) = -\frac 1N \ln p_\theta(D) - H(q_D)$

Now, by the ELBO inequality, we can bound $\ln p_\theta(D)$, and thus

$D_{KL}(q_D(x) \| p_\theta(x)) \leq -\frac 1N L(\phi, \theta; D) - H(q_D)$

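The right-hand side simplifies to a KL-divergence, and so we get:

$D_{KL}(q_D(x) \| p_\theta(x)) \leq -\frac 1N \sum_i L(\phi, \theta; x_i) - H(q_D) = D_{KL}(q_{D,\phi}(x, z) \| p_\theta(x, z))$

where $q_{D,\phi}(x, z) = q_D(x)\, q_\phi(z|x)$ is the joint distribution obtained by pairing each data point with a latent drawn from $q_\phi(\cdot|x)$.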

This result can be interpreted as a special case of the data processing inequality.

In this interpretation, maximizing $L(\phi, \theta; D) = \sum_i L(\phi, \theta; x_i)$ is minimizing $D_{KL}(q_{D,\phi}(x, z) \| p_\theta(x, z))$, which upper-bounds the real quantity of interest $D_{KL}(q_D(x) \| p_\theta(x))$ via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.^[4]
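The chain of relations in this section can be checked on a small discrete example. Everything below is a hypothetical setup: a binary observable, a binary latent, an eight-element dataset, and arbitrary tables `p_joint` for $p_\theta(x, z)$ and `q_enc` for $q_\phi(z|x)$. The joint over data and latents is taken to be $q_D(x)\, q_\phi(z|x)$.

```python
import math
from collections import Counter

data = [0, 0, 1, 0, 1, 1, 0, 0]   # hypothetical dataset D, x in {0, 1}
n = len(data)
q_data = {x: c / n for x, c in Counter(data).items()}   # empirical q_D(x)

# Hypothetical model p_theta(x, z), indexed as p_joint[x][z].
p_joint = [[0.30, 0.20],
           [0.10, 0.40]]
# Hypothetical q_phi(z | x), indexed as q_enc[x][z]; each row sums to 1.
q_enc = [[0.7, 0.3],
         [0.4, 0.6]]


def elbo(x):
    """L(phi, theta; x) for a discrete latent."""
    return sum(q_enc[x][z] * math.log(p_joint[x][z] / q_enc[x][z])
               for z in (0, 1))


p_marginal = [sum(row) for row in p_joint]                    # p_theta(x)
entropy_data = -sum(p * math.log(p) for p in q_data.values()) # H(q_D)

# KL between the empirical distribution and the model marginal over x.
kl_marginal = sum(q_data[x] * math.log(q_data[x] / p_marginal[x])
                  for x in q_data)
# KL between the joint q_D(x) q_phi(z | x) and the model joint over (x, z).
kl_joint = sum(q_data[x] * q_enc[x][z]
               * math.log(q_data[x] * q_enc[x][z] / p_joint[x][z])
               for x in q_data for z in (0, 1))
# The ELBO-based upper bound: -(1/N) sum_i L(x_i) - H(q_D).
elbo_bound = -sum(elbo(x) for x in data) / n - entropy_data

print(f"KL over x        = {kl_marginal:.6f}")
print(f"ELBO-based bound = {elbo_bound:.6f}")
print(f"KL over (x, z)   = {kl_joint:.6f}")
```

The last two printed values coincide, and both are at least as large as the first, matching the inequality above.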

References

[1] Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
[2] Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "Chapter 19". Deep Learning. Adaptive Computation and Machine Learning. Cambridge, Mass.: The MIT Press. ISBN 978-0-262-03561-3.
[3] Hinton, Geoffrey E.; Zemel, Richard (1993). "Autoencoders, Minimum Description Length and Helmholtz Free Energy". Advances in Neural Information Processing Systems. 6. Morgan-Kaufmann.
[4] Kingma, Diederik P.; Welling, Max (2019-11-27). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4). Section 2.7. arXiv:1906.02691. doi:10.1561/2200000056. ISSN 1935-8237. S2CID 174802445.