Kullback's inequality explained
In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function.[1] If P and Q are probability distributions on the real line, such that P is absolutely continuous with respect to Q, i.e. P << Q, and whose first moments exist, then

D_{KL}(P \parallel Q) \ge \Psi_Q^*(\mu'_1(P)),

where \Psi_Q^* is the rate function, i.e. the convex conjugate of the cumulant-generating function, of Q, and \mu'_1(P) is the first moment of P.
The Cramér–Rao bound is a corollary of this result.
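The inequality can be checked numerically. The following sketch uses the pair P = Normal(1, variance 2) and Q = Normal(0, 1) as an arbitrary illustrative choice (not part of the source): it computes D_{KL}(P \parallel Q) by quadrature and the rate function \Psi_Q^*(\mu'_1(P)) by maximizing \mu\theta - \Psi_Q(\theta).

```python
import numpy as np
from scipy import integrate, optimize, stats

# Illustrative (assumed) distributions: P = N(1, var 2), Q = N(0, 1), so P << Q
# and both first moments exist.
P = stats.norm(loc=1.0, scale=np.sqrt(2.0))
Q = stats.norm(loc=0.0, scale=1.0)

def kl_divergence(p, q, lo=-20.0, hi=20.0):
    """D_KL(p || q) by numerical integration of p(x) log(p(x)/q(x))."""
    val, _ = integrate.quad(lambda x: p.pdf(x) * (p.logpdf(x) - q.logpdf(x)), lo, hi)
    return val

def cgf_Q(theta, lo=-20.0, hi=20.0):
    """Cumulant-generating function Psi_Q(theta) = log E_Q[exp(theta X)]."""
    val, _ = integrate.quad(lambda x: np.exp(theta * x) * Q.pdf(x), lo, hi)
    return np.log(val)

def rate_function(mu):
    """Convex conjugate Psi_Q*(mu) = sup_theta { mu*theta - Psi_Q(theta) }."""
    res = optimize.minimize_scalar(lambda t: cgf_Q(t) - mu * t,
                                   bounds=(-5.0, 5.0), method="bounded")
    return -res.fun

mu1_P = P.mean()             # first moment of P (= 1)
lhs = kl_divergence(P, Q)    # D_KL(P || Q) ~ 0.653
rhs = rate_function(mu1_P)   # Psi_Q*(1)    = 0.5 for Q = N(0, 1)
print(f"D_KL(P||Q) = {lhs:.4f} >= Psi_Q*(mu_1(P)) = {rhs:.4f}")
```

For this pair the bound is strict (about 0.653 ≥ 0.5); if P were also a unit-variance Gaussian, both sides would equal (\mu'_1(P))^2/2 and the inequality would hold with equality.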
Proof
Let P and Q be probability distributions (measures) on the real line, whose first moments exist, and such that P << Q. Consider the natural exponential family of Q given by

Q_\theta(A) = \frac{\int_A e^{\theta x}\,Q(dx)}{\int_{-\infty}^{\infty} e^{\theta x}\,Q(dx)} = \frac{1}{M_Q(\theta)} \int_A e^{\theta x}\,Q(dx)

for every measurable set A, where M_Q is the moment-generating function of Q. (Note that Q_0 = Q.) Then

D_{KL}(P \parallel Q) = D_{KL}(P \parallel Q_\theta) + \int_{\operatorname{supp} P} \left( \log \frac{dQ_\theta}{dQ} \right) dP.
By Gibbs' inequality we have D_{KL}(P \parallel Q_\theta) \ge 0, so that

D_{KL}(P \parallel Q) \ge \int_{\operatorname{supp} P} \left( \log \frac{dQ_\theta}{dQ} \right) dP = \int_{\operatorname{supp} P} \left( \log \frac{e^{\theta x}}{M_Q(\theta)} \right) P(dx).

Simplifying the right side, we have, for every real θ where M_Q(\theta) < \infty,

D_{KL}(P \parallel Q) \ge \mu'_1(P)\,\theta - \Psi_Q(\theta),

where \mu'_1(P) is the first moment, or mean, of P, and \Psi_Q = \log M_Q is called the cumulant-generating function. Taking the supremum over θ completes the process of convex conjugation and yields the rate function:

D_{KL}(P \parallel Q) \ge \sup_\theta \left\{ \mu'_1(P)\,\theta - \Psi_Q(\theta) \right\} = \Psi_Q^*(\mu'_1(P)).
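The decomposition used at the start of the proof can also be checked numerically. The sketch below reuses the illustrative pair P = Normal(1, variance 2), Q = Normal(0, 1) with an arbitrary tilt θ = 0.7 (none of these choices come from the source) and verifies both the identity D_{KL}(P \parallel Q) = D_{KL}(P \parallel Q_\theta) + \int \log(dQ_\theta/dQ)\,dP and the simplification of the second term to \theta\,\mu'_1(P) - \Psi_Q(\theta).

```python
import numpy as np
from scipy import integrate, stats

# Illustrative (assumed) choices: P = N(1, var 2), Q = N(0, 1), tilt theta = 0.7.
P = stats.norm(loc=1.0, scale=np.sqrt(2.0))
Q = stats.norm(loc=0.0, scale=1.0)
theta = 0.7

def quad(f, lo=-20.0, hi=20.0):
    val, _ = integrate.quad(f, lo, hi)
    return val

M_Q = quad(lambda x: np.exp(theta * x) * Q.pdf(x))   # moment-generating function at theta

def q_theta(x):
    # Density of the tilted measure Q_theta from the natural exponential family of Q.
    return np.exp(theta * x) * Q.pdf(x) / M_Q

kl_P_Q      = quad(lambda x: P.pdf(x) * (P.logpdf(x) - Q.logpdf(x)))
kl_P_Qtheta = quad(lambda x: P.pdf(x) * (P.logpdf(x) - np.log(q_theta(x))))
cross_term  = quad(lambda x: P.pdf(x) * np.log(q_theta(x) / Q.pdf(x)))

# Since log(dQ_theta/dQ)(x) = theta*x - log M_Q(theta), the cross term equals
# theta * mu'_1(P) - Psi_Q(theta).
print(kl_P_Q, kl_P_Qtheta + cross_term)              # both ~ 0.6534
print(cross_term, theta * P.mean() - np.log(M_Q))    # both ~ 0.455
```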
Corollary: the Cramér–Rao bound
See main article: Cramér–Rao bound.
Start with Kullback's inequality
Let X_\theta be a family of probability distributions on the real line indexed by the real parameter θ, and satisfying certain regularity conditions. Then

\lim_{h \to 0} \frac{D_{KL}(X_{\theta+h} \parallel X_\theta)}{h^2} \ge \lim_{h \to 0} \frac{\Psi_\theta^*(\mu_{\theta+h})}{h^2},

where \Psi_\theta^* is the convex conjugate of the cumulant-generating function of X_\theta, and \mu_{\theta+h} is the first moment of X_{\theta+h}.
Left side
The left side of this inequality can be simplified as follows:

\lim_{h \to 0} \frac{D_{KL}(X_{\theta+h} \parallel X_\theta)}{h^2} = -\lim_{h \to 0} \frac{1}{h^2} \int_{-\infty}^{\infty} \log\!\left( \frac{dX_\theta}{dX_{\theta+h}} \right) dX_{\theta+h} = \lim_{h \to 0} \frac{1}{2h^2} \int_{-\infty}^{\infty} \left( 1 - \frac{dX_\theta}{dX_{\theta+h}} \right)^{\!2} dX_{\theta+h} = \frac{1}{2} \mathcal I(\theta),

where the second equality uses the expansion \log(1-u) = -u - \tfrac{1}{2}u^2 - \cdots together with the fact that \int (1 - dX_\theta/dX_{\theta+h})\, dX_{\theta+h} = 0. The left side is therefore half the Fisher information \mathcal I(\theta) of the parameter θ.
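This limit can be checked numerically for a family where the Fisher information is known. The family X_\theta = Normal(θ, 1), for which \mathcal I(\theta) = 1, is an illustrative assumption, not from the source:

```python
import numpy as np
from scipy import integrate, stats

# Assumed illustrative family X_theta = N(theta, 1), with Fisher information I(theta) = 1.
theta, h = 0.3, 1e-2
X  = stats.norm(loc=theta, scale=1.0)
Xh = stats.norm(loc=theta + h, scale=1.0)

# D_KL(X_{theta+h} || X_theta) by quadrature; for Gaussians it equals h^2 / 2 exactly.
kl, _ = integrate.quad(lambda x: Xh.pdf(x) * (Xh.logpdf(x) - X.logpdf(x)), -20.0, 20.0)
print(kl / h**2)   # ~ 0.5 = I(theta)/2 for this family
```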
Right side
The right side of the inequality can be developed as follows:

\lim_{h \to 0} \frac{\Psi_\theta^*(\mu_{\theta+h})}{h^2} = \lim_{h \to 0} \frac{1}{h^2} \sup_t \left\{ \mu_{\theta+h}\,t - \Psi_\theta(t) \right\}.

This supremum is attained at a value of t = τ where the first derivative of the cumulant-generating function satisfies \Psi'_\theta(\tau) = \mu_{\theta+h}, but we have \Psi'_\theta(0) = \mu_\theta, so that

\Psi''_\theta(0) = \frac{d\mu_\theta}{d\theta} \lim_{h \to 0} \frac{h}{\tau}.

Moreover,

\lim_{h \to 0} \frac{\Psi_\theta^*(\mu_{\theta+h})}{h^2} = \frac{1}{2 \Psi''_\theta(0)} \left( \frac{d\mu_\theta}{d\theta} \right)^2 = \frac{1}{2 \operatorname{Var}(X_\theta)} \left( \frac{d\mu_\theta}{d\theta} \right)^2.
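A quick numerical sketch of this limit, using the family of exponential distributions with mean θ as an illustrative assumption (not a family used in the source); here \Psi_\theta(t) = -\log(1 - \theta t) for t < 1/θ, \operatorname{Var}(X_\theta) = \theta^2 and d\mu_\theta/d\theta = 1, so the limit should equal 1/(2\theta^2):

```python
import numpy as np
from scipy import optimize

# Assumed illustrative family: X_theta = Exponential with mean theta.
theta, h = 2.0, 1e-3

def cgf(t):
    # Cumulant-generating function Psi_theta(t) = -log(1 - theta*t), valid for t < 1/theta.
    return -np.log(1.0 - theta * t)

# Convex conjugate Psi_theta*(mu_{theta+h}) = sup_t { mu_{theta+h}*t - Psi_theta(t) },
# found by bounded maximization below the singularity at t = 1/theta.
mu_h = theta + h
res = optimize.minimize_scalar(lambda t: cgf(t) - mu_h * t,
                               bounds=(-10.0, 1.0 / theta - 1e-9),
                               method="bounded", options={"xatol": 1e-12})
conjugate = -res.fun

print(conjugate / h**2)        # ~ 0.125 as h -> 0
print(1.0 / (2.0 * theta**2))  # 0.125 = (d mu_theta / d theta)^2 / (2 Var(X_theta))
```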
Putting both sides back together
We have

\frac{1}{2} \mathcal I(\theta) \ge \frac{1}{2 \operatorname{Var}(X_\theta)} \left( \frac{d\mu_\theta}{d\theta} \right)^2,

which can be rearranged as

\operatorname{Var}(X_\theta) \ge \frac{(d\mu_\theta / d\theta)^2}{\mathcal I(\theta)},

which is precisely the Cramér–Rao bound.
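As a final numerical illustration of the rearranged bound, the logistic location family below is an arbitrary assumption (not from the source); for it \mu_\theta = \theta, so d\mu_\theta/d\theta = 1, and the inequality is strict:

```python
import numpy as np
from scipy import integrate, stats

# Assumed illustrative location family X_theta = Logistic(theta, 1).
theta, eps = 0.0, 1e-5
X = stats.logistic(loc=theta, scale=1.0)

def score(x):
    # d/dtheta of log f_theta(x), approximated by a central difference in theta.
    return (stats.logistic.logpdf(x, loc=theta + eps, scale=1.0)
            - stats.logistic.logpdf(x, loc=theta - eps, scale=1.0)) / (2.0 * eps)

# Fisher information I(theta) = E[score(X)^2]; for this family it equals 1/3.
fisher, _ = integrate.quad(lambda x: score(x)**2 * X.pdf(x), -40.0, 40.0)

print(X.var())        # pi^2 / 3 ~ 3.29
print(1.0 / fisher)   # (d mu_theta / d theta)^2 / I(theta) ~ 3.0, so the bound holds strictly
```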
Notes and references
- Fuchs, Aimé; Letta, Giorgio (1970). "L'inégalité de Kullback. Application à la théorie de l'estimation". Séminaire de Probabilités de Strasbourg. 4: 108–131.