Kullback's inequality explained

In information theory and statistics, Kullback's inequality is a lower bound on the Kullback–Leibler divergence expressed in terms of the large deviations rate function.[1] If P and Q are probability distributions on the real line, such that P is absolutely continuous with respect to Q, i.e. P << Q, and whose first moments exist, then

D_{KL}(P\parallel Q) \ge \Psi_Q^*(\mu'_1(P)),

where \Psi_Q^* is the rate function, i.e. the convex conjugate of the cumulant-generating function, of Q, and \mu'_1(P) is the first moment of P.

The Cramér–Rao bound is a corollary of this result.
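
As a concrete check (the particular distributions below are an illustrative choice, not part of the source), take Q = N(0, 1), for which \Psi_Q(\theta) = \theta^2/2 and hence \Psi_Q^*(x) = x^2/2, and P = N(\mu, \sigma^2), for which D_{KL}(P\parallel Q) = -\log\sigma + (\sigma^2 + \mu^2)/2 - 1/2 in closed form. A minimal Python sketch comparing the two sides:

```python
import numpy as np

def kl_normal_vs_std_normal(mu, sigma):
    """Closed form of D_KL(N(mu, sigma^2) || N(0, 1))."""
    return -np.log(sigma) + (sigma**2 + mu**2) / 2 - 0.5

def rate_std_normal(x):
    """Psi_Q^*(x) for Q = N(0, 1): the conjugate of Psi_Q(theta) = theta^2/2 is x^2/2."""
    return x**2 / 2

# Kullback's inequality: D_KL(P || Q) >= Psi_Q^*(first moment of P).
for mu, sigma in [(0.5, 1.0), (2.0, 0.7), (-1.0, 3.0)]:
    lhs = kl_normal_vs_std_normal(mu, sigma)
    rhs = rate_std_normal(mu)          # the first moment of P = N(mu, sigma^2) is mu
    assert lhs >= rhs - 1e-12
    print(f"mu={mu:+.1f}, sigma={sigma:.1f}: D_KL = {lhs:.4f} >= Psi_Q^* = {rhs:.4f}")
```

Note that with \sigma = 1 the bound is attained, since P is then itself an exponential tilting of Q, which is the construction used in the proof below.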

Proof

Let P and Q be probability distributions (measures) on the real line, whose first moments exist, and such that P << Q. Consider the natural exponential family of Q given by

Q_\theta(A) = \frac{\int_A e^{\theta x}\,Q(dx)}{\int_{-\infty}^\infty e^{\theta x}\,Q(dx)} = \frac{1}{M_Q(\theta)} \int_A e^{\theta x}\,Q(dx)

for every measurable set A, where M_Q is the moment-generating function of Q. (Note that Q_0 = Q.) Then

D_{KL}(P\parallel Q) = D_{KL}(P\parallel Q_\theta) + \int_{\operatorname{supp}P}\left(\log\frac{\mathrm dQ_\theta}{\mathrm dQ}\right)\mathrm dP.

By Gibbs' inequality we have D_{KL}(P\parallel Q_\theta) \ge 0, so that

D_{KL}(P\parallel Q) \ge \int_{\operatorname{supp}P}\left(\log\frac{\mathrm dQ_\theta}{\mathrm dQ}\right)\mathrm dP = \int_{\operatorname{supp}P}\left(\log\frac{e^{\theta x}}{M_Q(\theta)}\right) P(dx).

Simplifying the right side, we have, for every real θ where M_Q(\theta) < \infty:

D_{KL}(P\parallel Q) \ge \mu'_1(P)\,\theta - \Psi_Q(\theta),

where \mu'_1(P) is the first moment, or mean, of P, and \Psi_Q = \log M_Q is called the cumulant-generating function. Taking the supremum completes the process of convex conjugation and yields the rate function:

D_{KL}(P\parallel Q) \ge \sup_\theta \left\{\mu'_1(P)\,\theta - \Psi_Q(\theta)\right\} = \Psi_Q^*(\mu'_1(P)).
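
The same chain of steps can be checked numerically. The sketch below (the support, the weights and the θ grid are choices made for this illustration, not taken from the source) computes \Psi_Q directly, approximates its convex conjugate at \mu'_1(P) by a grid supremum over θ, and compares the result with D_{KL}(P\parallel Q):

```python
import numpy as np

# Illustrative discrete distributions on a common support, with P << Q.
x = np.array([0.0, 1.0, 2.0, 3.0])
q = np.array([0.4, 0.3, 0.2, 0.1])
p = np.array([0.1, 0.2, 0.3, 0.4])

def kl(p, q):
    """D_KL(P || Q) for discrete distributions on the same support."""
    return float(np.sum(p * np.log(p / q)))

def cgf_q(theta):
    """Cumulant-generating function Psi_Q(theta) = log E_Q[exp(theta*X)]."""
    return np.log(np.sum(q * np.exp(theta * x)))

def rate_q(m, thetas=np.linspace(-20.0, 20.0, 20001)):
    """Grid approximation of Psi_Q^*(m) = sup_theta {m*theta - Psi_Q(theta)}."""
    return float(np.max([m * t - cgf_q(t) for t in thetas]))

mean_p = float(np.sum(p * x))          # first moment of P
lhs, rhs = kl(p, q), rate_q(mean_p)
assert lhs >= rhs                      # the grid supremum can only undershoot Psi_Q^*
print(f"D_KL(P || Q) = {lhs:.4f} >= Psi_Q^*(mu'_1(P)) = {rhs:.4f}")
```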

Corollary: the Cramér–Rao bound

See main article: Cramér–Rao bound.

Start with Kullback's inequality

Let X_\theta be a family of probability distributions on the real line indexed by the real parameter θ, and satisfying certain regularity conditions. Then

\lim_{h\to 0} \frac{D_{KL}(X_{\theta+h}\parallel X_\theta)}{h^2} \ge \lim_{h\to 0} \frac{\Psi^*_\theta(\mu_{\theta+h})}{h^2},

where \Psi^*_\theta is the convex conjugate of the cumulant-generating function of X_\theta and \mu_{\theta+h} is the first moment of X_{\theta+h}.

Left side

The left side of this inequality can be simplified as follows:

\begin{align}
\lim_{h\to 0} \frac{D_{KL}(X_{\theta+h}\parallel X_\theta)}{h^2}
&= \lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \log \left(\frac{\mathrm dX_{\theta+h}}{\mathrm dX_\theta}\right) \mathrm dX_{\theta+h} \\
&= -\lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \log \left(\frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}}\right) \mathrm dX_{\theta+h} \\
&= -\lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \log\left(1 - \left(1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}}\right)\right) \mathrm dX_{\theta+h} \\
&= \lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \left[\left(1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}}\right) + \frac 1 2 \left(1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}}\right)^2 + o\left(\left(1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}}\right)^2\right)\right] \mathrm dX_{\theta+h} && \text{Taylor series for } \log(1-t) \\
&= \lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \left[\frac 1 2 \left(1 - \frac{\mathrm dX_\theta}{\mathrm dX_{\theta+h}}\right)^2\right] \mathrm dX_{\theta+h} && \text{the first-order term integrates to } 1 - 1 = 0 \\
&= \lim_{h\to 0} \frac 1 {h^2} \int_{-\infty}^\infty \left[\frac 1 2 \left(\frac{\mathrm dX_{\theta+h} - \mathrm dX_\theta}{\mathrm dX_{\theta+h}}\right)^2\right] \mathrm dX_{\theta+h} \\
&= \frac 1 2 \mathcal I_X(\theta)
\end{align}

which is half the Fisher information of the parameter θ.
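
This limit can be observed numerically; the sketch below uses a Bernoulli(θ) family as an illustrative choice (not taken from the source), for which \mathcal I_X(\theta) = 1/(\theta(1-\theta)):

```python
import numpy as np

def kl_bernoulli(a, b):
    """D_KL(Bernoulli(a) || Bernoulli(b))."""
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

theta = 0.3
fisher = 1.0 / (theta * (1 - theta))   # Fisher information of Bernoulli(theta)

# D_KL(X_{theta+h} || X_theta) / h^2 should approach I_X(theta)/2 as h -> 0.
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    ratio = kl_bernoulli(theta + h, theta) / h**2
    print(f"h = {h:.0e}:  KL/h^2 = {ratio:.6f}   (I_X(theta)/2 = {fisher / 2:.6f})")
```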

Right side

The right side of the inequality can be developed as follows:

\lim_{h\to 0} \frac{\Psi^*_\theta(\mu_{\theta+h})}{h^2} = \lim_{h\to 0} \frac 1 {h^2} \sup_t \left\{\mu_{\theta+h}\,t - \Psi_\theta(t)\right\}.

This supremum is attained at a value of t = τ where the first derivative of the cumulant-generating function is \Psi'_\theta(\tau) = \mu_{\theta+h}, but we have \Psi'_\theta(0) = \mu_\theta, so that

\Psi''_\theta(0) = \frac{\mathrm d\mu_\theta}{\mathrm d\theta} \lim_{h\to 0} \frac h \tau.

Moreover,

\lim_{h\to 0} \frac{\Psi^*_\theta(\mu_{\theta+h})}{h^2} = \frac 1 {2\Psi''_\theta(0)} \left(\frac{\mathrm d\mu_\theta}{\mathrm d\theta}\right)^2 = \frac 1 {2\operatorname{Var}(X_\theta)} \left(\frac{\mathrm d\mu_\theta}{\mathrm d\theta}\right)^2.
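
As with the left side, this limit can be checked on a family whose cumulant-generating function is explicit; the Poisson(θ) family below is an illustrative choice (not from the source), with \Psi_\theta^*(m) = m\log(m/\theta) - m + \theta, \mu_\theta = \theta and \operatorname{Var}(X_\theta) = \theta:

```python
import numpy as np

# Illustrative Poisson(theta) family: Psi_theta(t) = theta*(exp(t) - 1),
# mu_theta = theta, d(mu_theta)/d(theta) = 1 and Var(X_theta) = theta.
theta = 2.0

def rate_poisson(m, theta):
    """Convex conjugate of the Poisson CGF: Psi_theta^*(m) = m*log(m/theta) - m + theta."""
    return m * np.log(m / theta) - m + theta

target = 1.0 / (2.0 * theta)           # (d mu_theta/d theta)^2 / (2 Var(X_theta))

# Psi_theta^*(mu_{theta+h}) / h^2 should approach the target as h -> 0.
for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    ratio = rate_poisson(theta + h, theta) / h**2
    print(f"h = {h:.0e}:  Psi^*/h^2 = {ratio:.6f}   (target = {target:.6f})")
```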

Putting both sides back together

We have:

\frac 1 2 \mathcal I_X(\theta) \ge \frac 1 {2\operatorname{Var}(X_\theta)} \left(\frac{\mathrm d\mu_\theta}{\mathrm d\theta}\right)^2,

which can be rearranged as:

\operatorname{Var}(X_\theta) \ge \frac{\left(\mathrm d\mu_\theta/\mathrm d\theta\right)^2}{\mathcal I_X(\theta)}.
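
As a numerical illustration of this bound (the logistic location family and the Monte Carlo estimate below are choices made for this example, not part of the source), a logistic location family has \operatorname{Var}(X_\theta) = (\pi s)^2/3 \approx 3.29\,s^2, which indeed exceeds (\mathrm d\mu_\theta/\mathrm d\theta)^2/\mathcal I_X(\theta) = 3s^2:

```python
import numpy as np

# Illustrative location family: X_theta = theta + logistic noise with scale s, so that
# mu_theta = theta, d(mu_theta)/d(theta) = 1, Var(X_theta) = (pi*s)^2/3, I_X(theta) = 1/(3*s^2).
rng = np.random.default_rng(0)
s, theta, n = 1.0, 0.0, 500_000

x = rng.logistic(loc=theta, scale=s, size=n)      # samples from X_theta
score = np.tanh((x - theta) / (2 * s)) / s        # d/d(theta) of the log-density
fisher_mc = np.mean(score**2)                     # Monte Carlo estimate of I_X(theta)

var_x = (np.pi * s)**2 / 3                        # exact variance of X_theta
bound = 1.0 / fisher_mc                           # (d mu_theta/d theta)^2 / I_X(theta)
print(f"Var(X_theta) = {var_x:.3f} >= bound = {bound:.3f}   (exact bound: {3 * s**2:.3f})")
```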


Notes and references

  1. Fuchs, Aimé; Letta, Giorgio (1970). "L'inégalité de Kullback. Application à la théorie de l'estimation". Séminaire de Probabilités de Strasbourg 4: 108–131.