In information theory, Pinsker's inequality, named after its inventor Mark Semenovich Pinsker, is an inequality that bounds the total variation distance (or statistical distance) in terms of the Kullback–Leibler divergence. The inequality is tight up to constant factors.[1]
Pinsker's inequality states that, if P and Q are two probability distributions on a measurable space (X, \Sigma), then

\delta(P, Q) \le \sqrt{\frac{1}{2} D_{\mathrm{KL}}(P \parallel Q)},
where

\delta(P, Q) = \sup\bigl\{\, |P(A) - Q(A)| \mid A \in \Sigma \text{ is a measurable event} \,\bigr\}

is the total variation distance (or statistical distance) between P and Q, and

D_{\mathrm{KL}}(P \parallel Q) = \operatorname{E}_P\left(\log \frac{dP}{dQ}\right) = \int_X \left(\log \frac{dP}{dQ}\right) dP

is the Kullback–Leibler divergence in nats. When the sample space X is a finite set, the Kullback–Leibler divergence is given by

D_{\mathrm{KL}}(P \parallel Q) = \sum_i \left(\log \frac{P(i)}{Q(i)}\right) P(i).
Note that, in terms of the total variation norm \|P - Q\| of the signed measure P - Q, Pinsker's inequality differs from the one given above by a factor of two:

\|P - Q\| \le \sqrt{2 D_{\mathrm{KL}}(P \parallel Q)}.
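As a numerical illustration (a minimal sketch, not part of the statement, assuming NumPy and arbitrary example distributions p and q), both forms of the inequality can be checked for discrete distributions:

    import numpy as np

    def total_variation(p, q):
        # delta(P, Q) = sup_A |P(A) - Q(A)| = (1/2) * sum_i |p_i - q_i| for discrete distributions
        return 0.5 * np.abs(p - q).sum()

    def kl_divergence(p, q):
        # D_KL(P || Q) in nats; assumes Q(i) > 0 wherever P(i) > 0
        mask = p > 0
        return np.sum(p[mask] * np.log(p[mask] / q[mask]))

    p = np.array([0.1, 0.4, 0.5])    # example distribution P (assumed values)
    q = np.array([0.25, 0.25, 0.5])  # example distribution Q (assumed values)

    delta = total_variation(p, q)
    kl = kl_divergence(p, q)

    print(delta <= np.sqrt(0.5 * kl))    # Pinsker's inequality: delta <= sqrt(D_KL / 2)
    print(2 * delta <= np.sqrt(2 * kl))  # total variation norm form: ||P - Q|| = 2 * delta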
A proof of Pinsker's inequality uses the partition inequality for f-divergences.
Note that the expression of Pinsker's inequality depends on which base of logarithm is used in the definition of the KL-divergence: D_{\mathrm{KL}} above is defined using the natural logarithm \ln (base e), whereas D is defined using \log_2, so that

D(P \parallel Q) = \frac{D_{\mathrm{KL}}(P \parallel Q)}{\ln 2}.
Given the above comments, there is an alternative statement of Pinsker's inequality in some literature that relates information divergence to variation distance:
D(P \parallel Q) = \frac{D_{\mathrm{KL}}(P \parallel Q)}{\ln 2} \ge \frac{1}{2 \ln 2} V^2(p, q),

i.e.,

\sqrt{\frac{D_{\mathrm{KL}}(P \parallel Q)}{2}} \ge \frac{V(p, q)}{2},

in which

V(p, q) = \sum_{x \in \mathcal{X}} |p(x) - q(x)|

is the (non-normalized) variation distance between two probability distributions p and q on the same alphabet \mathcal{X}, so that V(p, q) = 2\delta(P, Q).
This form of Pinsker's inequality shows that "convergence in divergence" is a stronger notion than "convergence in variation distance".
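The base-2 form can be checked in the same way (a minimal sketch with the same assumed example distributions as above):

    import numpy as np

    p = np.array([0.1, 0.4, 0.5])
    q = np.array([0.25, 0.25, 0.5])

    V = np.abs(p - q).sum()             # V(p, q) = sum_x |p(x) - q(x)| = 2 * delta(P, Q)
    D = np.sum(p * np.log2(p / q))      # divergence in bits: D(P || Q) = D_KL(P || Q) / ln 2

    print(D >= V**2 / (2 * np.log(2)))  # alternative statement: D >= V^2 / (2 ln 2)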
A simple proof by John Pollard is shown by letting r(x) = P(x)/Q(x) - 1 \ge -1 (so that \operatorname{E}_Q[r(x)] = 0) and using the elementary inequality (1 + r)\log(1 + r) - r \ge \frac{r^2}{2(1 + r/3)} for r \ge -1:

\begin{align}
D_{\mathrm{KL}}(P \parallel Q) &= \operatorname{E}_Q[(1 + r(x))\log(1 + r(x)) - r(x)] \\
&\ge \frac{1}{2} \operatorname{E}_Q\left[\frac{r(x)^2}{1 + r(x)/3}\right] \\
&\ge \frac{1}{2} \frac{\operatorname{E}_Q[|r(x)|]^2}{\operatorname{E}_Q[1 + r(x)/3]} && \text{(from Titu's lemma)} \\
&= \frac{1}{2} \operatorname{E}_Q[|r(x)|]^2 && \text{(since } \operatorname{E}_Q[r(x)] = 0 \text{, so } \operatorname{E}_Q[1 + r(x)/3] = 1\text{)} \\
&= \frac{1}{2} V(p, q)^2.
\end{align}
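The second line of the display uses the pointwise bound (1 + r)\log(1 + r) - r \ge r^2 / (2(1 + r/3)) for r \ge -1 mentioned above; a quick numerical sanity check (an illustration assuming NumPy, not a proof) is:

    import numpy as np

    # check (1 + r) * log(1 + r) - r >= r^2 / (2 * (1 + r / 3)) on a grid of r > -1
    r = np.linspace(-0.999, 10.0, 100001)
    lhs = (1 + r) * np.log1p(r) - r
    rhs = r**2 / (2 * (1 + r / 3))
    print(np.all(lhs >= rhs - 1e-12))   # True up to floating-point tolerance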
Note that the lower bound from Pinsker's inequality is vacuous for any distributions where D_{\mathrm{KL}}(P \parallel Q) > 2, since the total variation distance is at most 1. For such distributions, an alternative bound can be used, due to Bretagnolle and Huber:

\delta(P, Q) \le \sqrt{1 - e^{-D_{\mathrm{KL}}(P \parallel Q)}}.
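For instance (a minimal sketch with assumed example distributions), when D_{\mathrm{KL}}(P \parallel Q) > 2 the Pinsker bound exceeds 1 and is therefore uninformative, while the bound above remains below 1:

    import numpy as np

    p = np.array([0.99, 0.01])  # example P (assumed values)
    q = np.array([0.01, 0.99])  # example Q (assumed values)

    delta = 0.5 * np.abs(p - q).sum()  # total variation distance, here 0.98
    kl = np.sum(p * np.log(p / q))     # D_KL(P || Q) in nats, here about 4.5 > 2

    print(np.sqrt(0.5 * kl))           # Pinsker bound, about 1.5: exceeds 1, hence vacuous
    print(np.sqrt(1 - np.exp(-kl)))    # alternative bound, about 0.994, still valid for delta = 0.98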
Pinsker first proved the inequality with a greater constant. The inequality in the above form was proved independently by Kullback, Csiszár, and Kemperman.[5]
A precise inverse of the inequality cannot hold: for every \varepsilon > 0, there are distributions P_\varepsilon, Q with \delta(P_\varepsilon, Q) \le \varepsilon but D_{\mathrm{KL}}(P_\varepsilon \parallel Q) = \infty. An easy example is given by the two-point space \{0, 1\} with Q(0) = 0, Q(1) = 1 and P_\varepsilon(0) = \varepsilon, P_\varepsilon(1) = 1 - \varepsilon.
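Written out, this example gives \delta(P_\varepsilon, Q) = |P_\varepsilon(\{0\}) - Q(\{0\})| = \varepsilon, while P_\varepsilon puts mass \varepsilon on the point 0 where Q puts none, so P_\varepsilon is not absolutely continuous with respect to Q and D_{\mathrm{KL}}(P_\varepsilon \parallel Q) = \infty.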
However, an inverse inequality holds on finite spaces X with a constant depending on Q. More specifically, it can be shown that with the definition \alpha_Q := \min_{x \in X} Q(x), we have, for any measure P which is absolutely continuous with respect to Q,

\frac{1}{2} D_{\mathrm{KL}}(P \parallel Q) \le \frac{1}{\alpha_Q} \delta(P, Q)^2.
As a consequence, if Q has full support (i.e. Q(x) > 0 for all x \in X), then

\delta(P, Q)^2 \le \frac{1}{2} D_{\mathrm{KL}}(P \parallel Q) \le \frac{1}{\alpha_Q} \delta(P, Q)^2.
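Both directions can be checked together on a small example (a minimal sketch assuming NumPy and example distributions, with Q having full support):

    import numpy as np

    p = np.array([0.2, 0.3, 0.5])  # example P (assumed values)
    q = np.array([0.1, 0.6, 0.3])  # example Q with full support (assumed values)

    delta = 0.5 * np.abs(p - q).sum()
    kl = np.sum(p * np.log(p / q))
    alpha_q = q.min()                  # alpha_Q = min_x Q(x)

    # chained comparison checks delta^2 <= D_KL / 2 and D_KL / 2 <= delta^2 / alpha_Q
    print(delta**2 <= 0.5 * kl <= delta**2 / alpha_q)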