In information theory, Gibbs' inequality is a statement about the information entropy of a discrete probability distribution. Several other bounds on the entropy of probability distributions are derived from Gibbs' inequality, including Fano's inequality. It was first presented by J. Willard Gibbs in the 19th century.
Suppose that P = \{p_1, \ldots, p_n\} and Q = \{q_1, \ldots, q_n\} are discrete probability distributions. Then

-\sum_{i=1}^{n} p_i \log p_i \leq -\sum_{i=1}^{n} p_i \log q_i

with equality if and only if p_i = q_i for all i = 1, \ldots, n. In other words, the information entropy of the distribution P is less than or equal to its cross-entropy with any other distribution Q.
The difference between the two quantities is the Kullback–Leibler divergence or relative entropy, so the inequality can also be written:[2]
D_{\mathrm{KL}}(P \| Q) \equiv \sum_{i=1}^{n} p_i \log \frac{p_i}{q_i} \geq 0.
Note that the use of base-2 logarithms is optional, and allows one to refer to the quantity on each side of the inequality as an "average surprisal" measured in bits.
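Gibbs' inequality is easy to check numerically. The following sketch (an illustration of my own rather than part of the source, assuming NumPy is available) draws two random distributions and verifies that the entropy of P never exceeds its cross-entropy with Q, i.e. that the Kullback–Leibler divergence is non-negative:

    import numpy as np

    rng = np.random.default_rng(0)

    def entropy(p):
        # Shannon entropy in bits, with the convention 0 * log 0 = 0
        p = p[p > 0]
        return -np.sum(p * np.log2(p))

    def cross_entropy(p, q):
        # cross-entropy H(P, Q) in bits; assumes q_i > 0 wherever p_i > 0
        mask = p > 0
        return -np.sum(p[mask] * np.log2(q[mask]))

    n = 5
    p = rng.dirichlet(np.ones(n))   # a random distribution P
    q = rng.dirichlet(np.ones(n))   # a random distribution Q

    print(entropy(p) <= cross_entropy(p, q) + 1e-12)   # True: Gibbs' inequality
    print(cross_entropy(p, q) - entropy(p))             # D_KL(P || Q) >= 0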
For simplicity, we prove the statement using the natural logarithm, denoted by \ln, since

\log_b a = \frac{\ln a}{\ln b},

so the particular logarithm base b > 1 that we choose only scales the relationship by the constant factor 1/\ln b.
Let I denote the set of all i for which p_i is non-zero. Then, since \ln x \leq x - 1 for all x > 0, with equality if and only if x = 1, we have:

-\sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq -\sum_{i \in I} p_i \left( \frac{q_i}{p_i} - 1 \right) = -\sum_{i \in I} q_i + \sum_{i \in I} p_i = -\sum_{i \in I} q_i + 1 \geq 0.
The last inequality is a consequence of the p_i and q_i being part of a probability distribution. Specifically, the sum of all non-zero values is 1. Some non-zero q_i, however, may have been excluded, since the choice of indices is conditioned upon the p_i being non-zero. Therefore the sum of the q_i may be less than 1.
So far, over the index set I, we have

-\sum_{i \in I} p_i \ln \frac{q_i}{p_i} \geq 0,

or equivalently

-\sum_{i \in I} p_i \ln q_i \geq -\sum_{i \in I} p_i \ln p_i.
Both sums can be extended to all i = 1, \ldots, n, i.e. including p_i = 0, by recalling that the expression p \ln p tends to 0 as p tends to 0, and that (-\ln q) tends to \infty as q tends to 0. Therefore

-\sum_{i=1}^{n} p_i \ln q_i \geq -\sum_{i=1}^{n} p_i \ln p_i.
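The bookkeeping over the support set I can also be checked directly. The sketch below (my own illustration, assuming NumPy; the two distributions are arbitrary examples) restricts the sums to the indices where p_i > 0 and shows that the q_i over I may indeed sum to less than 1 while the quantity -\sum_{i \in I} p_i \ln(q_i/p_i) remains non-negative:

    import numpy as np

    p = np.array([0.5, 0.5, 0.0])     # p_3 = 0, so the index set I is {1, 2}
    q = np.array([0.25, 0.25, 0.5])   # Q places mass outside the support of P

    I = p > 0                          # indices where p_i is non-zero
    restricted = -np.sum(p[I] * np.log(q[I] / p[I]))

    print(q[I].sum())     # 0.5 -- the q_i over I need not sum to 1
    print(restricted)     # ln 2 ~ 0.693 >= 0, as the proof guarantees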
For equality to hold, we require that q_i/p_i = 1 for all i \in I (so that \ln \frac{q_i}{p_i} = \frac{q_i}{p_i} - 1 holds with equality), and that \sum_{i \in I} q_i = 1, which means q_i = 0 whenever i \notin I, that is, q_i = 0 whenever p_i = 0.

This can happen if and only if p_i = q_i for all i = 1, \ldots, n.
The result can alternatively be proved using Jensen's inequality, the log sum inequality, or the fact that the Kullback-Leibler divergence is a form of Bregman divergence.
Because log is a concave function, we have that:
\sum_i p_i \log \frac{q_i}{p_i} \leq \log \sum_i p_i \frac{q_i}{p_i} = \log \sum_i q_i \leq 0,

where the first inequality is due to Jensen's inequality, and the final inequality follows for the same reason given in the proof above.
Furthermore, since \log is strictly concave, the equality condition of Jensen's inequality gives equality exactly when

\frac{q_1}{p_1} = \frac{q_2}{p_2} = \cdots = \frac{q_n}{p_n}

and \sum_i q_i = 1. Suppose that this common ratio is \sigma. Then we have

1 = \sum_i q_i = \sum_i \sigma p_i = \sigma,

where we use the fact that P and Q are probability distributions. Therefore equality holds if and only if p_i = q_i for all i, that is, when P = Q.
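As a numerical sanity check on the Jensen step (again a sketch of my own, assuming NumPy), one can verify that the p_i-weighted average of \log(q_i/p_i) is at most the logarithm of the p_i-weighted average of q_i/p_i, which collapses to \log \sum_i q_i = 0:

    import numpy as np

    rng = np.random.default_rng(1)
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))

    ratio = q / p
    lhs = np.sum(p * np.log(ratio))    # sum_i p_i log(q_i / p_i)
    rhs = np.log(np.sum(p * ratio))    # log sum_i q_i
    print(lhs <= rhs + 1e-12)          # True, by concavity of log (Jensen)
    print(rhs)                         # ~0, since the q_i sum to 1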
Alternatively, the inequality can be proved by noting that

q - p - p \ln \frac{q}{p} \geq 0

for all p, q > 0, with equality if and only if p = q. Summing this pointwise inequality over all states gives D_{\mathrm{KL}}(P \| Q) \geq 0, with equality if and only if p_i = q_i for every i, i.e. P = Q. This is because the KL divergence is the Bregman divergence generated by the function t \mapsto t \ln t.
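The pointwise inequality behind the Bregman-divergence argument can likewise be verified on a grid of positive values; the snippet below (an illustration of my own, assuming NumPy) confirms that q - p - p \ln(q/p) is non-negative and vanishes along the diagonal p = q:

    import numpy as np

    vals = np.linspace(0.01, 2.0, 200)
    P, Q = np.meshgrid(vals, vals)

    pointwise = Q - P - P * np.log(Q / P)    # per-state term of the KL divergence
    print(pointwise.min() >= -1e-12)         # True: non-negative everywhere
    print(np.abs(np.diag(pointwise)).max())  # ~0 along the diagonal p = q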
The entropy of P is bounded by

H(p_1, \ldots, p_n) \leq \log n.

The proof is trivial: simply set q_i = 1/n for all i and apply Gibbs' inequality.
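A quick numerical check of the corollary (my own sketch, assuming NumPy): the entropy of a random distribution never exceeds \log n, and the bound is attained by the uniform distribution q_i = 1/n:

    import numpy as np

    rng = np.random.default_rng(2)
    n = 8
    p = rng.dirichlet(np.ones(n))

    H = -np.sum(p[p > 0] * np.log(p[p > 0]))   # entropy of P (natural log)
    print(H <= np.log(n) + 1e-12)              # True: H(P) <= log n

    u = np.full(n, 1 / n)                      # the uniform distribution
    print(-np.sum(u * np.log(u)), np.log(n))   # both equal log n ~ 2.079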