In information theory, the Bretagnolle–Huber inequality bounds the total variation distance between two probability distributions P and Q by a concave and bounded function of the Kullback–Leibler divergence D_{KL}(P\parallel Q). Unlike the bound given by Pinsker's inequality, it remains smaller than 1 even when D_{KL}(P\parallel Q) is large.
Let P and Q be two probability distributions on a measurable space (\mathcal{X},\mathcal{F}). Recall that the total variation distance between P and Q is defined by

d_{TV}(P,Q)=\sup_{A\in\mathcal{F}}\{|P(A)-Q(A)|\}.
The Kullback–Leibler divergence is defined as follows:

D_{KL}(P\parallel Q)=\begin{cases}\int_{\mathcal{X}}\log\left(\frac{dP}{dQ}\right)\,dP & \text{if } P\ll Q,\\[4pt] +\infty & \text{otherwise,}\end{cases}

where \frac{dP}{dQ} denotes the Radon–Nikodym derivative of P with respect to Q, defined whenever P is absolutely continuous with respect to Q (written P\ll Q).
The Bretagnolle–Huber inequality says:
d_{TV}(P,Q)\leq\sqrt{1-\exp(-D_{KL}(P\parallel Q))}\leq 1-\tfrac{1}{2}\exp(-D_{KL}(P\parallel Q)).
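For a quick numerical illustration (a minimal Python sketch with hypothetical helper names, checking the bound for two Bernoulli distributions; it is not part of the original statement):

    import math

    def tv_bernoulli(p, q):
        # total variation distance between Bernoulli(p) and Bernoulli(q)
        return abs(p - q)

    def kl_bernoulli(p, q):
        # Kullback-Leibler divergence D_KL(Bernoulli(p) || Bernoulli(q))
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    p, q = 0.5, 0.8
    tv = tv_bernoulli(p, q)
    kl = kl_bernoulli(p, q)
    # Bretagnolle-Huber: tv <= sqrt(1 - exp(-kl)) <= 1 - exp(-kl) / 2
    print(tv, math.sqrt(1.0 - math.exp(-kl)), 1.0 - 0.5 * math.exp(-kl))
    # prints approximately 0.3, 0.447 and 0.6, so both inequalities hold here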
The following version is directly implied by the bound above, but some authors[2] prefer stating it this way. Let A\in\mathcal{F} be an arbitrary event. Then

P(A)+Q(\bar{A})\geq\tfrac{1}{2}\exp(-D_{KL}(P\parallel Q)),
where \bar{A}=\mathcal{X}\smallsetminus A is the complement of A.
Indeed, by definition of the total variation distance, for any A\in\mathcal{F},

\begin{align} Q(A)-P(A)\leq d_{TV}(P,Q)&\leq 1-\tfrac{1}{2}\exp(-D_{KL}(P\parallel Q))\\ &=Q(A)+Q(\bar{A})-\tfrac{1}{2}\exp(-D_{KL}(P\parallel Q)), \end{align}

where the last equality uses Q(A)+Q(\bar{A})=1. Rearranging, we obtain the claimed lower bound on P(A)+Q(\bar{A}).
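For a concrete illustration, take P=\mathrm{Bernoulli}(1/2), Q=\mathrm{Bernoulli}(4/5) and A=\{1\} (an example chosen here, not from the original): then D_{KL}(P\parallel Q)=\log\tfrac{5}{4}, so the bound reads P(A)+Q(\bar{A})\geq\tfrac{1}{2}\cdot\tfrac{4}{5}=0.4, and indeed P(A)+Q(\bar{A})=\tfrac{1}{2}+\tfrac{1}{5}=0.7.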
We prove the main statement following the ideas in Tsybakov's book (Lemma 2.6, page 89),[3] which differ from the original proof (see C. Canonne's note for a modernized retranscription of their argument).
The proof is in two steps:
1. Prove, using Cauchy–Schwarz, that the total variation distance is related to the Bhattacharyya coefficient (the right-hand side of the inequality):

1-d_{TV}(P,Q)^2\geq\left(\int\sqrt{PQ}\right)^2.
2. Prove, by a clever application of Jensen's inequality, that

\left(\int\sqrt{PQ}\right)^2\geq\exp(-D_{KL}(P\parallel Q)).
Step 1: First notice that

d_{TV}(P,Q)=1-\int\min(P,Q)=\int\max(P,Q)-1.
To see this, denote A^*=\operatorname{arg\,max}_{A\in\mathcal{F}}|P(A)-Q(A)| and assume without loss of generality that P(A^*)>Q(A^*), so that P\geq Q on A^* and therefore

d_{TV}(P,Q)=P(A^*)-Q(A^*)=\int_{A^*}\max(P,Q)-\int_{A^*}\min(P,Q).
Adding and removing \int_{\bar{A}^*}\max(P,Q) or \int_{\bar{A}^*}\min(P,Q) (which equal Q(\bar{A}^*) and P(\bar{A}^*) respectively, since Q\geq P on \bar{A}^*) then yields the two identities above.
Then

\begin{align} 1-d_{TV}(P,Q)^2&=(1-d_{TV}(P,Q))(1+d_{TV}(P,Q))\\ &=\int\min(P,Q)\int\max(P,Q)\\ &\geq\left(\int\sqrt{\min(P,Q)\max(P,Q)}\right)^2\\ &=\left(\int\sqrt{PQ}\right)^2, \end{align}

where the inequality is the Cauchy–Schwarz inequality and the last equality holds because PQ=\min(P,Q)\max(P,Q).
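Continuing the Bernoulli illustration above with P=\mathrm{Bernoulli}(1/2) and Q=\mathrm{Bernoulli}(4/5): \int\min(P,Q)=0.7 and \int\max(P,Q)=1.3, so d_{TV}(P,Q)=0.3 and 1-d_{TV}(P,Q)^2=0.91, while \left(\int\sqrt{PQ}\right)^2=(\sqrt{0.4}+\sqrt{0.1})^2=0.9, consistent with the inequality.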
Step 2: We write (\,\cdot\,)^2=\exp(2\log(\,\cdot\,)) and apply Jensen's inequality:

\begin{align} \left(\int\sqrt{PQ}\right)^2&=\exp\left(2\log\left(\int\sqrt{PQ}\right)\right)\\ &=\exp\left(2\log\left(\int P\sqrt{\frac{Q}{P}}\right)\right)\\ &=\exp\left(2\log\left(\operatorname{E}_P\left[\exp\left(-\tfrac{1}{2}\log\tfrac{P}{Q}\right)\right]\right)\right)\\ &\geq\exp\left(\operatorname{E}_P\left[-\log\tfrac{P}{Q}\right]\right)\\ &=\exp(-D_{KL}(P\parallel Q)), \end{align}

where the inequality uses Jensen's inequality in the form \log\operatorname{E}_P[e^{X}]\geq\operatorname{E}_P[X], applied with X=-\tfrac{1}{2}\log\tfrac{P}{Q}.
Combining the results of steps 1 and 2 leads to the claimed bound on the total variation.
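In more detail: steps 1 and 2 give 1-d_{TV}(P,Q)^2\geq\left(\int\sqrt{PQ}\right)^2\geq\exp(-D_{KL}(P\parallel Q)), hence d_{TV}(P,Q)\leq\sqrt{1-\exp(-D_{KL}(P\parallel Q))}, and the second inequality of the theorem follows from \sqrt{1-x}\leq 1-\tfrac{x}{2} for x\in[0,1].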
A first application of the inequality is to the sample complexity of biased coin tosses. The question is: how many coin tosses do I need to distinguish a fair coin from a biased one?
Assume you have two coins, a fair coin (Bernoulli distributed with mean p_1=1/2) and an \varepsilon-biased coin (Bernoulli distributed with mean p_2=1/2+\varepsilon). Then, in order to identify the biased coin with probability at least 1-\delta (for some \delta>0), the number of tosses n must satisfy

n\geq\frac{1}{2\varepsilon^2}\log\left(\frac{1}{2\delta}\right).
In order to obtain this lower bound, we impose that the total variation distance between the two sequences of n samples is at least 1-2\delta: any procedure that identifies the biased coin with probability at least 1-\delta yields an event whose probabilities under the two coins differ by at least 1-2\delta. Denote by P_1^n and P_2^n the respective joint distributions of the n coin tosses under each coin.
We have
\begin{align} (1-2\delta)^2&\leq d_{TV}\left(P_1^n,P_2^n\right)^2\\[4pt] &\leq 1-e^{-D_{KL}(P_1^n\parallel P_2^n)}\\[4pt] &=1-e^{-nD_{KL}(P_1\parallel P_2)}\\[4pt] &=1-e^{-\frac{n}{2}\log\left(\frac{1}{1-4\varepsilon^2}\right)}, \end{align}

where the equalities use D_{KL}(P_1^n\parallel P_2^n)=nD_{KL}(P_1\parallel P_2) for product distributions and D_{KL}(P_1\parallel P_2)=\frac{1}{2}\log\left(\frac{1}{1-4\varepsilon^2}\right) for the two Bernoulli distributions. Rearranging the terms yields the stated lower bound on n.
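As a rough numerical sketch (the helper name is hypothetical; the function simply evaluates the stated formula):

    import math

    def coin_toss_lower_bound(eps, delta):
        # stated lower bound on the number of tosses needed to distinguish
        # Bernoulli(1/2) from Bernoulli(1/2 + eps) with error probability
        # at most delta: n >= log(1 / (2 * delta)) / (2 * eps**2)
        return math.log(1.0 / (2.0 * delta)) / (2.0 * eps ** 2)

    # detecting a 5% bias with 99% confidence requires several hundred tosses
    print(coin_toss_lower_bound(0.05, 0.01))  # approximately 782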
A second application arises in multi-armed bandit problems: a lower bound on the minimax regret of any bandit algorithm can be proved using Bretagnolle–Huber and its consequence on hypothesis testing (see Chapter 15 of Bandit Algorithms[2]).
The result was first proved in 1979 by Jean Bretagnolle and Catherine Huber, and published in the proceedings of the Strasbourg Probability Seminar. Alexandre Tsybakov's book features an early re-publication of the inequality and its attribution to Bretagnolle and Huber, which is presented as an early and less general version of Assouad's lemma (see notes 2.8). A constant improvement on Bretagnolle–Huber was proved in 2014 as a consequence of an extension of Fano's Inequality.[4]