In information theory, the Bretagnolle–Huber inequality bounds the total variation distance between two probability distributions P and Q by a concave and bounded function of the Kullback–Leibler divergence D_{KL}(P\parallel Q). Unlike the bound given by Pinsker's inequality, it remains smaller than 1 even when D_{KL}(P\parallel Q) is large.
Let P and Q be two probability distributions on a measurable space (\mathcal{X},\mathcal{F}). Recall that the total variation distance between P and Q is defined by

d_{TV}(P,Q)=\sup_{A\in\mathcal{F}}\{|P(A)-Q(A)|\}.
The Kullback–Leibler divergence is defined as follows:

D_{KL}(P\parallel Q)=\begin{cases}\int_{\mathcal{X}}\log\left(\frac{dP}{dQ}\right)\,dP & \text{if } P\ll Q,\\[4pt] +\infty & \text{otherwise,}\end{cases}

where \frac{dP}{dQ} denotes the Radon–Nikodym derivative of P with respect to Q, defined whenever P is absolutely continuous with respect to Q (written P\ll Q).
The Bretagnolle–Huber inequality says:
d_{TV}(P,Q)\leq\sqrt{1-\exp(-D_{KL}(P\parallel Q))}\leq 1-\tfrac{1}{2}\exp(-D_{KL}(P\parallel Q)).
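For a quick numerical illustration (a minimal Python sketch with hypothetical helper names, checking the bound for two Bernoulli distributions; it is not part of the original statement):

    import math

    def tv_bernoulli(p, q):
        # total variation distance between Bernoulli(p) and Bernoulli(q)
        return abs(p - q)

    def kl_bernoulli(p, q):
        # Kullback-Leibler divergence D_KL(Bernoulli(p) || Bernoulli(q))
        return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

    p, q = 0.5, 0.8
    tv = tv_bernoulli(p, q)
    kl = kl_bernoulli(p, q)
    # Bretagnolle-Huber: tv <= sqrt(1 - exp(-kl)) <= 1 - exp(-kl) / 2
    print(tv, math.sqrt(1.0 - math.exp(-kl)), 1.0 - 0.5 * math.exp(-kl))
    # prints approximately 0.3, 0.447 and 0.6, so both inequalities hold here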
The following version is directly implied by the bound above, but some authors[2] prefer stating it this way. Let A\in\mathcal{F} be an arbitrary event. Then

P(A)+Q(\bar{A})\geq\tfrac{1}{2}\exp(-D_{KL}(P\parallel Q)),
where \bar{A}=\mathcal{X}\smallsetminus A is the complement of A.
Indeed, by definition of the total variation distance, for any A\in\mathcal{F},

\begin{align} Q(A)-P(A)\leq d_{TV}(P,Q)&\leq 1-\tfrac{1}{2}\exp(-D_{KL}(P\parallel Q))\\ &=Q(A)+Q(\bar{A})-\tfrac{1}{2}\exp(-D_{KL}(P\parallel Q)), \end{align}

where the last equality uses Q(A)+Q(\bar{A})=1. Rearranging, we obtain the claimed lower bound on P(A)+Q(\bar{A}).
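For a concrete illustration, take P=\mathrm{Bernoulli}(1/2), Q=\mathrm{Bernoulli}(4/5) and A=\{1\} (an example chosen here, not from the original): then D_{KL}(P\parallel Q)=\log\tfrac{5}{4}, so the bound reads P(A)+Q(\bar{A})\geq\tfrac{1}{2}\cdot\tfrac{4}{5}=0.4, and indeed P(A)+Q(\bar{A})=\tfrac{1}{2}+\tfrac{1}{5}=0.7.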
We prove the main statement following the ideas in Tsybakov's book (Lemma 2.6, page 89),[3] which differ from the original proof (see C. Canonne's note for a modernized retranscription of their argument).
The proof is in two steps:
1. Prove, using Cauchy–Schwarz, that the total variation distance is related to the Bhattacharyya coefficient (the right-hand side of the inequality):

1-d_{TV}(P,Q)^2\geq\left(\int\sqrt{PQ}\right)^2.
2. Prove, by a clever application of Jensen's inequality, that

\left(\int\sqrt{PQ}\right)^2\geq\exp(-D_{KL}(P\parallel Q)).
Step 1: First notice that

d_{TV}(P,Q)=1-\int\min(P,Q)=\int\max(P,Q)-1.
To see this, denote A^*=\operatorname{arg\,max}_{A\in\mathcal{F}}|P(A)-Q(A)| and assume without loss of generality that P(A^*)>Q(A^*), so that P\geq Q on A^* and therefore

d_{TV}(P,Q)=P(A^*)-Q(A^*)=\int_{A^*}\max(P,Q)-\int_{A^*}\min(P,Q).
Adding and removing \int_{\bar{A}^*}\max(P,Q) or \int_{\bar{A}^*}\min(P,Q) (which equal Q(\bar{A}^*) and P(\bar{A}^*) respectively, since Q\geq P on \bar{A}^*) then yields the two identities above.
Then

\begin{align} 1-d_{TV}(P,Q)^2&=(1-d_{TV}(P,Q))(1+d_{TV}(P,Q))\\ &=\int\min(P,Q)\int\max(P,Q)\\ &\geq\left(\int\sqrt{\min(P,Q)\max(P,Q)}\right)^2\\ &=\left(\int\sqrt{PQ}\right)^2, \end{align}

where the inequality is the Cauchy–Schwarz inequality and the last equality holds because PQ=\min(P,Q)\max(P,Q).
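Continuing the Bernoulli illustration above with P=\mathrm{Bernoulli}(1/2) and Q=\mathrm{Bernoulli}(4/5): \int\min(P,Q)=0.7 and \int\max(P,Q)=1.3, so d_{TV}(P,Q)=0.3 and 1-d_{TV}(P,Q)^2=0.91, while \left(\int\sqrt{PQ}\right)^2=(\sqrt{0.4}+\sqrt{0.1})^2=0.9, consistent with the inequality.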
Step 2: We write (\,\cdot\,)^2=\exp(2\log(\,\cdot\,)) and apply Jensen's inequality:

\begin{align} \left(\int\sqrt{PQ}\right)^2&=\exp\left(2\log\left(\int\sqrt{PQ}\right)\right)\\ &=\exp\left(2\log\left(\int P\sqrt{\frac{Q}{P}}\right)\right)\\ &=\exp\left(2\log\left(\operatorname{E}_P\left[\exp\left(-\tfrac{1}{2}\log\tfrac{P}{Q}\right)\right]\right)\right)\\ &\geq\exp\left(\operatorname{E}_P\left[-\log\tfrac{P}{Q}\right]\right)\\ &=\exp(-D_{KL}(P\parallel Q)), \end{align}

where the inequality uses Jensen's inequality in the form \log\operatorname{E}_P[e^{X}]\geq\operatorname{E}_P[X], applied with X=-\tfrac{1}{2}\log\tfrac{P}{Q}.
Combining the results of steps 1 and 2 leads to the claimed bound on the total variation.
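In more detail: steps 1 and 2 give 1-d_{TV}(P,Q)^2\geq\left(\int\sqrt{PQ}\right)^2\geq\exp(-D_{KL}(P\parallel Q)), hence d_{TV}(P,Q)\leq\sqrt{1-\exp(-D_{KL}(P\parallel Q))}, and the second inequality of the theorem follows from \sqrt{1-x}\leq 1-\tfrac{x}{2} for x\in[0,1].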
A first application of the inequality is to the sample complexity of biased coin tosses. The question is: how many coin tosses do I need to distinguish a fair coin from a biased one?
Assume you have two coins, a fair coin (Bernoulli distributed with mean p_1=1/2) and an \varepsilon-biased coin (Bernoulli distributed with mean p_2=1/2+\varepsilon). Then, in order to identify the biased coin with probability at least 1-\delta (for some \delta>0), the number of tosses n must satisfy

n\geq\frac{1}{2\varepsilon^2}\log\left(\frac{1}{2\delta}\right).
In order to obtain this lower bound, we impose that the total variation distance between the two sequences of n samples is at least 1-2\delta: any procedure that identifies the biased coin with probability at least 1-\delta yields an event whose probabilities under the two coins differ by at least 1-2\delta. Denote by P_1^n and P_2^n the respective joint distributions of the n coin tosses under each coin.
We have
\begin{align} (1-2\delta)^2&\leq d_{TV}\left(P_1^n,P_2^n\right)^2\\[4pt] &\leq 1-e^{-D_{KL}(P_1^n\parallel P_2^n)}\\[4pt] &=1-e^{-nD_{KL}(P_1\parallel P_2)}\\[4pt] &=1-e^{-\frac{n}{2}\log\left(\frac{1}{1-4\varepsilon^2}\right)}, \end{align}

where the equalities use D_{KL}(P_1^n\parallel P_2^n)=nD_{KL}(P_1\parallel P_2) for product distributions and D_{KL}(P_1\parallel P_2)=\frac{1}{2}\log\left(\frac{1}{1-4\varepsilon^2}\right) for the two Bernoulli distributions. Rearranging the terms yields the stated lower bound on n.
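As a rough numerical sketch (the helper name is hypothetical; the function simply evaluates the stated formula):

    import math

    def coin_toss_lower_bound(eps, delta):
        # stated lower bound on the number of tosses needed to distinguish
        # Bernoulli(1/2) from Bernoulli(1/2 + eps) with error probability
        # at most delta: n >= log(1 / (2 * delta)) / (2 * eps**2)
        return math.log(1.0 / (2.0 * delta)) / (2.0 * eps ** 2)

    # detecting a 5% bias with 99% confidence requires several hundred tosses
    print(coin_toss_lower_bound(0.05, 0.01))  # approximately 782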
A second application arises in multi-armed bandit problems: a lower bound on the minimax regret of any bandit algorithm can be proved using Bretagnolle–Huber and its consequence on hypothesis testing (see Chapter 15 of Bandit Algorithms[2]).
The result was first proved in 1979 by Jean Bretagnolle and Catherine Huber, and published in the proceedings of the Strasbourg Probability Seminar. Alexandre Tsybakov's book features an early re-publication of the inequality and its attribution to Bretagnolle and Huber, which is presented as an early and less general version of Assouad's lemma (see notes 2.8). A constant improvement on Bretagnolle–Huber was proved in 2014 as a consequence of an extension of Fano's Inequality.[4]