In probability theory, a Chernoff bound is an exponentially decreasing upper bound on the tail of a random variable based on its moment generating function. The minimum of all such exponential bounds forms the Chernoff or Chernoff-Cramér bound, which may decay faster than exponential (e.g. sub-Gaussian).[1] [2] It is especially useful for sums of independent random variables, such as sums of Bernoulli random variables.[3] [4]
The bound is commonly named after Herman Chernoff who described the method in a 1952 paper,[5] though Chernoff himself attributed it to Herman Rubin.[6] In 1938 Harald Cramér had published an almost identical concept now known as Cramér's theorem.
It is a sharper bound than the first- or second-moment-based tail bounds such as Markov's inequality or Chebyshev's inequality, which only yield power-law bounds on tail decay. However, when applied to sums the Chernoff bound requires the random variables to be independent, a condition that is not required by either Markov's inequality or Chebyshev's inequality.
The Chernoff bound is related to the Bernstein inequalities. It is also used to prove Hoeffding's inequality, Bennett's inequality, and McDiarmid's inequality.
The generic Chernoff bound for a random variable X is attained by applying Markov's inequality to e^{tX} (which is why it is sometimes called the exponential Markov or exponential moments bound). For positive t it gives a bound on the right tail of X in terms of its moment-generating function M(t) = \operatorname{E}(e^{tX}):

\operatorname{P}\left(X \geq a\right) = \operatorname{P}\left(e^{tX} \geq e^{ta}\right) \leq M(t) e^{-ta} \qquad (t > 0)

Since this bound holds for every positive t, we may take the infimum:

\operatorname{P}\left(X \geq a\right) \leq \inf_{t > 0} M(t) e^{-ta}

Performing the same analysis with negative t gives an analogous bound on the left tail:

\operatorname{P}\left(X \leq a\right) = \operatorname{P}\left(e^{tX} \geq e^{ta}\right) \leq M(t) e^{-ta} \qquad (t < 0)

and

\operatorname{P}\left(X \leq a\right) \leq \inf_{t < 0} M(t) e^{-ta}

The quantity M(t) e^{-ta} can be written as \operatorname{E}(e^{tX}) e^{-ta} = \operatorname{E}(e^{t(X-a)}). Since \operatorname{E}(e^{t(X-a)}) \geq e^{t(\operatorname{E}(X) - a)} by Jensen's inequality, the right-tail bound is trivial (at least 1) when a \leq \operatorname{E}(X), and the left-tail bound is trivial when a \geq \operatorname{E}(X); the Chernoff bound therefore gives nontrivial information only for values of a away from the mean of X.
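As a minimal numerical sketch (not part of the original text), the following Python code evaluates the generic right-tail bound \inf_{t>0} M(t)e^{-ta} by a grid search over t for a standard normal variable; the distribution, the grid, and the threshold a are arbitrary illustrative choices.

    import numpy as np

    # Generic Chernoff bound for the right tail of a standard normal variable:
    # Pr(X >= a) <= inf_{t>0} M(t) * exp(-t*a), with log M(t) = t^2 / 2.

    def chernoff_right_tail_bound(log_mgf, a, t_grid):
        """Minimize exp(K(t) - t*a) over the supplied grid of positive t."""
        exponents = np.array([log_mgf(t) - t * a for t in t_grid])
        return float(np.exp(exponents.min()))

    log_mgf_normal = lambda t: 0.5 * t**2   # cumulant generating function of N(0, 1)

    a = 2.0
    t_grid = np.linspace(1e-3, 10.0, 10_000)
    bound = chernoff_right_tail_bound(log_mgf_normal, a, t_grid)

    # For the standard normal the infimum is attained at t = a, giving exp(-a^2/2).
    print(bound, np.exp(-a**2 / 2))   # both approximately 0.1353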
The logarithm of the two-sided Chernoff bound is known as the rate function (or Cramér transform) I = -\log C. It is equivalent to the Legendre–Fenchel transform (convex conjugate) of the cumulant generating function K = \log M:

I(a) = \sup_{t}\left(at - K(t)\right)

The Chernoff bound attains its maximum at the mean, C(\operatorname{E}(X)) = 1, where it is trivial. The Chernoff bound is exact if and only if X is a single concentrated mass (a degenerate distribution), in which case the infimum is attained only in the limit of large |t|.
In practice, the exact Chernoff bound may be unwieldy or difficult to evaluate analytically, in which case a suitable upper bound on the moment (or cumulant) generating function may be used instead (e.g. a sub-parabolic CGF giving a sub-Gaussian Chernoff bound).
Distribution | \operatorname{E}(X) | K(t) | I(a) | C(a) = e^{-I(a)}
Normal distribution | 0 | \frac{\sigma^2 t^2}{2} | \frac{a^2}{2\sigma^2} | \exp\left(-\frac{a^2}{2\sigma^2}\right)
Bernoulli distribution (detailed below) | p | \ln\left(1-p+pe^t\right) | D_{KL}(a \parallel p) | \left(\frac{p}{a}\right)^{a}\left(\frac{1-p}{1-a}\right)^{1-a}
Standard Bernoulli (H is the binary entropy function) | \tfrac{1}{2} | \ln\left(1+e^t\right)-\ln(2) | \ln(2)-H(a) | \tfrac{1}{2}\,a^{-a}(1-a)^{-(1-a)}
Rademacher distribution | 0 | \ln\cosh(t) | \ln(2)-H\left(\tfrac{1+a}{2}\right) | \sqrt{(1+a)^{-1-a}(1-a)^{-1+a}}
Gamma distribution | \theta k | -k\ln(1-\theta t) | \frac{a}{\theta}-k+k\ln\frac{k\theta}{a} | \left(\frac{a}{k\theta}\right)^{k}e^{k-a/\theta}
Chi-squared distribution | k | -\frac{k}{2}\ln(1-2t) | \frac{k}{2}\left(\frac{a}{k}-1-\ln\frac{a}{k}\right) | \left(\frac{a}{k}\right)^{k/2}e^{(k-a)/2}
Poisson distribution | \lambda | \lambda(e^t-1) | a\ln(a/\lambda)-a+\lambda | (a/\lambda)^{-a}e^{a-\lambda}
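As a consistency check (a worked step added for illustration), the Poisson row follows directly from the Legendre–Fenchel transform of its cumulant generating function K(t) = \lambda(e^t - 1): the supremum of at - K(t) is attained at t^{*} = \ln(a/\lambda), giving

I(a) = a\ln\frac{a}{\lambda} - a + \lambda, \qquad C(a) = e^{-I(a)} = \left(\frac{a}{\lambda}\right)^{-a} e^{a-\lambda}.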
Using only the moment generating function, a lower bound on the tail probabilities can be obtained by applying the Paley–Zygmund inequality to e^{tX} and optimizing over t.
Theodosopoulos[9] constructed a tight(er) MGF-based lower bound using an exponential tilting procedure.
For particular distributions (such as the binomial) lower bounds of the same exponential order as the Chernoff bound are often available.
When X is the sum of n independent random variables X_1, \ldots, X_n, the moment generating function of X is the product of the individual moment generating functions, giving that:

\Pr(X \geq a) \leq \inf_{t > 0} e^{-ta} \prod_i \operatorname{E}\left[e^{t X_i}\right]

and:

\Pr(X \leq a) \leq \inf_{t < 0} e^{-ta} \prod_i \operatorname{E}\left[e^{t X_i}\right]

Specific Chernoff bounds are attained by calculating the moment-generating function \operatorname{E}\left[e^{t \cdot X_i}\right] for particular instances of the random variables X_i.
When the random variables are also identically distributed (iid), the Chernoff bound for the sum reduces to a simple rescaling of the single-variable Chernoff bound. That is, the Chernoff bound for the average of n iid variables is equivalent to the nth power of the Chernoff bound on a single variable (see Cramér's theorem).
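The following Python sketch (illustrative only; n, p, and the threshold a are arbitrary example values) applies the product form of the bound to a sum of n i.i.d. Bernoulli(p) variables and compares it with the exact binomial tail.

    import numpy as np
    from math import comb

    # Chernoff bound for X = X_1 + ... + X_n with X_i i.i.d. Bernoulli(p):
    # Pr(X >= a) <= inf_{t>0} e^{-ta} * (1 - p + p e^t)^n.

    def chernoff_binomial_upper(n, p, a, t_grid=np.linspace(1e-3, 10, 10_000)):
        log_bound = -t_grid * a + n * np.log(1 - p + p * np.exp(t_grid))
        return float(np.exp(log_bound.min()))

    def exact_binomial_tail(n, p, a):
        return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(a, n + 1))

    n, p, a = 100, 0.3, 45
    print(chernoff_binomial_upper(n, p, a))   # Chernoff upper bound
    print(exact_binomial_tail(n, p, a))       # exact tail, never larger than the bound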
See main article: Hoeffding's inequality.
Chernoff bounds may also be applied to general sums of independent, bounded random variables, regardless of their distribution; this is known as Hoeffding's inequality. The proof follows a similar approach to the other Chernoff bounds, but applying Hoeffding's lemma to bound the moment generating functions (see Hoeffding's inequality).
Hoeffding's inequality. Suppose X_1, \ldots, X_n are independent random variables taking values in [a, b]. Let X denote their sum and let \mu = \operatorname{E}[X] denote the sum's expected value. Then for any t > 0,

\Pr(X \leq \mu - t) < e^{-2t^2/(n(b-a)^2)},

\Pr(X \geq \mu + t) < e^{-2t^2/(n(b-a)^2)}.
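A small Monte Carlo sketch in Python (illustrative; the Uniform[0,1] summands, sample sizes, and deviation t are assumptions chosen for the example) comparing the empirical tail frequency with Hoeffding's bound:

    import numpy as np

    rng = np.random.default_rng(0)

    # X = sum of n independent Uniform[0, 1] variables, so a = 0, b = 1, mu = n/2.
    n, trials, t = 50, 200_000, 5.0
    samples = rng.uniform(0.0, 1.0, size=(trials, n)).sum(axis=1)
    mu = n / 2

    empirical = np.mean(samples >= mu + t)
    hoeffding = np.exp(-2 * t**2 / (n * (1 - 0)**2))

    print(empirical, hoeffding)   # the empirical frequency stays below the bound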
The bounds in the following sections for Bernoulli random variables are derived by using that, for a Bernoulli random variable X_i with probability p of being equal to 1,

\operatorname{E}\left[e^{t \cdot X_i}\right] = (1-p)e^0 + p e^t = 1 + p(e^t - 1) \leq e^{p(e^t - 1)},

where the last step uses the elementary inequality 1 + x \leq e^x.
One can encounter many flavors of Chernoff bounds: the original additive form (which gives a bound on the absolute error) or the more practical multiplicative form (which bounds the error relative to the mean).
Multiplicative Chernoff bound. Suppose X_1, \ldots, X_n are independent random variables taking values in \{0, 1\}. Let X denote their sum and let \mu = \operatorname{E}[X] denote the sum's expected value. Then for any \delta > 0,

\Pr(X \geq (1+\delta)\mu) \leq \left(\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right)^{\mu},

and for any 0 < \delta < 1,

\Pr(X \leq (1-\delta)\mu) \leq \left(\frac{e^{-\delta}}{(1-\delta)^{1-\delta}}\right)^{\mu}.
The above formula is often unwieldy in practice, so the following looser but more convenient bounds[10] are often used, which follow from the inequality \frac{2\delta}{2+\delta} \leq \log(1+\delta):

\Pr(X \geq (1+\delta)\mu) \leq e^{-\delta^2\mu/(2+\delta)}, \qquad 0 \leq \delta,

\Pr(X \leq (1-\delta)\mu) \leq e^{-\delta^2\mu/2}, \qquad 0 < \delta < 1,

\Pr(|X - \mu| \geq \delta\mu) \leq 2e^{-\delta^2\mu/3}, \qquad 0 < \delta < 1.

Notice that the bounds are trivial for \delta = 0.
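The Python sketch below (the values of \mu and \delta are arbitrary) compares the exact multiplicative upper-tail bound with the looser e^{-\delta^2\mu/(2+\delta)} form:

    import numpy as np

    def exact_multiplicative(delta, mu):
        # (e^delta / (1+delta)^(1+delta))^mu, computed in log space for stability
        return np.exp(mu * (delta - (1 + delta) * np.log1p(delta)))

    def loose_multiplicative(delta, mu):
        return np.exp(-delta**2 * mu / (2 + delta))

    mu = 20.0
    for delta in (0.1, 0.5, 1.0, 2.0):
        # the exact form is always at least as tight (never larger than the loose one)
        print(delta, exact_multiplicative(delta, mu), loose_multiplicative(delta, mu))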
In addition, based on the Taylor expansion for the Lambert W function,[11]

\Pr(X \geq R) \leq 2^{-xR}, \qquad x > 0,\ R \geq (2^x e - 1)\mu.
The following theorem is due to Wassily Hoeffding[12] and hence is called the Chernoff–Hoeffding theorem.
Chernoff–Hoeffding theorem. Suppose X_1, \ldots, X_n are i.i.d. random variables, taking values in \{0, 1\}. Let p = \operatorname{E}[X_1] and \varepsilon > 0.

\begin{align}
\Pr\left(\frac{1}{n}\sum X_i \geq p + \varepsilon\right) &\leq \left(\left(\frac{p}{p+\varepsilon}\right)^{p+\varepsilon}\left(\frac{1-p}{1-p-\varepsilon}\right)^{1-p-\varepsilon}\right)^n = e^{-D(p+\varepsilon \parallel p)\, n} \\
\Pr\left(\frac{1}{n}\sum X_i \leq p - \varepsilon\right) &\leq \left(\left(\frac{p}{p-\varepsilon}\right)^{p-\varepsilon}\left(\frac{1-p}{1-p+\varepsilon}\right)^{1-p+\varepsilon}\right)^n = e^{-D(p-\varepsilon \parallel p)\, n}
\end{align}
where

D(x \parallel y) = x \ln\frac{x}{y} + (1-x)\ln\left(\frac{1-x}{1-y}\right)
is the Kullback–Leibler divergence between Bernoulli distributed random variables with parameters x and y respectively. If p \geq \tfrac{1}{2}, then D(p+\varepsilon \parallel p) \geq \tfrac{\varepsilon^2}{2p(1-p)}, which means

\Pr\left(\frac{1}{n}\sum X_i > p + x\right) \leq \exp\left(-\frac{x^2 n}{2p(1-p)}\right).
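As an illustration of how this simplified bound is typically used (the numbers, and the helper name samples_needed, are assumptions for the example), one can solve \exp(-x^2 n/(2p(1-p))) \leq \delta for the number of samples n in Python:

    import math

    def samples_needed(p, x, delta):
        """Smallest n with exp(-x^2 n / (2 p (1-p))) <= delta (the bound above assumes p >= 1/2)."""
        return math.ceil(2 * p * (1 - p) * math.log(1 / delta) / x**2)

    # e.g. upper-tail deviation x = 0.05 from p = 0.6 with failure probability delta = 0.01
    print(samples_needed(0.6, 0.05, 0.01))   # 885 samples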
A simpler bound follows by relaxing the theorem using D(p + \varepsilon \parallel p) \geq 2\varepsilon^2, which follows from the convexity of D(p+\varepsilon \parallel p) and the fact that

\frac{d^2}{d\varepsilon^2} D(p+\varepsilon \parallel p) = \frac{1}{(p+\varepsilon)(1-p-\varepsilon)} \geq 4 = \frac{d^2}{d\varepsilon^2}\left(2\varepsilon^2\right).
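Explicitly (a restatement added for completeness), the relaxation D(p+\varepsilon \parallel p) \geq 2\varepsilon^2 turns the theorem into the bound

\Pr\left(\frac{1}{n}\sum X_i \geq p + \varepsilon\right) \leq e^{-2\varepsilon^2 n}.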
This result is a special case of Hoeffding's inequality. Sometimes, the bounds
\begin{align}
D((1+x)p \parallel p) &\geq \tfrac{1}{4}x^2 p, && -\tfrac{1}{2} \leq x \leq \tfrac{1}{2}, \\[6pt]
D(x \parallel y) &\geq \frac{3(x-y)^2}{2(2y+x)}, \\[6pt]
D(x \parallel y) &\geq \frac{(x-y)^2}{2y}, && x \leq y, \\[6pt]
D(x \parallel y) &\geq \frac{(x-y)^2}{2x}, && x \geq y
\end{align}
which are stronger for p < \tfrac{1}{8}, are also used.
Chernoff bounds have very useful applications in set balancing and packet routing in sparse networks.
The set balancing problem arises while designing statistical experiments. Typically while designing a statistical experiment, given the features of each participant in the experiment, we need to know how to divide the participants into 2 disjoint groups such that each feature is roughly as balanced as possible between the two groups.[13]
Chernoff bounds are also used to obtain tight bounds for permutation routing problems which reduce network congestion while routing packets in sparse networks.
Chernoff bounds are used in computational learning theory to prove that a learning algorithm is probably approximately correct, i.e. with high probability the algorithm has small error on a sufficiently large training data set.[14]
Chernoff bounds can be effectively used to evaluate the "robustness level" of an application/algorithm by exploring its perturbation space with randomization.[15] The use of the Chernoff bound permits one to abandon the strong—and mostly unrealistic—small perturbation hypothesis (the perturbation magnitude is small). The robustness level can be, in turn, used either to validate or reject a specific algorithmic choice, a hardware implementation or the appropriateness of a solution whose structural parameters are affected by uncertainties.
A simple and common use of Chernoff bounds is for "boosting" of randomized algorithms. If one has an algorithm that outputs a guess that is the desired answer with probability p > 1/2, then one can get a higher success rate by running the algorithm n = \log(1/\delta)\, 2p/(p-1/2)^2 times and outputting the guess returned by more than n/2 of the runs. Assuming that these runs are independent, the majority guess is correct with probability at least 1 - \delta, since by the multiplicative Chernoff bound the number X of correct runs (with \mu = np) satisfies

\Pr\left[X > \frac{n}{2}\right] \geq 1 - e^{-n(p-1/2)^2/(2p)} \geq 1 - \delta.
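A minimal Python sketch of this boosting argument (the base success probability p, the failure target \delta, and the toy base algorithm returning 42 are assumptions made for illustration):

    import math
    import random

    def boosted(base_algorithm, p, delta):
        """Run a p-correct randomized algorithm enough times and return the majority answer."""
        n = math.ceil(math.log(1 / delta) * 2 * p / (p - 0.5)**2)
        answers = [base_algorithm() for _ in range(n)]
        return max(set(answers), key=answers.count)   # majority vote

    # Toy base algorithm: returns the correct answer 42 with probability p = 0.6,
    # otherwise some wrong answer.
    p, delta = 0.6, 1e-6
    base = lambda: 42 if random.random() < p else random.randint(0, 41)
    print(boosted(base, p, delta))   # 42 with probability at least 1 - delta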
See main article: Matrix Chernoff bound.
Rudolf Ahlswede and Andreas Winter introduced a Chernoff bound for matrix-valued random variables.[17] The following version of the inequality can be found in the work of Tropp.[18]
Let M_1, \ldots, M_t be independent matrix-valued random variables such that M_i \in \mathbb{C}^{d_1 \times d_2} and \operatorname{E}[M_i] = 0. Let us denote by \lVert M \rVert the operator norm of the matrix M. If \lVert M_i \rVert \leq \gamma holds almost surely for all i \in \{1, \ldots, t\}, then for every \varepsilon > 0

\Pr\left(\left\|\frac{1}{t}\sum_{i=1}^{t} M_i\right\| > \varepsilon\right) \leq (d_1 + d_2)\exp\left(-\frac{3\varepsilon^2 t}{8\gamma^2}\right).
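A Monte Carlo sketch in Python (illustrative; the dimension d, sample count t, threshold \varepsilon, and the choice of random sign-diagonal matrices are all assumptions) checking the deviation of the averaged matrices against the stated bound:

    import numpy as np

    rng = np.random.default_rng(0)

    d, t, gamma, eps, trials = 20, 500, 1.0, 0.25, 2000

    failures = 0
    for _ in range(trials):
        # M_i: d x d diagonal matrices with i.i.d. +/-1 entries, so E[M_i] = 0 and ||M_i|| <= 1.
        signs = rng.choice([-1.0, 1.0], size=(t, d))
        avg = np.diag(signs.mean(axis=0))            # (1/t) * sum_i M_i
        failures += np.linalg.norm(avg, 2) > eps     # spectral-norm deviation exceeds eps?

    empirical = failures / trials
    bound = 2 * d * np.exp(-3 * eps**2 * t / (8 * gamma**2))   # here d1 + d2 = 2d
    print(empirical, bound)   # empirical failure rate stays below the bound (here essentially zero)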
Notice that in order to conclude that the deviation from 0 is bounded by \varepsilon with high probability, we need to choose a number of samples t proportional to the logarithm of d_1 + d_2. In general, a dependence on \log(\min(d_1, d_2)) is unavoidable: consider, for example, a random d \times d diagonal sign matrix. The operator norm of the sum of t independent samples is the maximum deviation among d independent random walks of length t, and keeping this maximum small with constant probability forces t to grow logarithmically with d.
The following theorem can be obtained by assuming M has low rank, in order to avoid the dependency on the dimensions.
Let 0 < \varepsilon < 1 and M be a random symmetric real matrix with \|\operatorname{E}[M]\| \leq 1 and \|M\| \leq \gamma almost surely. Assume that each element of the support of M has rank at most r. Set

t = \Omega\left(\frac{\gamma \log(\gamma/\varepsilon^2)}{\varepsilon^2}\right).

If r \leq t holds almost surely, then

\Pr\left(\left\|\frac{1}{t}\sum_{i=1}^{t} M_i - \operatorname{E}[M]\right\| > \varepsilon\right) \leq \frac{1}{\mathbf{poly}(t)}

where M_1, \ldots, M_t are i.i.d. copies of M.
The following variant of Chernoff's bound can be used to bound the probability that a majority in a population will become a minority in a sample, or vice versa.[20]
Suppose there is a general population A and a sub-population B ⊆ A. Mark the relative size of the sub-population (|B|/|A|) by r.
Suppose we pick an integer k and a random sample S ⊂ A of size k. Mark the relative size of the sub-population in the sample (|B∩S|/|S|) by rS.
Then, for every fraction d ∈ [0,1]:
\Pr\left(r_S < (1-d) \cdot r\right) < \exp\left(-r \cdot d^2 \cdot \frac{k}{2}\right)
In particular, if B is a majority in A (i.e. r > 0.5) we can bound the probability that B will remain a majority in S (r_S > 0.5) by taking d = 1 − 1/(2r):[21]

\Pr\left(r_S > 0.5\right) > 1 - \exp\left(-r \cdot \left(1 - \frac{1}{2r}\right)^2 \cdot \frac{k}{2}\right)
This bound is of course not tight at all. For example, when r = 0.5 we get a trivial bound Prob > 0.
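A quick numeric sketch of the two bounds above in Python (the example values of r, k, and d are arbitrary, and the helper names are chosen here for illustration):

    import math

    def minority_bound(r, k, d):
        """Upper bound on Pr(r_S < (1 - d) * r) for a sample of size k."""
        return math.exp(-r * d**2 * k / 2)

    def majority_preserved_bound(r, k):
        """Lower bound on Pr(r_S > 0.5) when B is a majority (r > 0.5), using d = 1 - 1/(2r)."""
        d = 1 - 1 / (2 * r)
        return 1 - math.exp(-r * d**2 * k / 2)

    print(minority_bound(r=0.6, k=100, d=0.2))      # about 0.30
    print(majority_preserved_bound(r=0.6, k=100))   # about 0.57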
Following the conditions of the multiplicative Chernoff bound, let X_1, \ldots, X_n be independent Bernoulli random variables, whose sum is X, each having probability p_i of being equal to 1. For a Bernoulli variable:

\operatorname{E}\left[e^{t \cdot X_i}\right] = (1-p_i)e^0 + p_i e^t = 1 + p_i(e^t - 1) \leq e^{p_i(e^t - 1)}
So, using the bound for sums of independent variables above with a = (1+\delta)\mu, \delta > 0 and \mu = \operatorname{E}[X] = \textstyle\sum_{i=1}^{n} p_i:

\begin{align}
\Pr(X > (1+\delta)\mu) &\leq \inf_{t>0} \exp(-t(1+\delta)\mu)\prod_{i=1}^{n} \operatorname{E}[\exp(tX_i)] \\[4pt]
&\leq \inf_{t>0} \exp\Bigl(-t(1+\delta)\mu + \sum_{i=1}^{n} p_i(e^t - 1)\Bigr) \\[4pt]
&= \inf_{t>0} \exp\bigl(-t(1+\delta)\mu + (e^t - 1)\mu\bigr).
\end{align}
If we simply set t = \log(1+\delta), so that t > 0 for \delta > 0, we can substitute and find

\exp\bigl(-t(1+\delta)\mu + (e^t - 1)\mu\bigr) = \frac{\exp(\delta\mu)}{(1+\delta)^{(1+\delta)\mu}} = \left[\frac{e^{\delta}}{(1+\delta)^{1+\delta}}\right]^{\mu}.
This proves the result desired.
Let q = p + \varepsilon. Taking a = nq in the bound for sums of independent variables above, we obtain:

\Pr\left(\frac{1}{n}\sum X_i \geq q\right) \leq \inf_{t>0} \frac{\operatorname{E}\left[e^{t\sum X_i}\right]}{e^{tnq}} = \inf_{t>0}\left(\frac{\operatorname{E}\left[e^{t X_i}\right]}{e^{tq}}\right)^n.
Now, knowing that \Pr(X_i = 1) = p and \Pr(X_i = 0) = 1 - p, we have

\left(\frac{\operatorname{E}\left[e^{t X_i}\right]}{e^{tq}}\right)^n = \left(\frac{p e^t + (1-p)}{e^{tq}}\right)^n = \left(p e^{(1-q)t} + (1-p)e^{-qt}\right)^n.
Therefore, we can easily compute the infimum, using calculus:

\frac{d}{dt}\left(p e^{(1-q)t} + (1-p)e^{-qt}\right) = (1-q)p e^{(1-q)t} - q(1-p)e^{-qt}
Setting the equation to zero and solving, we have

\begin{align}
(1-q)p e^{(1-q)t} &= q(1-p)e^{-qt} \\
(1-q)p e^{t} &= q(1-p)
\end{align}

so that

e^t = \frac{(1-p)q}{(1-q)p}.

Thus,

t = \log\left(\frac{(1-p)q}{(1-q)p}\right).
As q = p + \varepsilon > p, we see that t > 0, so our bound is satisfied on t. Having solved for t, we can plug back into the equations above to find that

\begin{align}
\log\left(p e^{(1-q)t} + (1-p)e^{-qt}\right) &= \log\left(e^{-qt}\left(1 - p + p e^{t}\right)\right) \\
&= \log\left(e^{-q\log\frac{(1-p)q}{(1-q)p}}\right) + \log\left(1 - p + p e^{\log\frac{1-p}{1-q}} e^{\log\frac{q}{p}}\right) \\
&= -q\log\frac{1-p}{1-q} - q\log\frac{q}{p} + \log\left(1 - p + p\left(\frac{1-p}{1-q}\right)\frac{q}{p}\right) \\
&= -q\log\frac{1-p}{1-q} - q\log\frac{q}{p} + \log\left(\frac{(1-p)(1-q)}{1-q} + \frac{(1-p)q}{1-q}\right) \\
&= -q\log\frac{q}{p} + \left(-q\log\frac{1-p}{1-q} + \log\frac{1-p}{1-q}\right) \\
&= -q\log\frac{q}{p} + (1-q)\log\frac{1-p}{1-q} \\
&= -D(q \parallel p).
\end{align}
We now have our desired result, that
\Pr\left(\tfrac{1}{n}\sum X_i \geq p + \varepsilon\right) \leq e^{-D(p+\varepsilon \parallel p)\, n}.

To complete the proof for the symmetric case, we simply define the random variable Y_i = 1 - X_i, apply the same proof, and plug it into our bound.