F-divergence
In probability theory, an f-divergence is a certain type of function $D_f(P\parallel Q)$ that measures the difference between two probability distributions $P$ and $Q$. Many common divergences, such as the KL-divergence, the Hellinger distance, and the total variation distance, are special cases of f-divergence.
History
These divergences were introduced by Alfréd Rényi[1] in the same paper where he introduced the well-known Rényi entropy. He proved that these divergences decrease in Markov processes. f-divergences were studied further independently by Csiszár (1963), Morimoto (1963), and Ali & Silvey (1966), and are sometimes known as Csiszár f-divergences, Csiszár–Morimoto divergences, or Ali–Silvey distances.
Definition
Non-singular case
Let $P$ and $Q$ be two probability distributions over a space $\Omega$, such that $P \ll Q$, that is, $P$ is absolutely continuous with respect to $Q$. Then, for a convex function $f:[0,+\infty)\to(-\infty,+\infty]$ such that $f(x)$ is finite for all $x>0$, $f(1)=0$, and $f(0)=\lim_{t\to 0^+} f(t)$ (which could be infinite), the f-divergence of $P$ from $Q$ is defined as

$$D_f(P\parallel Q) \equiv \int_\Omega f\!\left(\frac{dP}{dQ}\right)\,dQ.$$
We call $f$ the generator of $D_f$.
In concrete applications, there is usually a reference distribution $\mu$ on $\Omega$ (for example, when $\Omega=\mathbb{R}^n$, the reference distribution is the Lebesgue measure) such that $P, Q \ll \mu$. Then we can use the Radon–Nikodym theorem to take their probability densities $p$ and $q$, giving

$$D_f(P\parallel Q) = \int_\Omega f\!\left(\frac{p(x)}{q(x)}\right) q(x)\,d\mu(x).$$
When there is no such reference distribution ready at hand, we can simply define $\mu = \tfrac{1}{2}(P+Q)$ and proceed as above. This is a useful technique in more abstract proofs.
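For a discrete sample space the defining integral becomes a finite sum, which makes the definition easy to check numerically. A minimal Python sketch (the function name and example numbers are ours, not from the text):

```python
import numpy as np

def f_divergence(p, q, f):
    """D_f(P||Q) = sum_i q_i * f(p_i / q_i) for discrete distributions.

    Assumes P << Q, i.e. q_i > 0 wherever p_i > 0.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = q > 0                     # terms with q_i = 0 need the extension below
    return np.sum(q[mask] * f(p[mask] / q[mask]))

# Example: the KL-divergence is generated by f(t) = t * ln(t).
p = np.array([0.2, 0.5, 0.3])
q = np.array([0.1, 0.6, 0.3])
kl = f_divergence(p, q, lambda t: t * np.log(t))
print(kl, np.sum(p * np.log(p / q)))   # the two numbers agree
```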
The above definition can be extended to cases where $P \ll Q$ is no longer satisfied (Definition 7.1 of [2]).

Since $f$ is convex and $f(1)=0$, the function $x\mapsto \frac{f(x)}{x-1}$ must be non-decreasing, so there exists $f'(\infty):=\lim_{x\to\infty} f(x)/x$, taking value in $(-\infty,+\infty]$.

Since for any $p(x)>0$ we have $\lim_{q(x)\to 0^+} q(x)\,f\!\left(\frac{p(x)}{q(x)}\right) = p(x)\,f'(\infty)$, we can extend the f-divergence to the case where $P$ is not absolutely continuous with respect to $Q$.
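Continuing the sketch above, the extension amounts to adding the term $f'(\infty)\cdot P(\{q=0\})$ for the mass of $P$ living where $Q$ vanishes. A hypothetical helper, assuming $f'(\infty)$ is supplied by the caller:

```python
import numpy as np

def f_divergence_extended(p, q, f, f_prime_inf):
    """f-divergence without assuming P << Q.

    f_prime_inf = lim_{x -> inf} f(x)/x, possibly np.inf.
    """
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = q > 0
    finite_part = np.sum(q[mask] * f(p[mask] / q[mask]))
    singular_mass = np.sum(p[~mask])            # P-mass where Q vanishes
    if singular_mass == 0:
        return finite_part
    return finite_part + singular_mass * f_prime_inf

# Total variation: f(t) = |t - 1| / 2, so f'(inf) = 1/2.
p = np.array([0.5, 0.5, 0.0])
q = np.array([0.5, 0.0, 0.5])
tv = f_divergence_extended(p, q, lambda t: abs(t - 1) / 2, 0.5)
print(tv)   # 0.5, matching (1/2) * sum_i |p_i - q_i|
```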
Properties
Basic relations between f-divergences
- Linearity: $D_{\sum_i a_i f_i} = \sum_i a_i D_{f_i}$, given a finite sequence of nonnegative real numbers $a_i$ and generators $f_i$.
- $D_f = D_g$ if and only if $f(x) = g(x) + c(x-1)$ for some $c \in \mathbb{R}$ (see the computation below).
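The "if" direction of the second relation is a one-line computation: the likelihood ratio has $Q$-mean one, so the added term integrates to zero,

$$\int_\Omega c\left(\frac{dP}{dQ}-1\right)dQ = c\,\big(P(\Omega)-Q(\Omega)\big) = 0, \qquad\text{hence}\qquad D_{f+c\,(t-1)}(P\parallel Q) = D_f(P\parallel Q).$$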
Basic properties of f-divergences
In particular, the monotonicity implies that if a Markov process has a positive equilibrium probability distribution $P^*$, then $D_f(P(t)\parallel P^*)$ is a monotonic (non-increasing) function of time, where the probability distribution $P(t)$ is a solution of the Kolmogorov forward equations (or master equation), used to describe the time evolution of the probability distribution in the Markov process. This means that all f-divergences $D_f(P(t)\parallel P^*)$ are Lyapunov functions of the Kolmogorov forward equations. The converse statement is also true: if $H(P)$ is a Lyapunov function for all Markov chains with positive equilibrium $P^*$ and is of the trace form ($H(P)=\sum_i f(p_i, p_i^*)$), then $H(P)=D_f(P\parallel P^*)$ for some convex function $f$.[3][4] For example, Bregman divergences in general do not have this property and can increase in Markov processes.[5]
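As a quick illustration of the monotonicity claim, the sketch below evolves a distribution under a fixed transition matrix with a positive stationary distribution and checks that the KL-divergence to equilibrium never increases. Discrete time and the particular matrix are our own illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

# A random row-stochastic transition matrix with strictly positive entries.
K = rng.random((4, 4)) + 0.1
K /= K.sum(axis=1, keepdims=True)

# Its stationary distribution (left eigenvector for eigenvalue 1).
w, v = np.linalg.eig(K.T)
pi = np.real(v[:, np.argmin(np.abs(w - 1))])
pi /= pi.sum()

def kl(p, q):
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.1, 0.1, 0.1])       # arbitrary initial distribution
values = []
for _ in range(20):
    values.append(kl(p, pi))
    p = p @ K                              # one step of the forward evolution

assert all(a >= b - 1e-12 for a, b in zip(values, values[1:]))  # non-increasing
print(values[:3], values[-1])
```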
Analytic properties
The f-divergences can be expressed using Taylor series and rewritten using a weighted sum of chi-type distances (Nielsen & Nock 2013).
Variational representations
Let $f^*$ be the convex conjugate of $f$. Let $\operatorname{effdom}(f^*)$ be the effective domain of $f^*$, that is, $\operatorname{effdom}(f^*)=\{y: f^*(y)<\infty\}$. Then we have two variational representations of $D_f$, which we describe below.
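For example, for the KL generator $f(t)=t\ln t$ (used again below), a direct computation gives

$$f^*(y)=\sup_{t\ge 0}\,\{ty - t\ln t\} = e^{y-1},$$

with the supremum attained at $t=e^{y-1}$; in particular $\operatorname{effdom}(f^*)=\mathbb{R}$ in this case.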
Basic variational representation
Under the above setup,

$$D_f(P\parallel Q) = \sup_{g:\,\Omega\to\operatorname{effdom}(f^*)} \operatorname{E}_P[g(X)] - \operatorname{E}_Q[f^*(g(X))].$$

This is Theorem 7.24 in [2].
Example applications
Using this theorem on total variation distance, with generator $f(x)=\tfrac{1}{2}|x-1|$, its convex conjugate is

$$f^*(x^*)=\begin{cases} x^* & \text{on } [-1/2,\,1/2],\\ +\infty & \text{else,}\end{cases}$$

and we obtain

$$\mathrm{TV}(P\parallel Q) = \sup_{|g|\le 1/2} \operatorname{E}_P[g(X)] - \operatorname{E}_Q[g(X)].$$
For the chi-squared divergence, defined by $f(x)=(x-1)^2$, we obtain

$$\chi^2(P;Q) = \sup_{g} \operatorname{E}_P[g(X)] - \operatorname{E}_Q\!\left[g(X) + \tfrac{1}{4}g(X)^2\right].$$
Since the variation term is not affine-invariant in $g$, even though the domain over which $g$ varies is affine-invariant, we can use up the affine-invariance to obtain a leaner expression. Replacing $g$ by $a\,g + b$ and taking the maximum over $a, b \in \mathbb{R}$, we obtain

$$\chi^2(P;Q) = \sup_{g} \frac{\left(\operatorname{E}_P[g(X)] - \operatorname{E}_Q[g(X)]\right)^2}{\operatorname{Var}_Q[g(X)]},$$
which is just a few steps away from the Hammersley–Chapman–Robbins bound and the Cramér–Rao bound (Theorem 29.1 and its corollary in [2]).
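As a sanity check on the leaner expression, for any fixed test function $g$ the ratio $(\operatorname{E}_P[g]-\operatorname{E}_Q[g])^2/\operatorname{Var}_Q[g]$ lower-bounds the $\chi^2$-divergence. A small numerical sketch, with distributions and $g$ chosen arbitrarily by us:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

chi2 = np.sum((p - q) ** 2 / q)          # chi-squared divergence, f(t) = (t - 1)^2

g = np.array([1.0, -2.0, 0.5])           # an arbitrary test function
gap = np.dot(p, g) - np.dot(q, g)
var_q = np.dot(q, g ** 2) - np.dot(q, g) ** 2
print(gap ** 2 / var_q, "<=", chi2)      # the bound holds for every g
```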
For the $\alpha$-divergence with $\alpha\in(-\infty,0)\cup(0,1)$, we have

$$f_\alpha(x) = \frac{x^{\alpha} - \alpha x - (1-\alpha)}{\alpha(\alpha-1)},$$

with range $x\in[0,\infty)$. Its convex conjugate is $f_\alpha^*(y) = \frac{1}{\alpha}\left(x(y)^{\alpha}-1\right)$ with range $y\in(-\infty,(1-\alpha)^{-1})$, where $x(y) = \left(1-(1-\alpha)y\right)^{\frac{1}{\alpha-1}}$.
Applying this theorem yields, after a suitable substitution, a constrained variational representation of $D_\alpha(P\parallel Q)$, or, releasing the constraint on the test function, an unconstrained one.
Specializing the parameter recovers the variational representation of the $\chi^2$-divergence obtained above.
The domain over which $g$ varies is not affine-invariant in general, unlike in the $\chi^2$-divergence case. The $\chi^2$-divergence is special, since in that case the conjugate $y+y^2/4$ is finite everywhere, so the constraint on $g$ can be removed from the supremum.
For general $\alpha\in(-\infty,0)\cup(0,1)$, the domain over which $g$ varies is merely scale-invariant. Similar to above, we can replace $g$ by $a\,g$ and take the minimum over $a>0$ to obtain a leaner expression.
Setting $\alpha=\tfrac{1}{2}$ and performing a further substitution yields two variational representations of the squared Hellinger distance.
Applying this theorem to the KL-divergence, defined by $f(x)=x\ln x$ with $f^*(y)=e^{y-1}$, yields

$$D_{\mathrm{KL}}(P\parallel Q) = \sup_{g} \operatorname{E}_P[g(X)] - e^{-1}\operatorname{E}_Q\!\left[e^{g(X)}\right].$$

This is strictly less efficient than the Donsker–Varadhan representation

$$D_{\mathrm{KL}}(P\parallel Q) = \sup_{g} \operatorname{E}_P[g(X)] - \ln \operatorname{E}_Q\!\left[e^{g(X)}\right].$$

This defect is fixed by the next theorem.
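To see the inefficiency concretely, the sketch below evaluates both lower bounds at the same test function $g$ for a pair of discrete distributions (all numbers are illustrative). Since $\ln x \le x/e$, the Donsker–Varadhan objective dominates the naive one for every $g$:

```python
import numpy as np

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])
kl = np.sum(p * np.log(p / q))           # true KL value

g = np.log(p / q)                        # a test function (optimal for DV)

naive = np.dot(p, g) - np.exp(-1) * np.dot(q, np.exp(g))   # E_P[g] - e^{-1} E_Q[e^g]
dv    = np.dot(p, g) - np.log(np.dot(q, np.exp(g)))        # E_P[g] - ln E_Q[e^g]

print(naive, "<=", dv, "<=", kl)         # DV attains KL at g = log(dP/dQ)
```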
Improved variational representation
Assume the setup in the beginning of this section ("Variational representations"). This is Theorem 7.25 in [2].
Example applications
Applying this theorem to KL-divergence yields the Donsker–Varadhan representation.
Attempting to apply this theorem to the general $\alpha$-divergence with $\alpha\in(-\infty,0)\cup(0,1)$ does not yield a closed-form solution.
Common examples of f-divergences
The following table lists many of the common divergences between probability distributions and the possible generating functions to which they correspond. Notably, except for total variation distance, all others are special cases of $\alpha$-divergence, or linear sums of $\alpha$-divergences.

For each f-divergence $D_f$, its generating function is not uniquely defined, but only up to $c\cdot(t-1)$, where $c$ is any real constant. That is, for any $f$ that generates an f-divergence, we have $D_{f(t)} = D_{f(t)+c(t-1)}$. This freedom is not only convenient, but actually necessary.
| Divergence | Corresponding f(t) | Discrete Form |
|---|---|---|
| $\chi^{\alpha}$-divergence, $\alpha\ge 1$ | $\lvert t-1\rvert^{\alpha}$ | $\sum_i \left\lvert \frac{p_i-q_i}{q_i}\right\rvert^{\alpha} q_i$ |
| Total variation distance ($\mathrm{TV}(P,Q)$) | $\tfrac{1}{2}\lvert t-1\rvert$ | $\tfrac{1}{2}\sum_i \lvert p_i-q_i\rvert$ |
| $\alpha$-divergence | $\begin{cases}\frac{t^{\alpha}-\alpha t-(1-\alpha)}{\alpha(\alpha-1)} & \text{if }\alpha\neq 0,\ \alpha\neq 1,\\ t\ln t - t + 1 & \text{if }\alpha=1,\\ -\ln t + t - 1 & \text{if }\alpha=0\end{cases}$ |  |
| KL-divergence ($D_{\mathrm{KL}}(P\parallel Q)$) | $t\ln t$ | $\sum_i p_i\ln\frac{p_i}{q_i}$ |
| reverse KL-divergence ($D_{\mathrm{KL}}(Q\parallel P)$) | $-\ln t$ | $\sum_i q_i\ln\frac{q_i}{p_i}$ |
| Jensen–Shannon divergence | $\tfrac{1}{2}\left(t\ln t-(t+1)\ln\frac{t+1}{2}\right)$ | $\tfrac{1}{2}\sum_i\left(p_i\ln\frac{2p_i}{p_i+q_i}+q_i\ln\frac{2q_i}{p_i+q_i}\right)$ |
| Jeffreys divergence (KL + reverse KL) | $(t-1)\ln t$ | $\sum_i (p_i-q_i)\ln\frac{p_i}{q_i}$ |
| squared Hellinger distance ($H^2(P,Q)$) | $\tfrac{1}{2}\left(\sqrt{t}-1\right)^2$ | $\tfrac{1}{2}\sum_i\left(\sqrt{p_i}-\sqrt{q_i}\right)^2 = 1-\sum_i\sqrt{p_i q_i}$ |
| Pearson $\chi^2$-divergence (rescaling of the $\alpha$-divergence with $\alpha=2$) | $(t-1)^2$ | $\sum_i \frac{(p_i-q_i)^2}{q_i}$ |
| Neyman $\chi^2$-divergence (reverse Pearson) (rescaling of the $\alpha$-divergence with $\alpha=-1$) | $\frac{(1-t)^2}{t}$ | $\sum_i \frac{(p_i-q_i)^2}{p_i}$ |
Let $f_\alpha$ be the generator of the $\alpha$-divergence. Then $f_\alpha$ and $f_{1-\alpha}$ are convex inversions of each other, so $D_\alpha(P\parallel Q) = D_{1-\alpha}(Q\parallel P)$. In particular, this shows that the squared Hellinger distance and the Jensen–Shannon divergence are symmetric.
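The skew-symmetry $D_\alpha(P\parallel Q) = D_{1-\alpha}(Q\parallel P)$ is easy to check numerically with the discrete formula; a small sketch (the helper name and test distributions are ours):

```python
import numpy as np

def alpha_div(p, q, a):
    """alpha-divergence with generator (t^a - a*t - (1 - a)) / (a*(a - 1))."""
    t = p / q
    return np.sum(q * (t**a - a * t - (1 - a)) / (a * (a - 1)))

p = np.array([0.2, 0.5, 0.3])
q = np.array([0.4, 0.4, 0.2])

for a in (0.3, 0.5, -1.0, 2.0):
    assert np.isclose(alpha_div(p, q, a), alpha_div(q, p, 1 - a))

# alpha = 1/2 is (a rescaling of) the squared Hellinger distance, hence symmetric:
print(alpha_div(p, q, 0.5), alpha_div(q, p, 0.5))
```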
In the literature, the $\alpha$-divergences are sometimes parametrized as

$$f_\alpha(t) = \begin{cases}\frac{4}{1-\alpha^{2}}\left(1-t^{(1+\alpha)/2}\right), & \text{if }\alpha\neq\pm 1,\\ t\ln t, & \text{if }\alpha=1,\\ -\ln t, & \text{if }\alpha=-1,\end{cases}$$

which is equivalent to the parametrization on this page by substituting $\alpha\leftarrow\frac{1+\alpha}{2}$.
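Concretely, writing $a:=\tfrac{1+\alpha}{2}$ for the parameter used on this page, a short computation (ours) shows that the two generators differ only by a multiple of $(t-1)$, so by the equivalence property above they generate the same divergence:

$$\frac{t^{a}-a\,t-(1-a)}{a(a-1)} \;=\; \frac{4}{1-\alpha^{2}}\left(1-t^{(1+\alpha)/2}\right) + \frac{2}{1-\alpha}\,(t-1).$$

The limiting cases behave the same way: $t\ln t - t + 1$ and $-\ln t + t - 1$ differ from $t\ln t$ and $-\ln t$ by $\mp(t-1)$.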
Relations to other statistical divergences
Here, we compare f-divergences with other statistical divergences.
Rényi divergence
The Rényi divergences is a family of divergences defined by
R\alpha(P\|Q)=
\right)\alpha\right]
)
when
\alpha\in(0,1)\cup(1,+infty)
. It is extended to the cases of
by taking the limit.
Simple algebra shows that
R\alpha(P\|Q)=
ln(1+\alpha(\alpha-1)D\alpha(P\|Q))
, where
is the
-divergence defined above.
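The "simple algebra" is just the observation that the $\alpha$-divergence integral reduces to a single moment of the likelihood ratio:

$$D_\alpha(P\parallel Q) = \int_\Omega \frac{\left(\frac{dP}{dQ}\right)^{\alpha}-\alpha\frac{dP}{dQ}-(1-\alpha)}{\alpha(\alpha-1)}\,dQ = \frac{\operatorname{E}_Q\!\left[\left(\frac{dP}{dQ}\right)^{\alpha}\right]-1}{\alpha(\alpha-1)},$$

so $\operatorname{E}_Q\!\left[(dP/dQ)^{\alpha}\right] = 1+\alpha(\alpha-1)\,D_\alpha(P\parallel Q)$, and substituting into the definition of $R_\alpha$ gives the stated identity.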
Bregman divergence
The only f-divergence that is also a Bregman divergence is the KL divergence.[6]
Integral probability metrics
The only f-divergence that is also an integral probability metric is the total variation.[7]
Financial interpretation
A pair of probability distributions can be viewed as a game of chance in which one of the distributions defines the official odds and the other contains the actual probabilities. Knowledge of the actual probabilities allows a player to profit from the game. For a large class of rational players the expected profit rate has the same general form as the ƒ-divergence.[8]
References
- Csiszár, I. (1963). "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten". Magyar Tud. Akad. Mat. Kutató Int. Közl. 8: 85–108.
- Morimoto, T. (1963). "Markov processes and the H-theorem". J. Phys. Soc. Jpn. 18 (3): 328–331. doi:10.1143/JPSJ.18.328.
- Ali, S. M.; Silvey, S. D. (1966). "A general class of coefficients of divergence of one distribution from another". Journal of the Royal Statistical Society, Series B. 28 (1): 131–142. JSTOR 2984279.
- Csiszár, I. (1967). "Information-type measures of difference of probability distributions and indirect observation". Studia Scientiarum Mathematicarum Hungarica. 2: 229–318.
- Csiszár, I.; Shields, P. (2004). "Information Theory and Statistics: A Tutorial". Foundations and Trends in Communications and Information Theory. 1 (4): 417–528. doi:10.1561/0100000004.
- Liese, F.; Vajda, I. (2006). "On divergences and informations in statistics and information theory". IEEE Transactions on Information Theory. 52 (10): 4394–4412. doi:10.1109/TIT.2006.881731.
- Nielsen, F.; Nock, R. (2013). "On the Chi square and higher-order Chi distances for approximating f-divergences". IEEE Signal Processing Letters. 21 (1): 10–13. arXiv:1309.3029. doi:10.1109/LSP.2013.2288355.
- Coeurjolly, J-F.; Drouilhet, R. (2006). "Normalized information-based divergences". arXiv:math/0604246.
Notes and References
- [1] Rényi, Alfréd (1961). "On measures of entropy and information". The 4th Berkeley Symposium on Mathematics, Statistics and Probability, 1960. Berkeley, CA: University of California Press. pp. 547–561. Eq. (4.20).
- [2] Polyanskiy, Yury; Wu, Yihong (2022). Information Theory: From Coding to Learning (draft of October 20, 2022). Cambridge University Press. Archived 2023-02-01: https://web.archive.org/web/20230201222557/https://people.lids.mit.edu/yp/homepage/data/itbook-export.pdf
- [3] Gorban, Pavel A. (15 October 2003). "Monotonically equivalent entropies and solution of additivity equation". Physica A. 328 (3–4): 380–390. arXiv:cond-mat/0304131. doi:10.1016/S0378-4371(03)00578-8.
- [4] Amari, Shun'ichi (2009). "Divergence, Optimization, Geometry". In Leung, C. S.; Lee, M.; Chan, J. H. (eds.), The 16th International Conference on Neural Information Processing (ICONIP 2009), Bangkok, Thailand, 1–5 December 2009. Lecture Notes in Computer Science, vol. 5863. Berlin, Heidelberg: Springer. pp. 185–193. doi:10.1007/978-3-642-10677-4_21.
- [5] Gorban, Alexander N. (29 April 2014). "General H-theorem and Entropies that Violate the Second Law". Entropy. 16 (5): 2408–2432. arXiv:1212.6767. doi:10.3390/e16052408.
- [6] Jiao, Jiantao; Courtade, Thomas; No, Albert; Venkat, Kartik; Weissman, Tsachy (December 2014). "Information Measures: the Curious Case of the Binary Alphabet". IEEE Transactions on Information Theory. 60 (12): 7616–7626. arXiv:1404.6810. doi:10.1109/TIT.2014.2360184.
- [7] Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. (2009). "On integral probability metrics, φ-divergences and binary classification". arXiv:0901.2698.
- [8] Soklakov, Andrei N. (2020). "Economics of Disagreement—Financial Intuition for the Rényi Divergence". Entropy. 22 (8): 860. arXiv:1811.08308. doi:10.3390/e22080860.