Integral probability metric explained
In probability theory, integral probability metrics are types of distance functions between probability distributions, defined by how well a class of functions can distinguish the two distributions. Many important statistical distances are integral probability metrics, including the Wasserstein-1 distance and the total variation distance. In addition to theoretical importance, integral probability metrics are widely used in areas of statistics and machine learning.
The name "integral probability metric" was given by German statistician Alfred Müller;[1] the distances had also previously been called "metrics with a -structure."[2]
Definition
Integral probability metrics (IPMs) are distances on the space of distributions over a set $\mathcal X$, defined by a class $\mathcal F$ of real-valued functions on $\mathcal X$ as

$D_{\mathcal F}(P, Q) = \sup_{f \in \mathcal F} \big| \mathbb{E}_{X \sim P} f(X) - \mathbb{E}_{Y \sim Q} f(Y) \big| = \sup_{f \in \mathcal F} | P f - Q f |;$

here the notation $P f$ refers to the expectation of $f$ under the distribution $P$. The absolute value in the definition is unnecessary, and often omitted, for the usual case where for every $f \in \mathcal F$ its negation $-f$ is also in $\mathcal F$.
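To make the supremum concrete, here is a minimal numerical sketch (an added illustration, not part of the original article; the three-point sample space and the particular finite class $\mathcal F$ are arbitrary choices) that evaluates $|Pf - Qf|$ for each $f$ in a small finite class and takes the maximum, using NumPy.

```python
# Plug-in evaluation of D_F(P, Q) = sup_{f in F} |P f - Q f| for two discrete
# distributions on the points {0, 1, 2}. The class F is a hypothetical finite
# set chosen only so that the supremum becomes a simple maximum.
import numpy as np

points = np.array([0.0, 1.0, 2.0])
P = np.array([0.2, 0.5, 0.3])   # probability masses of P on `points`
Q = np.array([0.4, 0.4, 0.2])   # probability masses of Q on `points`

F = [
    lambda x: x,                      # identity
    lambda x: np.sign(x - 0.5),       # a step function
    lambda x: np.clip(x, 0.0, 1.0),   # a clipped (1-Lipschitz) function
]

def ipm(P, Q, F, points):
    """IPM between two discrete distributions for a finite function class F."""
    gaps = [abs(np.dot(P, f(points)) - np.dot(Q, f(points))) for f in F]
    return max(gaps)

print(ipm(P, Q, F, points))
```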
The functions $f$ being optimized over are sometimes called "critic" functions;[3] if a particular $f^*$ achieves the supremum, it is often termed a "witness function"[4] (it "witnesses" the difference in the distributions). These functions try to have large values for samples from $P$ and small (likely negative) values for samples from $Q$; this can be thought of as a weaker version of classifiers, and indeed IPMs can be interpreted as the optimal risk of a particular classifier.
The choice of $\mathcal F$ determines the particular distance; more than one $\mathcal F$ can generate the same distance.[1] For any choice of $\mathcal F$, $D_{\mathcal F}$ satisfies all the definitions of a metric except that we may have $D_{\mathcal F}(P, Q) = 0$ for some $P \ne Q$; this is variously termed a "pseudometric" or a "semimetric" depending on the community. For instance, using the class $\mathcal F = \{0\}$ which only contains the zero function, $D_{\mathcal F}$ is identically zero. $D_{\mathcal F}$ is a metric if and only if $\mathcal F$ separates points on the space of probability distributions, i.e. for any $P \ne Q$ there is some $f \in \mathcal F$ such that $P f \ne Q f$;[1] most, but not all, common particular cases satisfy this property.
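As an added illustration of the separation condition (not from the original article): if $\mathcal F$ contains only the identity function $f(x) = x$, then $D_{\mathcal F}$ compares only means, so two different distributions with equal means are at distance zero and $D_{\mathcal F}$ is merely a pseudometric.

```python
# With F = {identity}, D_F(P, Q) = |E_P X - E_Q X|: two different distributions
# on {-1, 0, 1} with the same mean get distance zero, so F fails to separate points.
import numpy as np

points = np.array([-1.0, 0.0, 1.0])
P = np.array([0.25, 0.50, 0.25])   # mean 0
Q = np.array([0.50, 0.00, 0.50])   # also mean 0, but a different distribution
print(abs(np.dot(P, points) - np.dot(Q, points)))  # prints 0.0
```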
Examples
All of these examples are metrics except when noted otherwise.
- The Wasserstein-1 (Kantorovich–Rubinstein) distance is generated, via its dual representation, by $\mathcal F$ the set of 1-Lipschitz functions.
- The related Dudley metric is generated by the set of bounded 1-Lipschitz functions.
- The total variation distance can be generated by $\mathcal F = \{\mathbf 1_A : A\ \text{an event}\}$, so that $\mathcal F$ is the set of indicator functions of arbitrary events, or by the larger class $\mathcal F = \{f : \|f\|_\infty \le 1\}$.
- The closely related Radon metric is generated by continuous functions taking values in $[-1, 1]$.
- The Kolmogorov metric used in the Kolmogorov–Smirnov test has a function class of indicator functions of half-lines, $\mathcal F = \{\mathbf 1_{(-\infty, t]} : t \in \mathbb R\}$.
- The kernel maximum mean discrepancy (MMD) has $\mathcal F$ the unit ball in a reproducing kernel Hilbert space. This distance is particularly easy to estimate from samples, requiring no optimization (a sketch of such an estimator appears after this list); it is a proper metric exactly when the underlying kernel is characteristic.[5]
- The energy distance, as a special case of the maximum mean discrepancy,[6] is generated by the unit ball in a particular reproducing kernel Hilbert space.
- Defining $\mathcal F$ by functions with a bounded Sobolev norm gives a useful distance for generative modeling, among other applications.[7]
- Taking $\mathcal F$ to be a class of neural networks gives a "neural net distance," as used in generative adversarial networks; these are not metrics for typical fixed-size networks, but could be for other classifiers. For Wasserstein GANs in particular, it has been argued that analysis in terms of this distance, and not the Wasserstein distance they approximate, is very important to the behavior of these models.[12][14][15]
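As a sketch of the MMD estimator mentioned in the list above (an added illustration; the Gaussian kernel and its bandwidth are assumed choices, and the version shown is the standard biased plug-in estimator), the squared MMD reduces to sample averages of pairwise kernel evaluations, with no optimization over critic functions:

```python
# Biased (plug-in) estimate of the squared kernel MMD between two sample sets,
# with a Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)):
# MMD^2(P, Q) = E k(X, X') - 2 E k(X, Y) + E k(Y, Y'), estimated by sample means.
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    # Pairwise squared distances between rows of X and rows of Y.
    d2 = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
    return np.exp(-d2 / (2 * sigma**2))

def mmd2_biased(X, Y, sigma=1.0):
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    return Kxx.mean() - 2 * Kxy.mean() + Kyy.mean()

rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 2))   # samples from P
Y = rng.normal(0.5, 1.0, size=(500, 2))   # samples from Q (shifted mean)
print(mmd2_biased(X, Y))                  # noticeably > 0 since P != Q
```

An unbiased version of this estimator simply omits the diagonal terms of $K_{xx}$ and $K_{yy}$ from the averages.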
Relationship to f-divergences
The f-divergences are probably the best-known way to measure dissimilarity of probability distributions. It has been shown that the only functions which are both IPMs and f-divergences are of the form $c \cdot D_{\mathrm{TV}}(P, Q)$, where $c \in [0, \infty]$ and $D_{\mathrm{TV}}$ is the total variation distance between distributions.
One major difference between f-divergences and most IPMs is that when $P$ and $Q$ have disjoint support, all f-divergences take on a constant value;[16] by contrast, IPMs where the functions in $\mathcal F$ are "smooth" can give "partial credit." For instance, consider the sequence $\delta_{1/n}$ of Dirac measures at $1/n$; this sequence converges in distribution to $\delta_0$, and many IPMs satisfy $D_{\mathcal F}(\delta_{1/n}, \delta_0) \to 0$, but no nonzero f-divergence can satisfy this. That is, many IPMs are continuous in weaker topologies than f-divergences. This property is sometimes of substantial importance,[17] although other options also exist, such as considering f-divergences between distributions convolved with continuous noise.[17][18]
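A small numerical check of this point (an added illustration; it assumes SciPy's `scipy.stats.wasserstein_distance` is available): the Wasserstein-1 IPM between $\delta_{1/n}$ and $\delta_0$ shrinks like $1/n$, while the total variation distance, an f-divergence, stays at its maximal value because the supports are disjoint for every $n$.

```python
# Wasserstein-1 vs. total variation for the Dirac sequence delta_{1/n} vs. delta_0.
from scipy.stats import wasserstein_distance

for n in (1, 10, 100, 1000):
    w1 = wasserstein_distance([1.0 / n], [0.0])  # W1 between two point masses
    tv = 1.0  # disjoint supports: sup_A |P(A) - Q(A)| = 1 for every n
    print(f"n={n:4d}  W1={w1:.4f}  TV={tv}")
```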
Estimation from samples
Because IPM values between discrete distributions are often sensible, it is often reasonable to estimate $D_{\mathcal F}(P, Q)$ using a simple "plug-in" estimator, $D_{\mathcal F}(\hat P, \hat Q)$, where $\hat P$ and $\hat Q$ are empirical measures of the sample sets. These empirical distances can be computed exactly for some classes $\mathcal F$;[19] estimation quality varies depending on the distance, but can be minimax-optimal in certain settings.[20][21]
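For instance, for the Kolmogorov metric the plug-in estimate is exactly the two-sample Kolmogorov–Smirnov statistic, the largest gap between the two empirical CDFs; a minimal sketch (an added illustration using NumPy) follows.

```python
# Exact plug-in estimate of the Kolmogorov metric D_F(P_hat, Q_hat): the largest
# gap between the two empirical CDFs, evaluated at the pooled sample points.
import numpy as np

def kolmogorov_ipm(x, y):
    grid = np.sort(np.concatenate([x, y]))             # points where the ECDFs jump
    ecdf_x = np.searchsorted(np.sort(x), grid, side="right") / len(x)
    ecdf_y = np.searchsorted(np.sort(y), grid, side="right") / len(y)
    return np.max(np.abs(ecdf_x - ecdf_y))

rng = np.random.default_rng(0)
print(kolmogorov_ipm(rng.normal(size=500), rng.normal(0.5, 1.0, size=500)))
```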
When exact maximization is not available or is too expensive, another commonly used scheme is to divide the samples into "training" sets (with empirical measures $\hat P_{\mathrm{train}}$ and $\hat Q_{\mathrm{train}}$) and "test" sets ($\hat P_{\mathrm{test}}$ and $\hat Q_{\mathrm{test}}$), find $\hat f$ approximately maximizing $|\hat P_{\mathrm{train}} f - \hat Q_{\mathrm{train}} f|$, then use $|\hat P_{\mathrm{test}} \hat f - \hat Q_{\mathrm{test}} \hat f|$ as an estimate.[22][11][23][24] This estimator can possibly be consistent, but has a negative bias. In fact, no unbiased estimator can exist for any IPM, although there is for instance an unbiased estimator of the squared maximum mean discrepancy.
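The following sketch (an added illustration, not the article's method) applies the data-splitting scheme above with a deliberately simple, hypothetical function class, the linear critics $f_w(x) = w^\top x$ with $\|w\| \le 1$: on this class the training maximizer has a closed form (the unit vector along the difference of training means), and the held-out halves give the estimate $|\hat P_{\mathrm{test}} \hat f - \hat Q_{\mathrm{test}} \hat f|$.

```python
# Data-splitting IPM estimate with the (hypothetical) class of linear critics
# f_w(x) = <w, x>, ||w|| <= 1. The training maximizer is the unit vector along
# the difference of the training sample means; it is then scored on the test halves.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(1000, 3))   # samples from P
Y = rng.normal(0.3, 1.0, size=(1000, 3))   # samples from Q (shifted mean)

X_train, X_test = X[:500], X[500:]
Y_train, Y_test = Y[:500], Y[500:]

# "Training" step: pick the critic maximizing |P_train f - Q_train f|.
diff = X_train.mean(axis=0) - Y_train.mean(axis=0)
w_hat = diff / np.linalg.norm(diff)

# "Test" step: evaluate the chosen critic on the held-out halves.
estimate = abs((X_test @ w_hat).mean() - (Y_test @ w_hat).mean())
print(estimate)  # compare with the population value ||0.3 * (1, 1, 1)|| ≈ 0.52
```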
Notes and References
1. Müller, Alfred (1997). "Integral Probability Metrics and Their Generating Classes of Functions". Advances in Applied Probability 29 (2): 429–443. doi:10.2307/1428011.
2. Zolotarev, V. M. (1984). "Probability Metrics". Theory of Probability & Its Applications 28 (2): 278–302. doi:10.1137/1128025.
3. Arjovsky, Martin; Chintala, Soumith; Bottou, Léon (2017). "Wasserstein Generative Adversarial Networks". International Conference on Machine Learning. PMLR: 214–223.
4. Gretton, Arthur; Borgwardt, Karsten M.; Rasch, Malte J.; Schölkopf, Bernhard; Smola, Alexander (2012). "A Kernel Two-Sample Test". Journal of Machine Learning Research 13: 723–773.
5. Fukumizu, Kenji; Gretton, Arthur; Sun, Xiaohui; Schölkopf, Bernhard (2007). "Kernel Measures of Conditional Dependence". Advances in Neural Information Processing Systems.
6. Sejdinovic, Dino; Sriperumbudur, Bharath; Gretton, Arthur; Fukumizu, Kenji (2013). "Equivalence of distance-based and RKHS-based statistics in hypothesis testing". The Annals of Statistics 41 (5): 2263–2291. doi:10.1214/13-aos1140. arXiv:1207.6076.
7. Mroueh, Youssef; Li, Chun-Liang; Sercu, Tom; Raj, Anant; Cheng, Yu (2018). "Sobolev GAN". arXiv:1711.04894.
8. Uppal, Ananya; Singh, Shashank; Póczos, Barnabás (2019). "Nonparametric Density Estimation & Convergence Rates for GANs under Besov IPM Losses". arXiv:1902.03511.
9. Uppal, Ananya; Singh, Shashank; Póczos, Barnabás (2020). "Robust Density Estimation under Besov IPM Losses". arXiv:2004.08597.
10. Kim, Ilmun; Ramdas, Aaditya; Singh, Aarti; Wasserman, Larry (2021). "Classification accuracy as a proxy for two-sample testing". The Annals of Statistics 49 (1). doi:10.1214/20-AOS1962.
11. Lopez-Paz, David; Oquab, Maxime (2017). "Revisiting Classifier Two-Sample Tests". arXiv:1610.06545.
12. Arora, Sanjeev; Ge, Rong; Liang, Yingyu; Ma, Tengyu; Zhang, Yi (2017). "Generalization and Equilibrium in Generative Adversarial Nets (GANs)". International Conference on Machine Learning. arXiv:1703.00573.
13. Ji, Kaiyi; Liang, Yingbin (2018). "Minimax Estimation of Neural Net Distance". arXiv:1811.01054.
14. Stanczuk, Jan; Etmann, Christian; Kreusser, Lisa Maria; Schönlieb, Carola-Bibiane (2021). "Wasserstein GANs Work Because They Fail (To Approximate the Wasserstein Distance)". arXiv:2103.01678.
15. Mallasto, Anton; Montúfar, Guido; Gerolin, Augusto (2019). "How Well do WGANs Estimate the Wasserstein Metric?". arXiv:1910.03875.
16. Sutherland, Danica J. "Computing Jensen-Shannon Divergence between discrete and continuous distribution". Stack Exchange Network. Retrieved 18 July 2023.
17. Arjovsky, Martin; Bottou, Léon (2017). "Towards Principled Methods for Training Generative Adversarial Networks". arXiv:1701.04862.
18. Sønderby, Casper Kaae; Caballero, Jose; Theis, Lucas; Shi, Wenzhe; Huszár, Ferenc (2017). "Amortised MAP Inference for Image Super-resolution". arXiv:1610.04490 (Appendix C).
19. Sriperumbudur, Bharath K.; Fukumizu, Kenji; Gretton, Arthur; Schölkopf, Bernhard; Lanckriet, Gert R. G. (2009). "On integral probability metrics, φ-divergences and binary classification". arXiv:0901.2698.
20. Tolstikhin, Ilya O.; Sriperumbudur, Bharath K.; Schölkopf, Bernhard (2016). "Minimax Estimation of Maximum Mean Discrepancy with Radial Kernels". Advances in Neural Information Processing Systems.
21. Singh, Shashank; Póczos, Barnabás (2018). "Minimax Distribution Estimation in Wasserstein Distance". arXiv:1802.08855.
22. Bińkowski, Mikołaj; Sutherland, Danica J.; Arbel, Michael; Gretton, Arthur (2018). "Demystifying MMD GANs". arXiv:1801.01401.
23. Liu, Feng; Xu, Wenkai; Lu, Jie; Zhang, Guangquan; Gretton, Arthur; Sutherland, Danica J. (2020). "Learning Deep Kernels for Non-Parametric Two-Sample Tests". International Conference on Machine Learning. arXiv:2002.09116.
24. Kübler, Jonas M.; Jitkrittum, Wittawat; Schölkopf, Bernhard; Muandet, Krikamol (2021). "A Witness Two-Sample Test". International Conference on Artificial Intelligence and Statistics. arXiv:2102.05573.