Distribution learning theory explained
The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 [1] and it was inspired from the PAC-framework introduced by Leslie Valiant.[2]
In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of different fields like machine learning, approximation algorithms, applied probability and statistics.
This article explains the basic definitions, tools and results in this framework from the theory of computation point of view.
Definitions
Let
be the support of the distributions of interest. As in the original work of Kearns et al.
[1] if
is finite it can be assumed without loss of generality that
where
is the number of bits that have to be used in order to represent any
. We focus in probability distributions over
.
There are two possible representations of a probability distribution
over
.
- probability distribution function (or evaluator) an evaluator
for
takes as input any
and outputs a real number
which denotes the probability that of
according to
, i.e.
if
.
for
takes as input a string of truly random bits
and outputs
according to the distribution
. Generator can be interpreted as a routine that simulates sampling from the distribution
given a sequence of fair coin tosses.
A distribution
is called to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time.
Let
a class of distribution over X, that is
is a set such that every
is a probability distribution with support
. The
can also be written as
for simplicity.
Before defining learnability, it is necessary to define good approximations of a distribution
. There are several ways to measure the distance between two distribution. The three more common possibilities are
The strongest of these distances is the Kullback-Leibler divergence and the weakest is the Kolmogorov distance. This means that for any pair of distributions
,
:
KL-distance(D,D')\geTV-distance(D,D')\geKolmogorov-distance(D,D')
Therefore, for example if
and
are close with respect to
Kullback-Leibler divergence then they are also close with respectto all the other distances.
Next definitions hold for all the distances and therefore the symbol
denotes the distance between the distribution
and the distribution
using one of the distances that we describe above. Although learnability of a class of distributions can be defined using any of these distances, applications refer to a specific distance.
The basic input that we use in order to learn a distribution is a number of samples drawn by this distribution. For the computational point of view the assumption is that such a sample is given in a constant amount of time. So it's like having access to an oracle
that returns a sample from the distribution
. Sometimes the interest is, apart from measuring the time complexity, to measure the number of samples that have to be used in order to learn a specific distribution
in class of distributions
. This quantity is called
sample complexity of the learning algorithm.
In order for the problem of distribution learning to be more clear consider the problem of supervised learning as defined in.[3] In this framework of statistical learning theory a training set
styleS=\{(x1,y1),...,(xn,yn)\}
and the goal is to find a target function
that minimizes some loss function, e.g. the square loss function. More formally
f=\argming\intV(y,g(x))d\rho(x,y)
, where
is the loss function, e.g.
and
the probability distribution according to which the elements of the training set are sampled. If the
conditional probability distribution
is known then the target function has the closed form
. So the set
is a set of samples from the
probability distribution
. Now the goal of distributional learning theory if to find
given
which can be used to find the target function
.
Definition of learnability
A class of distributions
is called
efficiently learnable if for every
and
given access to
for an unknown distribution
, there exists a polynomial time algorithm
, called learning algorithm of
, that outputs a generator or an evaluator of a distribution
such that
\Pr[d(D,D')\le\epsilon]\ge1-\delta
If we know that
then
is called
proper learning algorithm, otherwise is called
improper learning algorithm.
In some settings the class of distributions
is a class with well known distributions which can be described by a set of parameters. For instance
could be the class of all the Gaussian distributions
. In this case the algorithm
should be able to estimate the parameters
. In this case
is called
parameter learning algorithm.
Obviously the parameter learning for simple distributions is a very well studied field that is called statistical estimation and there is a very long bibliography on different estimators for different kinds of simple known distributions. But distributions learning theory deals with learning class of distributions that have more complicated description.
First results
In their seminal work, Kearns et al. deal with the case where
is described in term of a finite polynomial sized circuit and they proved the following for some specific classes of distribution.
gate distributions
for this kind of distributions there is no polynomial-sized evaluator, unless
. On the other hand, this class is efficiently learnable with generator.- Parity gate distributions this class is efficiently learnable with both generator and evaluator.
- Mixtures of Hamming Balls this class is efficiently learnable with both generator and evaluator.
- Probabilistic Finite Automata this class is not efficiently learnable with evaluator under the Noisy Parity Assumption which is an impossibility assumption in the PAC learning framework.
Covers
One very common technique in order to find a learning algorithm for a class of distributions
is to first find a small
cover of
.
Definition
A set
is called
-cover of
if for every
there is a
such that
. An
cover is small if it has polynomial size with respect to the parameters that describe
.
Once there is an efficient procedure that for every
finds a small
cover
of C then the only left task is to select from
the distribution
that is closer to the distribution
that has to be learned.
The problem is that given
it is not trivial how we can compare
and
in order to decide which one is the closest to
, because
is unknown. Therefore, the samples from
have to be used to do these comparisons. Obviously the result of the comparison always has a probability of error. So the task is similar with finding the minimum in a set of element using noisy comparisons. There are a lot of classical algorithms in order to achieve this goal. The most recent one which achieves the best guarantees was proposed by
Daskalakis and Kamath
[4] This algorithm sets up a fast tournament between the elements of
where the winner
of this tournament is the element which is
close to
(i.e.
) with probability at least
. In order to do so their algorithm uses
samples from
and runs in
time, where
.
Learning sums of random variables
Learning of simple well known distributions is a well studied field and there are a lot of estimators that can be used. One more complicated class of distributions is the distribution of a sum of variables that follow simple distributions. These learning procedure have a close relation with limit theorems like the central limit theorem because they tend to examine the same object when the sum tends to an infinite sum. Recently there are two results that described here include the learning Poisson binomial distributions and learning sums of independent integer random variables. All the results below hold using the total variation distance as a distance measure.
Learning Poisson binomial distributions
Consider
independent Bernoulli random variables
with probabilities of success
. A Poisson Binomial Distribution of order
is the distribution of the sum
. For learning the class
stylePBD=\{D:D~isaPoissonbinomialdistribution\}
. The first of the following results deals with the case of improper learning of
and the second with the proper learning of
.
[5] Theorem
Let
then there is an algorithm which given
,
,
and access to
finds a
such that
style\Pr[d(D,D')\le\epsilon]\ge1-\delta
. The sample complexity of this algorithm is
style\tilde{O}((1/\epsilon3)log(1/\delta))
and the running time is
style\tilde{O}((1/\epsilon3)lognlog2(1/\delta))
.
Theorem
Let
then there is an algorithm which given
,
,
and access to
finds a
such that
style\Pr[d(D,D')\le\epsilon]\ge1-\delta
. The sample complexity of this algorithm is
style\tilde{O}((1/\epsilon2))log(1/\delta)
and the running time is
style(1/
| O(log2(1/\epsilon)) |
\epsilon) | |
\tilde{O}(lognlog(1/\delta))
.
One part of the above results is that the sample complexity of the learning algorithm doesn't depend on
, although the description of
is linear in
. Also the second result is almost optimal with respect to the sample complexity because there is also a lower bound of
.
The proof uses a small
cover of
that has been produced by Daskalakis and Papadimitriou,
[6] in order to get this algorithm.
Learning Sums of Independent Integer Random Variables
Consider
independent random variables
each of which follows an arbitrary distribution with support
. A
sum of independent integer random variable of order
is the distribution of the sum
. For learning the class
stylek-SIIRV=\{D:Disak-sumofindependentintegerrandomvariable\}
there is the following result
Theorem
Let
then there is an algorithm which given
,
and access to
finds a
such that
style\Pr[d(D,D')\le\epsilon]\ge1-\delta
. The sample complexity of this algorithm is
and the running time is also
.
Another part is that the sample and the time complexity does not depend on
. Its possible to conclude this independence for the previous section if we set
.
[7] Learning mixtures of Gaussians
Let the random variables
styleX\simN(\mu1,\Sigma1)
and
styleY\simN(\mu2,\Sigma2)
. Define the random variable
which takes the same value as
with probability
and the same value as
with probability
. Then if
is the density of
and
is the density of
the density of
is
. In this case
is said to follow a mixture of Gaussians. Pearson
[8] was the first who introduced the notion of the mixtures of Gaussians in his attempt to explain the probability distribution from which he got same data that he wanted to analyze. So after doing a lot of calculations by hand, he finally fitted his data to a mixture of Gaussians. The learning task in this case is to determine the parameters of the mixture
stylew1,w2,\mu1,\mu2,\Sigma1,\Sigma2
.
The first attempt to solve this problem was from Dasgupta.[9] In this work Dasgupta assumes that the two means of the Gaussians are far enough from each other. This means that there is a lower bound on the distance
. Using this assumption Dasgupta and a lot of scientists after him were able to learn the parameters of the mixture. The learning procedure starts with
clustering the samples into two different clusters minimizing some metric. Using the assumption that the means of the Gaussians are far away from each other with high probability the samples in the first cluster correspond to samples from the first Gaussian and the samples in the second cluster to samples from the second one. Now that the samples are partitioned the
can be computed from simple statistical estimators and
by comparing the magnitude of the clusters.
If
is the set of all the mixtures of two Gaussians, using the above procedure theorems like the following can be proved.
Theorem [9]
Let
with
style||\mu1-\mu2||\gec\sqrt{nmax(λmax(\Sigma1),λmax(\Sigma2))}
, where
and
the largest eigenvalue of
, then there is an algorithm which given
,
and access to
finds an approximation
of the parameters such that
style\Pr[||wi-w'i||\le\epsilon]\ge1-\delta
(respectively for
and
. The sample complexity of this algorithm is
styleM=
| O(log2(1/(\epsilon\delta))) |
2 | |
and the running time is
.
The above result could also be generalized in
mixture of Gaussians.
[9] For the case of mixture of two Gaussians there are learning results without the assumption of the distance between their means, like the following one which uses the total variation distance as a distance measure.
Theorem
Let
then there is an algorithm which given
,
and access to
finds
such that if
, where
styleF'i=N(\mu'i,\Sigma'i)
then
style\Pr[d(F,F')\le\epsilon]\ge1-\delta
. The sample complexity and the running time of this algorithm is
stylepoly(n,1/\epsilon,1/\delta,1/w1,1/w2,1/d(F1,F2))
.
The distance between
and
doesn't affect the quality of the result of the algorithm but just the sample complexity and the running time.
[10] References
- M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, L. Sellie On the Learnability of Discrete Distributions. ACM Symposium on Theory of Computing, 1994 http://dl.acm.org/citation.cfm?id=195155
- http://dl.acm.org/citation.cfm?id=1972 L. Valiant A theory of the learnable. Communications of ACM, 1984
- Lorenzo Rosasco, Tomaso Poggio, "A Regularization Tour of Machine Learning — MIT-9.520 Lectures Notes" Manuscript, Dec. 2014 https://www.mit.edu/~9.520/fall14/
- C. Daskalakis, G. Kamath Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians. Annual Conference on Learning Theory, 2014 https://arxiv.org/abs/1312.1054
- C. Daskalakis, I. Diakonikolas, R. Servedio Learning Poisson Binomial Distributions. ACM Symposium on Theory of Computing, 2012 http://dl.acm.org/citation.cfm?id=2214042
- C. Daskalakis, C. Papadimitriou Sparse Covers for Sums of Indicators. Probability Theory and Related Fields, 2014 https://link.springer.com/article/10.1007%2Fs00440-014-0582-8
- C. Daskalakis, I. Diakonikolas, R. O’Donnell, R. Servedio, L. Tan Learning Sums of Independent Integer Random Variables. IEEE Symposium on Foundations of Computer Science, 2013 http://dl.acm.org/citation.cfm?id=2570450.2570592
- K. Pearson Contribution to the Mathematical Theory of Evolution. Philosophical Transactions of the Royal Society in London, 1894 https://www.jstor.org/discover/10.2307/90649?sid=21104909080371&uid=4&uid=2
- S. Dasgupta Learning Mixtures of Gaussians. IEEE Symposium on Foundations of Computer Science, 1999 http://dl.acm.org/citation.cfm?id=796496
- A. Kalai, A. Moitra, G. Valiant Efficiently Learning Mixtures of Two Gaussians ACM Symposium on Theory of Computing, 2010 http://dl.acm.org/citation.cfm?id=1806765