Distribution learning theory explained

The distributional learning theory or learning of probability distribution is a framework in computational learning theory. It has been proposed from Michael Kearns, Yishay Mansour, Dana Ron, Ronitt Rubinfeld, Robert Schapire and Linda Sellie in 1994 ^[1] and it was inspired from the PAC-framework introduced by Leslie Valiant.^[2]

In this framework the input is a number of samples drawn from a distribution that belongs to a specific class of distributions. The goal is to find an efficient algorithm that, based on these samples, determines with high probability the distribution from which the samples have been drawn. Because of its generality, this framework has been used in a large variety of different fields like machine learning, approximation algorithms, applied probability and statistics.

This article explains the basic definitions, tools and results in this framework from the theory of computation point of view.

Definitions

Let

styleX

be the support of the distributions of interest. As in the original work of Kearns et al.^[1] if

styleX

is finite it can be assumed without loss of generality that

styleX=\{0,1\}ⁿ

where

stylen

is the number of bits that have to be used in order to represent any

styley\inX

. We focus in probability distributions over

styleX

There are two possible representations of a probability distribution

styleD

over

styleX

probability distribution function (or evaluator) an evaluator

styleE_D

for

styleD

takes as input any

styley\inX

and outputs a real number

styleE_D[y]

which denotes the probability that of

styley

according to

styleD

, i.e.

styleE_D[y]=\Pr[Y=y]

styleY\simD

generator a generator

styleG_D

for

styleD

takes as input a string of truly random bits

styley

and outputs

styleG_D[y]\inX

according to the distribution

styleD

. Generator can be interpreted as a routine that simulates sampling from the distribution

styleD

given a sequence of fair coin tosses.

A distribution

styleD

is called to have a polynomial generator (respectively evaluator) if its generator (respectively evaluator) exists and can be computed in polynomial time.

Let

styleC_X

a class of distribution over X, that is

styleC_X

is a set such that every

styleD\inC_X

is a probability distribution with support

styleX

. The

styleC_X

can also be written as

styleC

for simplicity.

Before defining learnability, it is necessary to define good approximations of a distribution

styleD

. There are several ways to measure the distance between two distribution. The three more common possibilities are

Kullback-Leibler divergence
Total variation distance of probability measures
Kolmogorov distance

The strongest of these distances is the Kullback-Leibler divergence and the weakest is the Kolmogorov distance. This means that for any pair of distributions

styleD

styleD'

KL-distance(D,D')\geTV-distance(D,D')\geKolmogorov-distance(D,D')

Therefore, for example if

styleD

and

styleD'

are close with respect to Kullback-Leibler divergence then they are also close with respectto all the other distances.

Next definitions hold for all the distances and therefore the symbol

styled(D,D')

denotes the distance between the distribution

styleD

and the distribution

styleD'

using one of the distances that we describe above. Although learnability of a class of distributions can be defined using any of these distances, applications refer to a specific distance.

The basic input that we use in order to learn a distribution is a number of samples drawn by this distribution. For the computational point of view the assumption is that such a sample is given in a constant amount of time. So it's like having access to an oracle

styleGEN(D)

that returns a sample from the distribution

styleD

. Sometimes the interest is, apart from measuring the time complexity, to measure the number of samples that have to be used in order to learn a specific distribution

styleD

in class of distributions

styleC

. This quantity is called sample complexity of the learning algorithm.

In order for the problem of distribution learning to be more clear consider the problem of supervised learning as defined in.^[3] In this framework of statistical learning theory a training set

styleS=\{(x_1,y_1),...,(x_n,y_n)\}

and the goal is to find a target function

stylef:X → Y

that minimizes some loss function, e.g. the square loss function. More formally

f=\argmin_g\intV(y,g(x))d\rho(x,y)

, where

V( ⋅ , ⋅ )

is the loss function, e.g.

V(y,z)=(y-z)²

and

\rho(x,y)

the probability distribution according to which the elements of the training set are sampled. If the conditional probability distribution

\rho_x(y)

is known then the target function has the closed form

f(x)=\int_yyd\rho_x(y)

. So the set

is a set of samples from the probability distribution

\rho(x,y)

. Now the goal of distributional learning theory if to find

\rho

given

which can be used to find the target function

Definition of learnability

A class of distributions

styleC

is called efficiently learnable if for every

style\epsilon>0

and

style0<\delta\le1

given access to

styleGEN(D)

for an unknown distribution

styleD\inC

, there exists a polynomial time algorithm

styleA

, called learning algorithm of

styleC

, that outputs a generator or an evaluator of a distribution

styleD'

such that

\Pr[d(D,D')\le\epsilon]\ge1-\delta

If we know that

styleD'\inC

then

styleA

is called proper learning algorithm, otherwise is called improper learning algorithm.

In some settings the class of distributions

styleC

is a class with well known distributions which can be described by a set of parameters. For instance

styleC

could be the class of all the Gaussian distributions

styleN(\mu,\sigma²⁾

. In this case the algorithm

styleA

should be able to estimate the parameters

style\mu,\sigma

. In this case

styleA

is called parameter learning algorithm.

Obviously the parameter learning for simple distributions is a very well studied field that is called statistical estimation and there is a very long bibliography on different estimators for different kinds of simple known distributions. But distributions learning theory deals with learning class of distributions that have more complicated description.

First results

In their seminal work, Kearns et al. deal with the case where

styleA

is described in term of a finite polynomial sized circuit and they proved the following for some specific classes of distribution.

styleOR

gate distributions for this kind of distributions there is no polynomial-sized evaluator, unless
style\#P\subseteqP/poly

. On the other hand, this class is efficiently learnable with generator.

Parity gate distributions this class is efficiently learnable with both generator and evaluator.
Mixtures of Hamming Balls this class is efficiently learnable with both generator and evaluator.
Probabilistic Finite Automata this class is not efficiently learnable with evaluator under the Noisy Parity Assumption which is an impossibility assumption in the PAC learning framework.

style\epsilon-

Covers

One very common technique in order to find a learning algorithm for a class of distributions

styleC

is to first find a small

style\epsilon-

cover of

styleC

Definition

A set

styleC_\epsilon

is called

style\epsilon

-cover of

styleC

if for every

styleD\inC

there is a

styleD'\inC_\epsilon

such that

styled(D,D')\le\epsilon

. An

style\epsilon-

cover is small if it has polynomial size with respect to the parameters that describe

styleD

Once there is an efficient procedure that for every

style\epsilon>0

finds a small

style\epsilon-

cover

styleC_\epsilon

of C then the only left task is to select from

styleC_\epsilon

the distribution

styleD'\inC_\epsilon

that is closer to the distribution

styleD\inC

that has to be learned.

The problem is that given

styleD',D''\inC_\epsilon

it is not trivial how we can compare

styled(D,D')

and

styled(D,D'')

in order to decide which one is the closest to

styleD

, because

styleD

is unknown. Therefore, the samples from

styleD

have to be used to do these comparisons. Obviously the result of the comparison always has a probability of error. So the task is similar with finding the minimum in a set of element using noisy comparisons. There are a lot of classical algorithms in order to achieve this goal. The most recent one which achieves the best guarantees was proposed by Daskalakis and Kamath^[4] This algorithm sets up a fast tournament between the elements of

styleC_\epsilon

where the winner

styleD^*

of this tournament is the element which is

style\epsilon-

close to

styleD

(i.e.

styled(D^*,D)\le\epsilon

) with probability at least

style1-\delta

. In order to do so their algorithm uses

styleO(logN/\epsilon²⁾

samples from

styleD

and runs in

styleO(NlogN/\epsilon²⁾

time, where

styleN=|C_\epsilon|

Learning sums of random variables

Learning of simple well known distributions is a well studied field and there are a lot of estimators that can be used. One more complicated class of distributions is the distribution of a sum of variables that follow simple distributions. These learning procedure have a close relation with limit theorems like the central limit theorem because they tend to examine the same object when the sum tends to an infinite sum. Recently there are two results that described here include the learning Poisson binomial distributions and learning sums of independent integer random variables. All the results below hold using the total variation distance as a distance measure.

Learning Poisson binomial distributions

Consider

stylen

independent Bernoulli random variables

styleX_1,...,X_n

with probabilities of success

stylep_1,...,p_n

. A Poisson Binomial Distribution of order

stylen

is the distribution of the sum

styleX=\sum_iX_i

. For learning the class

stylePBD=\{D:D~isaPoissonbinomialdistribution\}

. The first of the following results deals with the case of improper learning of

stylePBD

and the second with the proper learning of

stylePBD

.^[5]

Theorem

Let

styleD\inPBD

then there is an algorithm which given

stylen

style\epsilon>0

style0<\delta\le1

and access to

styleGEN(D)

finds a

styleD'

such that

style\Pr[d(D,D')\le\epsilon]\ge1-\delta

. The sample complexity of this algorithm is

style\tilde{O}((1/\epsilon³)log(1/\delta))

and the running time is

style\tilde{O}((1/\epsilon³⁾lognlog²(1/\delta))

Theorem

Let

styleD\inPBD

then there is an algorithm which given

stylen

style\epsilon>0

style0<\delta\le1

and access to

styleGEN(D)

finds a

styleD'\inPBD

such that

style\Pr[d(D,D')\le\epsilon]\ge1-\delta

. The sample complexity of this algorithm is

style\tilde{O}((1/\epsilon²))log(1/\delta)

and the running time is

style(1/

	O(log²(1/\epsilon))
\epsilon)

\tilde{O}(lognlog(1/\delta))

One part of the above results is that the sample complexity of the learning algorithm doesn't depend on

stylen

, although the description of

styleD

is linear in

stylen

. Also the second result is almost optimal with respect to the sample complexity because there is also a lower bound of

styleO(1/\epsilon²⁾

The proof uses a small

style\epsilon-

cover of

stylePBD

that has been produced by Daskalakis and Papadimitriou,^[6] in order to get this algorithm.

Learning Sums of Independent Integer Random Variables

Consider

stylen

independent random variables

styleX_1,...,X_n

each of which follows an arbitrary distribution with support

style\{0,1,...,k-1\}

. A

stylek-

sum of independent integer random variable of order

stylen

is the distribution of the sum

styleX=\sum_iX_i

. For learning the class

stylek-SIIRV=\{D:Disak-sumofindependentintegerrandomvariable\}

there is the following result

Theorem

Let

styleD\ink-SIIRV

then there is an algorithm which given

stylen

style\epsilon>0

and access to

styleGEN(D)

finds a

styleD'

such that

style\Pr[d(D,D')\le\epsilon]\ge1-\delta

. The sample complexity of this algorithm is

stylepoly(k/\epsilon)

and the running time is also

stylepoly(k/\epsilon)

Another part is that the sample and the time complexity does not depend on

stylen

. Its possible to conclude this independence for the previous section if we set

stylek=2

.^[7]

Learning mixtures of Gaussians

Let the random variables

styleX\simN(\mu_1,\Sigma₁₎

and

styleY\simN(\mu_2,\Sigma₂₎

. Define the random variable

styleZ

which takes the same value as

styleX

with probability

stylew₁

and the same value as

styleY

with probability

stylew₂=1-w₁

. Then if

styleF₁

is the density of

styleX

and

styleF₂

is the density of

styleY

the density of

styleZ

styleF=w₁F₁+w₂F₂

. In this case

styleZ

is said to follow a mixture of Gaussians. Pearson ^[8] was the first who introduced the notion of the mixtures of Gaussians in his attempt to explain the probability distribution from which he got same data that he wanted to analyze. So after doing a lot of calculations by hand, he finally fitted his data to a mixture of Gaussians. The learning task in this case is to determine the parameters of the mixture

stylew_1,w_2,\mu_1,\mu_2,\Sigma_1,\Sigma₂

The first attempt to solve this problem was from Dasgupta.^[9] In this work Dasgupta assumes that the two means of the Gaussians are far enough from each other. This means that there is a lower bound on the distance

style||\mu₁-\mu_2||

. Using this assumption Dasgupta and a lot of scientists after him were able to learn the parameters of the mixture. The learning procedure starts with clustering the samples into two different clusters minimizing some metric. Using the assumption that the means of the Gaussians are far away from each other with high probability the samples in the first cluster correspond to samples from the first Gaussian and the samples in the second cluster to samples from the second one. Now that the samples are partitioned the

style\mu_i,\Sigma_i

can be computed from simple statistical estimators and

stylew_i

by comparing the magnitude of the clusters.

styleGM

is the set of all the mixtures of two Gaussians, using the above procedure theorems like the following can be proved.

Theorem ^[9]

Let

styleD\inGM

with

style||\mu₁-\mu_2||\gec\sqrt{nmax(λ_max(\Sigma_1),λ_max(\Sigma_2))}

, where

stylec>1/2

and

styleλ_max(A)

the largest eigenvalue of

styleA

, then there is an algorithm which given

style\epsilon>0

style0<\delta\le1

and access to

styleGEN(D)

finds an approximation

stylew'_i,\mu'_i,\Sigma'_i

of the parameters such that

style\Pr[||w_i-w'_i||\le\epsilon]\ge1-\delta

(respectively for

style\mu_i

and

style\Sigma_i

. The sample complexity of this algorithm is

styleM=

	O(log²(1/(\epsilon\delta)))
2

and the running time is

styleO(M²d+Mdn)

The above result could also be generalized in

stylek-

mixture of Gaussians.^[9]

For the case of mixture of two Gaussians there are learning results without the assumption of the distance between their means, like the following one which uses the total variation distance as a distance measure.

Theorem

Let

styleF\inGM

then there is an algorithm which given

style\epsilon>0

style0<\delta\le1

and access to

styleGEN(D)

finds

stylew'_i,\mu'_i,\Sigma'_i

such that if

styleF'=w'₁F'₁+w'₂F'₂

, where

styleF'_i=N(\mu'_i,\Sigma'_i)

then

style\Pr[d(F,F')\le\epsilon]\ge1-\delta

. The sample complexity and the running time of this algorithm is

stylepoly(n,1/\epsilon,1/\delta,1/w_1,1/w_2,1/d(F_1,F₂₎₎

The distance between

styleF₁

and

styleF₂

doesn't affect the quality of the result of the algorithm but just the sample complexity and the running time.^[10]

References

M. Kearns, Y. Mansour, D. Ron, R. Rubinfeld, R. Schapire, L. Sellie On the Learnability of Discrete Distributions. ACM Symposium on Theory of Computing, 1994 http://dl.acm.org/citation.cfm?id=195155
http://dl.acm.org/citation.cfm?id=1972 L. Valiant A theory of the learnable. Communications of ACM, 1984
Lorenzo Rosasco, Tomaso Poggio, "A Regularization Tour of Machine Learning — MIT-9.520 Lectures Notes" Manuscript, Dec. 2014 https://www.mit.edu/~9.520/fall14/
C. Daskalakis, G. Kamath Faster and Sample Near-Optimal Algorithms for Proper Learning Mixtures of Gaussians. Annual Conference on Learning Theory, 2014 https://arxiv.org/abs/1312.1054
C. Daskalakis, I. Diakonikolas, R. Servedio Learning Poisson Binomial Distributions. ACM Symposium on Theory of Computing, 2012 http://dl.acm.org/citation.cfm?id=2214042
C. Daskalakis, C. Papadimitriou Sparse Covers for Sums of Indicators. Probability Theory and Related Fields, 2014 https://link.springer.com/article/10.1007%2Fs00440-014-0582-8
C. Daskalakis, I. Diakonikolas, R. O’Donnell, R. Servedio, L. Tan Learning Sums of Independent Integer Random Variables. IEEE Symposium on Foundations of Computer Science, 2013 http://dl.acm.org/citation.cfm?id=2570450.2570592
K. Pearson Contribution to the Mathematical Theory of Evolution. Philosophical Transactions of the Royal Society in London, 1894 https://www.jstor.org/discover/10.2307/90649?sid=21104909080371&uid=4&uid=2
S. Dasgupta Learning Mixtures of Gaussians. IEEE Symposium on Foundations of Computer Science, 1999 http://dl.acm.org/citation.cfm?id=796496
A. Kalai, A. Moitra, G. Valiant Efficiently Learning Mixtures of Two Gaussians ACM Symposium on Theory of Computing, 2010 http://dl.acm.org/citation.cfm?id=1806765