Stability (learning theory) explained

Stability, also known as algorithmic stability, is a notion in computational learning theory of how a machine learning algorithm's output changes with small perturbations to its inputs. A stable learning algorithm is one for which the prediction does not change much when the training data is modified slightly. For instance, consider a machine learning algorithm that is being trained to recognize handwritten letters of the alphabet, using 1000 examples of handwritten letters and their labels ("A" to "Z") as a training set. One way to modify this training set is to leave out an example, so that only 999 examples of handwritten letters and their labels are available. A stable learning algorithm would produce a similar classifier with both the 1000-element and 999-element training sets.

Stability can be studied for many types of learning problems, from language learning to inverse problems in physics and engineering, as it is a property of the learning process rather than the type of information being learned. The study of stability gained importance in computational learning theory in the 2000s when it was shown to have a connection with generalization.[1] It was shown that for large classes of learning algorithms, notably empirical risk minimization algorithms, certain types of stability ensure good generalization.

History

A central goal in designing a machine learning system is to guarantee that the learning algorithm will generalize, or perform accurately on new examples after being trained on a finite number of them. In the 1990s, milestones were reached in obtaining generalization bounds for supervised learning algorithms. The technique historically used to prove generalization was to show that an algorithm was consistent, using the uniform convergence properties of empirical quantities to their means. This technique was used to obtain generalization bounds for the large class of empirical risk minimization (ERM) algorithms. An ERM algorithm is one that selects a solution from a hypothesis space H in such a way as to minimize the empirical error on a training set S.

A general result, proved by Vladimir Vapnik for ERM binary classification algorithms, is that for any target function and input distribution, any hypothesis space H with VC-dimension d, and n training examples, the algorithm is consistent and will produce a training error that is at most O\left(\sqrt{d/n}\right) (plus logarithmic factors) from the true error. The result was later extended to almost-ERM algorithms with function classes that do not have unique minimizers.

Vapnik's work, using what became known as VC theory, established a relationship between the generalization of a learning algorithm and properties of the hypothesis space H of functions being learned. However, these results could not be applied to algorithms with hypothesis spaces of unbounded VC-dimension. Put another way, these results could not be applied when the information being learned had a complexity that was too large to measure. Some of the simplest machine learning algorithms, for instance for regression, have hypothesis spaces with unbounded VC-dimension. Another example is language learning algorithms that can produce sentences of arbitrary length.

Stability analysis was developed in the 2000s for computational learning theory and is an alternative method for obtaining generalization bounds. The stability of an algorithm is a property of the learning process, rather than a direct property of the hypothesis space H, and it can be assessed for algorithms whose hypothesis spaces have unbounded or undefined VC-dimension, such as nearest neighbor. A stable learning algorithm is one for which the learned function does not change much when the training set is slightly modified, for instance by leaving out an example. A measure of leave-one-out error is used in a leave-one-out cross-validation (CVloo) procedure to evaluate a learning algorithm's stability with respect to the loss function. As such, stability analysis is the application of sensitivity analysis to machine learning.

Summary of classic results

Stability in learning theory was earliest described in terms of continuity of the learning map L, traced to Andrey Nikolayevich Tikhonov.

Preliminary definitions

We define several terms related to learning algorithms' training sets, so that we can then define stability in multiple ways and present theorems from the field.

A machine learning algorithm, also known as a learning map L, maps a training data set, which is a set of labeled examples (x, y), onto a function f from X to Y, where X and Y are in the same space as the training examples. The functions f are selected from a hypothesis space of functions called H.

The training set from which an algorithm learns is defined as

S = \{z_1 = (x_1, y_1), \ldots, z_m = (x_m, y_m)\}

and is of size m in Z = X \times Y, drawn i.i.d. from an unknown distribution D.

Thus, the learning map L is defined as a mapping from Z^m into H, mapping a training set S onto a function f_S from X to Y. Here, we consider only deterministic algorithms where L is symmetric with respect to S, i.e. it does not depend on the order of the elements in the training set. Furthermore, we assume that all functions are measurable and all sets are countable.

The loss V of a hypothesis f with respect to an example z = (x, y) is then defined as

V(f, z) = V(f(x), y).

The empirical error of f is

I_S[f] = \frac{1}{m} \sum_{i=1}^m V(f, z_i).

The true error of f is

I[f] = \mathbb{E}_z V(f, z).

Given a training set S of size m, we will build, for all i = 1, \ldots, m, modified training sets as follows:

S^{|i} = \{z_1, \ldots, z_{i-1}, z_{i+1}, \ldots, z_m\} (S with the i-th example removed)

S^i = \{z_1, \ldots, z_{i-1}, z_i', z_{i+1}, \ldots, z_m\} (S with the i-th example replaced by z_i')
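As a concrete illustration, the two perturbed training sets can be built as follows. This is a minimal sketch assuming examples are stored in a plain Python list; the helper names `leave_one_out` and `replace_one` are ours, not from the literature:

```python
def leave_one_out(S, i):
    """Return S^{|i}: the training set with the i-th example removed."""
    return S[:i] + S[i + 1:]

def replace_one(S, i, z_prime):
    """Return S^i: the training set with the i-th example replaced by z_prime."""
    return S[:i] + [z_prime] + S[i + 1:]

# Examples z_j = (x_j, y_j)
S = [(0.0, 0.1), (1.0, 0.9), (2.0, 2.2)]
print(leave_one_out(S, 1))            # [(0.0, 0.1), (2.0, 2.2)]
print(replace_one(S, 1, (1.0, 1.1)))  # [(0.0, 0.1), (1.0, 1.1), (2.0, 2.2)]
```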

Definitions of stability

Hypothesis Stability

An algorithm L has hypothesis stability β with respect to the loss function V if the following holds:

\forall i \in \{1, \ldots, m\}, \quad \mathbb{E}_{S,z}\left[|V(f_S, z) - V(f_{S^{|i}}, z)|\right] \leq \beta.

Point-wise Hypothesis Stability

An algorithm L has point-wise hypothesis stability β with respect to the loss function V if the following holds:

\forall i \in \{1, \ldots, m\}, \quad \mathbb{E}_S\left[|V(f_S, z_i) - V(f_{S^{|i}}, z_i)|\right] \leq \beta.

Error Stability

An algorithm L has error stability β with respect to the loss function V if the following holds:

\forall S \in Z^m, \forall i \in \{1, \ldots, m\}, \quad \left|\mathbb{E}_z[V(f_S, z)] - \mathbb{E}_z[V(f_{S^{|i}}, z)]\right| \leq \beta.

Uniform Stability

An algorithm L has uniform stability β with respect to the loss function V if the following holds:

\forall S \in Z^m, \forall i \in \{1, \ldots, m\}, \quad \sup_{z \in Z} |V(f_S, z) - V(f_{S^{|i}}, z)| \leq \beta.

A probabilistic version of uniform stability β is:

\forall S \in Z^m, \forall i \in \{1, \ldots, m\}, \quad \mathbb{P}_S\left\{\sup_{z \in Z} |V(f_S, z) - V(f_{S^{|i}}, z)| \leq \beta\right\} \geq 1 - \delta.

An algorithm is said to be stable when the value of \beta decreases as O(1/m).
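To make the definition concrete, the following sketch numerically estimates a uniform-stability-style quantity for Tikhonov-regularized least squares (ridge regression): the maximum, over leave-one-out perturbations and over a finite test grid standing in for the supremum over z, of the change in squared loss. The data, the regularization parameter, and the function names are our own illustrative choices, not a definitive implementation:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form Tikhonov-regularized least squares:
    w = (X^T X + lam * m * I)^{-1} X^T y."""
    m, d = X.shape
    return np.linalg.solve(X.T @ X + lam * m * np.eye(d), X.T @ y)

def sup_loss_gap(X, y, lam, Xtest, ytest):
    """Empirical stand-in for the uniform-stability supremum: the maximum,
    over i and over a finite test grid, of |V(f_S, z) - V(f_{S|i}, z)|
    under the squared loss V."""
    m = len(y)
    w_full = ridge_fit(X, y, lam)
    beta = 0.0
    for i in range(m):
        mask = np.arange(m) != i               # indices of S^{|i}
        w_loo = ridge_fit(X[mask], y[mask], lam)
        gap = np.abs((Xtest @ w_full - ytest) ** 2 - (Xtest @ w_loo - ytest) ** 2)
        beta = max(beta, gap.max())
    return beta

rng = np.random.default_rng(0)
Xtest = rng.normal(size=(50, 3))
ytest = Xtest @ np.ones(3)
for m in (20, 200):
    X = rng.normal(size=(m, 3))
    y = X @ np.ones(3) + 0.1 * rng.normal(size=m)
    print(m, sup_loss_gap(X, y, 1.0, Xtest, ytest))
```

On this synthetic data the estimated gap shrinks as m grows, consistent with the O(1/m) behavior expected of a stable algorithm.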

Leave-one-out cross-validation (CVloo) Stability

An algorithm L has CVloo stability β with respect to the loss function V if the following holds:

\forall i \in \{1, \ldots, m\}, \quad \mathbb{P}_S\left\{|V(f_S, z_i) - V(f_{S^{|i}}, z_i)| \leq \beta_{CV}\right\} \geq 1 - \delta_{CV}.

The definition of CVloo stability is equivalent to the point-wise hypothesis stability seen earlier.
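The quantity in this definition can be computed exactly for very simple learners. As a numeric sketch (the choice of predictor and the helper name are ours), take the constant ERM predictor f_S = mean(y) under squared loss: leaving out z_i shifts the mean by (mean - y_i)/(m - 1), so the leave-one-out loss gap shrinks on the order of 1/m:

```python
import numpy as np

def loo_gaps(y):
    """CVloo gaps |V(f_S, z_i) - V(f_{S|i}, z_i)| for the constant predictor
    f_S = mean(y) under squared loss."""
    m = len(y)
    mean_full = y.mean()
    gaps = []
    for i in range(m):
        mean_loo = (y.sum() - y[i]) / (m - 1)   # mean of S^{|i}
        v_full = (mean_full - y[i]) ** 2        # loss of f_S on z_i
        v_loo = (mean_loo - y[i]) ** 2          # loss of f_{S|i} on z_i
        gaps.append(abs(v_full - v_loo))
    return np.array(gaps)

rng = np.random.default_rng(1)
for m in (10, 100, 1000):
    y = rng.normal(size=m)
    print(m, loo_gaps(y).max())   # max gap shrinks roughly like 1/m
```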

Expected-leave-one-out error (Elooerr) Stability

An algorithm L has Elooerr stability if for each m there exist a \beta_{EL}^m and a \delta_{EL}^m such that:

\forall i \in \{1, \ldots, m\}, \quad \mathbb{P}_S\left\{\left|I[f_S] - \frac{1}{m}\sum_{i=1}^m V(f_{S^{|i}}, z_i)\right| \leq \beta_{EL}^m\right\} \geq 1 - \delta_{EL}^m,

with \beta_{EL}^m and \delta_{EL}^m going to zero as m \to \infty.

Classic theorems

From Bousquet and Elisseeff (02):

For symmetric learning algorithms with bounded loss, if the algorithm has Uniform Stability with the probabilistic definition above, then the algorithm generalizes.

Uniform Stability is a strong condition which is not met by all algorithms, but is, surprisingly, met by the large and important class of regularization algorithms. The generalization bound is given in the article.

From Mukherjee et al. (06):

For symmetric learning algorithms with bounded loss, if the algorithm has both leave-one-out cross-validation (CVloo) Stability and expected-leave-one-out error (Elooerr) Stability as defined above, then the algorithm generalizes.

This is an important result for the foundations of learning theory, because it shows that two previously unrelated properties of an algorithm, stability and consistency, are equivalent for ERM (and certain loss functions). The generalization bound is given in the article.

Algorithms that are stable

This is a list of algorithms that have been shown to be stable, and the articles where the associated generalization bounds are provided.

Support vector machine (SVM) classification with a bounded kernel, where a large regularization constant C leads to good stability.

A version of bagging regularizers with the number k of regressors increasing with n.[10]


Notes and References

  1. O. Bousquet and A. Elisseeff. Stability and Generalization. Journal of Machine Learning Research, 2 (Mar): 499–526, 2002. ISSN 1533-7928.
  2. L. Devroye and Wagner, Distribution-free performance bounds for potential function rules, IEEE Trans. Inf. Theory 25(5) (1979) 601–604.
  3. M. Kearns and D. Ron, Algorithmic stability and sanity-check bounds for leave-one-out cross-validation, Neural Comput. 11(6) (1999) 1427–1453.
  4. O. Bousquet and A. Elisseeff. Stability and generalization. J. Mach. Learn. Res., 2:499–526, 2002.
  5. S. Kutin and P. Niyogi, Almost-everywhere algorithmic stability and generalization error, Technical Report TR-2002-03, University of Chicago (2002).
  6. S. Mukherjee, P. Niyogi, T. Poggio, and R. M. Rifkin. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Adv. Comput. Math., 25(1-3):161–193, 2006.
  7. Shalev Shwartz, S., Shamir, O., Srebro, N., Sridharan, K., Learnability, Stability and Uniform Convergence, Journal of Machine Learning Research, 11(Oct):2635-2670, 2010.
  8. Moritz Hardt, Benjamin Recht, Yoram Singer, Train faster, generalize better: Stability of stochastic gradient descent, ICML 2016.
  9. Elisseeff, A. A study about algorithmic stability and their relation to generalization performances. Technical report. (2000)
  10. Rifkin, R. Everything Old is New Again: A fresh look at historical approaches in machine learning. Ph.D. Thesis, MIT, 2002
  11. Rosasco, L. and Poggio, T. Stability of Tikhonov Regularization, 2009