In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures the richness of a class of sets with respect to a probability distribution. The concept can also be extended to real-valued functions.
Given a set $A \subseteq \mathbb{R}^m$, the Rademacher complexity of $A$ is defined as follows:

$$\operatorname{Rad}(A) := \frac{1}{m} \mathbb{E}_\sigma\left[\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i\right]$$

where $\sigma_1, \sigma_2, \dots, \sigma_m$ are independent random variables drawn from the Rademacher distribution, i.e. $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$ for $i = 1, 2, \dots, m$, and $a = (a_1, \ldots, a_m)$ ranges over the elements of $A$.
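Since the definition is an expectation over random sign vectors, it can be approximated numerically for a finite set of vectors. The following is a minimal sketch in Python (assuming NumPy; the helper name rademacher_complexity is ours, not standard) that estimates $\operatorname{Rad}(A)$ by Monte Carlo sampling of the $\sigma_i$:

    import numpy as np

    def rademacher_complexity(A, num_draws=100_000, seed=0):
        # A: array of shape (k, m); each row is one vector a in the set A.
        A = np.asarray(A, dtype=float)
        m = A.shape[1]
        rng = np.random.default_rng(seed)
        # Draw random sign vectors sigma in {-1, +1}^m, one row per draw.
        sigma = rng.choice([-1.0, 1.0], size=(num_draws, m))
        # For each draw, sup over a in A of sum_i sigma_i * a_i is a row-wise max.
        sups = (sigma @ A.T).max(axis=1)
        # Average over draws and divide by m, per the definition above.
        return sups.mean() / m

For a finite set the supremum is an exact maximum over rows, so the only error is the Monte Carlo error in the expectation; for instance, rademacher_complexity([[1.0, 1.0], [1.0, 2.0]]) returns approximately 0.25, matching example 2 below.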
Let $S = \{z_1, z_2, \dots, z_m\} \subset Z$ be a sample of points and let $\mathcal{F}$ be a class of real-valued functions defined on a domain space $Z$. The empirical Rademacher complexity of $\mathcal{F}$ given $S$ is defined as:

$$\operatorname{Rad}_S(\mathcal{F}) = \frac{1}{m} \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i)\right]$$

This can also be written using the previous definition:[2]

$$\operatorname{Rad}_S(\mathcal{F}) = \operatorname{Rad}(\mathcal{F} \circ S)$$

where $\mathcal{F} \circ S$ denotes function composition, i.e.:

$$\mathcal{F} \circ S := \{(f(z_1), \ldots, f(z_m)) \mid f \in \mathcal{F}\}$$
Let $P$ be a probability distribution over $Z$. The Rademacher complexity of the function class $\mathcal{F}$ with respect to $P$ for sample size $m$ is:

$$\operatorname{Rad}_{P,m}(\mathcal{F}) := \mathbb{E}_{S \sim P^m}\left[\operatorname{Rad}_S(\mathcal{F})\right]$$

where the above expectation is taken over an independent and identically distributed (i.i.d.) sample $S = (z_1, z_2, \dots, z_m)$ generated according to $P$.
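As a concrete illustration, the distributional complexity can be approximated by averaging the empirical complexity over many fresh samples drawn from $P$. Below is a sketch in Python (assuming NumPy; the class of threshold functions $f_t(z) = \mathbf{1}[z \leq t]$ on $Z = [0,1]$, the helper name, and all numeric settings are our illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    m, num_samples, num_draws = 30, 200, 2000

    def empirical_rad_thresholds(S, rng):
        # For threshold functions f_t(z) = 1[z <= t], the set F o S is finite:
        # after sorting S it consists of the m + 1 distinct "step" vectors.
        S = np.sort(S)
        thresholds = np.r_[-np.inf, S]
        vectors = (S[None, :] <= thresholds[:, None]).astype(float)
        sigma = rng.choice([-1.0, 1.0], size=(num_draws, len(S)))
        return (sigma @ vectors.T).max(axis=1).mean() / len(S)

    # Rad_{P,m}(F): average Rad_S(F) over samples S ~ P^m, with P = Uniform[0,1].
    estimates = [empirical_rad_thresholds(rng.uniform(size=m), rng)
                 for _ in range(num_samples)]
    print(np.mean(estimates))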
The Rademacher complexity is typically applied to a function class of models that are used for classification, with the goal of measuring their ability to classify points drawn from a probability space under arbitrary labellings. When the function class is rich enough, it contains functions that can appropriately adapt to each arrangement of labels, simulated by the random draw of $\sigma_i$.
1. $A$ contains a single vector, e.g., $A = \{(a, b)\} \subset \mathbb{R}^2$. Then:

$$\operatorname{Rad}(A) = \frac{1}{2} \cdot \left(\frac{1}{4} \cdot (a+b) + \frac{1}{4} \cdot (a-b) + \frac{1}{4} \cdot (-a+b) + \frac{1}{4} \cdot (-a-b)\right) = 0$$

The same computation shows that the Rademacher complexity of any singleton set is zero.
2. $A$ contains two vectors, e.g., $A = \{(1, 1), (1, 2)\} \subset \mathbb{R}^2$. Then:

$$\begin{align} \operatorname{Rad}(A) &= \frac{1}{2} \cdot \left(\frac{1}{4} \cdot \max(1+1, 1+2) + \frac{1}{4} \cdot \max(1-1, 1-2) + \frac{1}{4} \cdot \max(-1+1, -1+2) + \frac{1}{4} \cdot \max(-1-1, -1-2)\right)\\ &= \frac{1}{8}(3 + 0 + 1 - 2) = \frac{1}{4} \end{align}$$
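These small examples can also be checked exactly, without Monte Carlo error, by enumerating all $2^m$ sign vectors. A short Python check of example 2:

    from itertools import product

    import numpy as np

    A = np.array([[1, 1], [1, 2]])
    m = A.shape[1]
    # Each of the 2^m sign vectors has probability 2^-m.
    total = sum(max(np.dot(sigma, a) for a in A)
                for sigma in product([-1, 1], repeat=m))
    rad = total / 2 ** m / m
    print(rad)  # 0.25, matching the hand computation above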
The Rademacher complexity can be used to derive data-dependent upper bounds on the learnability of function classes. Intuitively, a function class with smaller Rademacher complexity is easier to learn.
In machine learning, it is desired to have a training set that represents the true distribution of some sample data $S$. This can be quantified using the notion of representativeness. Denote by $P$ the probability distribution from which the samples are drawn. Denote by $H$ the set of hypotheses (potential classifiers) and by $F$ the corresponding set of error functions, i.e., for every hypothesis $h \in H$, there is a function $f_h \in F$ that maps each training sample (features, label) to the error of the classifier $h$ on that sample. For example, if $h$ is a binary classifier, $f_h$ returns 0 on a sample that $h$ classifies correctly and 1 otherwise. We write $f$ instead of $f_h$ when the underlying hypothesis is irrelevant. Define:

$$L_P(f) := \mathbb{E}_{z \sim P}[f(z)]$$ – the expected error of an error function $f \in F$ on the real distribution $P$;

$$L_S(f) := \frac{1}{m} \sum_{i=1}^{m} f(z_i)$$ – the estimated error of an error function $f \in F$ on the sample $S$.

The representativeness of the sample $S$, with respect to $P$ and $F$, is defined as:

$$\operatorname{Rep}_P(F, S) := \sup_{f \in F} (L_P(f) - L_S(f))$$
Smaller representativeness is better, since it provides a way to avoid overfitting: it means that the true error of a classifier is not much higher than its estimated error, and so selecting a classifier that has low estimated error will ensure that the true error is also low. Note, however, that the concept of representativeness is relative and hence cannot be compared across distinct samples.
The expected representativeness of a sample can be bounded above by the Rademacher complexity of the function class:[2]

$$\mathbb{E}_{S \sim P^m}\left[\operatorname{Rep}_P(F, S)\right] \leq 2 \operatorname{Rad}_{P,m}(F)$$
When the Rademacher complexity is small, it is possible to learn the hypothesis class $H$ using empirical risk minimization.
For example (with binary error function),[2] for every $\delta > 0$, with probability at least $1 - \delta$, for every hypothesis $h \in H$:

$$L_P(h) - L_S(h) \leq 2 \operatorname{Rad}(F \circ S) + 4\sqrt{\frac{2\ln(4/\delta)}{m}}$$
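As an illustration of the scale of this bound, the following sketch evaluates its right-hand side for a given empirical Rademacher complexity, sample size, and confidence level (the helper name and the numeric values are ours, for illustration only):

    import math

    def generalization_gap_bound(rad_FS, m, delta):
        # Right-hand side of the inequality above: with probability >= 1 - delta,
        # L_P(h) - L_S(h) is at most this quantity for every h in H.
        return 2 * rad_FS + 4 * math.sqrt(2 * math.log(4 / delta) / m)

    # Illustrative numbers: Rad(F o S) = 0.05, m = 10_000 samples, delta = 0.05.
    print(generalization_gap_bound(0.05, 10_000, 0.05))  # about 0.218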
Since smaller Rademacher complexity is better, it is useful to have upper bounds on the Rademacher complexity of various function sets. The following rules can be used to upper-bound the Rademacher complexity of a set $A \subset \mathbb{R}^m$.
1. If all vectors in $A$ are translated by a constant vector $a_0 \in \mathbb{R}^m$, then $\operatorname{Rad}(A)$ does not change.

2. If all vectors in $A$ are multiplied by a scalar $c \in \mathbb{R}$, then $\operatorname{Rad}(A)$ is multiplied by $|c|$.

3. $\operatorname{Rad}(A + B) = \operatorname{Rad}(A) + \operatorname{Rad}(B)$, where $A + B := \{a + b \mid a \in A, b \in B\}$.

4. (Kakade & Tewari Lemma) If all vectors in $A$ are operated on by a Lipschitz function (applied coordinatewise), then $\operatorname{Rad}(A)$ is (at most) multiplied by the Lipschitz constant of the function.

5. The Rademacher complexity of the convex hull of $A$ equals $\operatorname{Rad}(A)$.
6. (Massart Lemma) The Rademacher complexity of a finite set grows logarithmically with the set size. Formally, let $A$ be a set of $N$ vectors in $\mathbb{R}^m$, and let $\bar{a}$ be the mean of the vectors in $A$. Then:

$$\operatorname{Rad}(A) \leq \max_{a \in A} \|a - \bar{a}\| \cdot \frac{\sqrt{2\log N}}{m}$$

In particular, if $A$ is a set of binary vectors, the norm of each vector is at most $\sqrt{m}$, so:

$$\operatorname{Rad}(A) \leq \sqrt{\frac{2\log N}{m}}$$
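Massart's lemma is easy to test numerically against the Monte Carlo estimator sketched earlier. The following Python snippet (assuming NumPy; the set $A$ and all sizes are illustrative choices) compares the bound with an estimate of $\operatorname{Rad}(A)$ for a random finite set:

    import numpy as np

    rng = np.random.default_rng(0)
    N, m = 50, 20
    A = rng.normal(size=(N, m))           # N random vectors in R^m
    a_bar = A.mean(axis=0)                # mean vector of the set

    # Massart's bound: max ||a - a_bar|| * sqrt(2 log N) / m
    massart = np.linalg.norm(A - a_bar, axis=1).max() * np.sqrt(2 * np.log(N)) / m

    # Monte Carlo estimate of Rad(A), as in the earlier sketch.
    sigma = rng.choice([-1.0, 1.0], size=(100_000, m))
    rad_estimate = (sigma @ A.T).max(axis=1).mean() / m

    print(rad_estimate, massart)          # the estimate stays below the bound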
Let $H$ be a set family whose VC dimension is $d$. It is known that the growth function of $H$ is bounded as:

for all $m > d + 1$: $\operatorname{Growth}(H, m) \leq (em/d)^d$

This means that, for every set $h$ with at most $m$ elements, $|H \cap h| \leq (em/d)^d$, where $H \cap h := \{g \cap h \mid g \in H\}$ is the restriction of $H$ to $h$. The family $H \cap h$ can be considered as a set of binary vectors in $\mathbb{R}^m$. Substituting this in Massart's lemma gives:

$$\operatorname{Rad}(H \cap h) \leq \sqrt{\frac{2d\log(em/d)}{m}}$$
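For concreteness, this bound is easy to evaluate numerically; with illustrative numbers $d = 10$ and $m = 1000$:

    import math

    # Massart-based VC bound: sqrt(2 d log(e m / d) / m)
    d, m = 10, 1000
    print(math.sqrt(2 * d * math.log(math.e * m / d) / m))  # about 0.33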
With more advanced techniques (Dudley's entropy bound and Haussler's upper bound[3]) one can show, for example, that there exists a constant $C$ such that any class of $\{0,1\}$-indicator functions with VC dimension $d$ has Rademacher complexity upper-bounded by $C\sqrt{\frac{d}{m}}$.
The following bounds are related to linear operations on $S$ – a constant set of $m$ vectors in $\mathbb{R}^n$.
1. Define $A_2 = \{(w \cdot x_1, \ldots, w \cdot x_m) \mid \|w\|_2 \leq 1\}$ – the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 2-norm. Then:

$$\operatorname{Rad}(A_2) \leq \frac{\max_i \|x_i\|_2}{\sqrt{m}}$$
2. Define $A_1 = \{(w \cdot x_1, \ldots, w \cdot x_m) \mid \|w\|_1 \leq 1\}$ – the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 1-norm. Then:

$$\operatorname{Rad}(A_1) \leq \max_i \|x_i\|_\infty \cdot \sqrt{\frac{2\log(2n)}{m}}$$
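For bound 1 the supremum has a closed form: for fixed signs $\sigma$, $\sup_{\|w\|_2 \leq 1} w \cdot \sum_i \sigma_i x_i = \|\sum_i \sigma_i x_i\|_2$. This makes the bound easy to check numerically; a sketch assuming NumPy, with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 100, 5
    X = rng.normal(size=(m, n))           # the sample S: m vectors x_i in R^n

    # The bound: max_i ||x_i||_2 / sqrt(m).
    bound = np.linalg.norm(X, axis=1).max() / np.sqrt(m)

    # Monte Carlo estimate of Rad(A_2): for fixed signs sigma, the supremum
    # over the unit ball equals the norm ||sum_i sigma_i x_i||_2.
    sigma = rng.choice([-1.0, 1.0], size=(100_000, m))
    rad_estimate = np.linalg.norm(sigma @ X, axis=1).mean() / m

    print(rad_estimate, bound)            # the estimate stays below the bound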
The following bound relates the Rademacher complexity of a set $A$ to its external covering number – the number of balls of a given radius $r$ whose union contains $A$.
Suppose $A \subset \mathbb{R}^m$ is a set of vectors whose length (norm) is at most $c$. Then, for every integer $M > 0$:

$$\operatorname{Rad}(A) \leq \frac{c \cdot 2^{-M}}{\sqrt{m}} + \frac{6c}{m} \cdot \sum_{i=1}^{M} 2^{-i} \sqrt{\log\left(N^{\text{ext}}_{c \cdot 2^{-i}}(A)\right)}$$

where $N^{\text{ext}}_r(A)$ is the external covering number of $A$ with radius $r$.
In particular, if $A$ lies in a $d$-dimensional subspace of $\mathbb{R}^m$, then:

$$\forall r > 0: \quad N^{\text{ext}}_r(A) \leq (2c\sqrt{d}/r)^d$$

Substituting this in the previous bound gives:

$$\operatorname{Rad}(A) \leq \frac{6c}{m} \cdot \left(\sqrt{d\log(2\sqrt{d})} + 2\sqrt{d}\right) = O\left(\frac{c\sqrt{d\log(d)}}{m}\right)$$
Gaussian complexity is a similar complexity with similar physical meanings, and can be obtained from the Rademacher complexity by using the random variables $g_i$ instead of $\sigma_i$, where the $g_i$ are i.i.d. Gaussian random variables with zero mean and variance 1, i.e. $g_i \sim \mathcal{N}(0, 1)$. Gaussian and Rademacher complexities are known to be equivalent up to logarithmic factors. Specifically, given a set $A \subseteq \mathbb{R}^n$:

$$\frac{G(A)}{2\sqrt{\log n}} \leq \operatorname{Rad}(A) \leq \sqrt{\frac{\pi}{2}} \cdot G(A)$$

where $G(A)$ is the Gaussian complexity of $A$. As an example, consider the Rademacher and Gaussian complexities of the $\ell_1$ unit ball: its Rademacher complexity is exactly $1/m$, whereas its Gaussian complexity is on the order of $\sqrt{\log d}/m$, where $d$ is the dimension.
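The Gaussian complexity of a finite set can be estimated with the same Monte Carlo scheme as the Rademacher complexity, replacing the random signs with standard normal draws; a minimal sketch assuming NumPy (the helper name gaussian_complexity is ours):

    import numpy as np

    def gaussian_complexity(A, num_draws=100_000, seed=0):
        # Same Monte Carlo scheme as for Rad(A), with the signs sigma_i
        # replaced by standard normal draws g_i ~ N(0, 1).
        A = np.asarray(A, dtype=float)
        m = A.shape[1]
        rng = np.random.default_rng(seed)
        g = rng.standard_normal(size=(num_draws, m))
        return (g @ A.T).max(axis=1).mean() / m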