In computational learning theory (machine learning and theory of computation), Rademacher complexity, named after Hans Rademacher, measures the richness of a class of sets with respect to a probability distribution. The concept can also be extended to real-valued functions.
Given a set $A \subseteq \mathbb{R}^m$, the Rademacher complexity of $A$ is defined as follows:

$$\operatorname{Rad}(A) := \frac{1}{m} \mathbb{E}_\sigma\left[\sup_{a \in A} \sum_{i=1}^{m} \sigma_i a_i\right]$$

where $\sigma_1, \sigma_2, \dots, \sigma_m$ are independent random variables drawn from the Rademacher distribution, i.e. $\Pr(\sigma_i = +1) = \Pr(\sigma_i = -1) = 1/2$ for $i = 1, 2, \dots, m$, and $a = (a_1, \ldots, a_m)$ ranges over the elements of $A$.
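Since the definition is an expectation over random sign vectors, it can be approximated numerically for a finite set of vectors. The following is a minimal sketch in Python (assuming NumPy; the helper name rademacher_complexity is ours, not standard) that estimates $\operatorname{Rad}(A)$ by Monte Carlo sampling of the $\sigma_i$:

    import numpy as np

    def rademacher_complexity(A, num_draws=100_000, seed=0):
        # A: array of shape (k, m); each row is one vector a in the set A.
        A = np.asarray(A, dtype=float)
        m = A.shape[1]
        rng = np.random.default_rng(seed)
        # Draw random sign vectors sigma in {-1, +1}^m, one row per draw.
        sigma = rng.choice([-1.0, 1.0], size=(num_draws, m))
        # For each draw, sup over a in A of sum_i sigma_i * a_i is a row-wise max.
        sups = (sigma @ A.T).max(axis=1)
        # Average over draws and divide by m, per the definition above.
        return sups.mean() / m

For a finite set the supremum is an exact maximum over rows, so the only error is the Monte Carlo error in the expectation; for instance, rademacher_complexity([[1.0, 1.0], [1.0, 2.0]]) returns approximately 0.25, matching example 2 below.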
Let $S = \{z_1, z_2, \dots, z_m\} \subset Z$ be a sample of points and let $\mathcal{F}$ be a class of real-valued functions defined on a domain space $Z$. The empirical Rademacher complexity of $\mathcal{F}$ given $S$ is defined as:

$$\operatorname{Rad}_S(\mathcal{F}) = \frac{1}{m} \mathbb{E}_\sigma\left[\sup_{f \in \mathcal{F}} \sum_{i=1}^{m} \sigma_i f(z_i)\right]$$

This can also be written using the previous definition:[2]

$$\operatorname{Rad}_S(\mathcal{F}) = \operatorname{Rad}(\mathcal{F} \circ S)$$

where $\mathcal{F} \circ S$ denotes function composition, i.e.:

$$\mathcal{F} \circ S := \{(f(z_1), \ldots, f(z_m)) \mid f \in \mathcal{F}\}$$
Let $P$ be a probability distribution over $Z$. The Rademacher complexity of the function class $\mathcal{F}$ with respect to $P$ for sample size $m$ is:

$$\operatorname{Rad}_{P,m}(\mathcal{F}) := \mathbb{E}_{S \sim P^m}\left[\operatorname{Rad}_S(\mathcal{F})\right]$$

where the above expectation is taken over an independent and identically distributed (i.i.d.) sample $S = (z_1, z_2, \dots, z_m)$ generated according to $P$.
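As a concrete illustration, the distributional complexity can be approximated by averaging the empirical complexity over many fresh samples drawn from $P$. Below is a sketch in Python (assuming NumPy; the class of threshold functions $f_t(z) = \mathbf{1}[z \leq t]$ on $Z = [0,1]$, the helper name, and all numeric settings are our illustrative choices):

    import numpy as np

    rng = np.random.default_rng(0)
    m, num_samples, num_draws = 30, 200, 2000

    def empirical_rad_thresholds(S, rng):
        # For threshold functions f_t(z) = 1[z <= t], the set F o S is finite:
        # after sorting S it consists of the m + 1 distinct "step" vectors.
        S = np.sort(S)
        thresholds = np.r_[-np.inf, S]
        vectors = (S[None, :] <= thresholds[:, None]).astype(float)
        sigma = rng.choice([-1.0, 1.0], size=(num_draws, len(S)))
        return (sigma @ vectors.T).max(axis=1).mean() / len(S)

    # Rad_{P,m}(F): average Rad_S(F) over samples S ~ P^m, with P = Uniform[0,1].
    estimates = [empirical_rad_thresholds(rng.uniform(size=m), rng)
                 for _ in range(num_samples)]
    print(np.mean(estimates))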
The Rademacher complexity is typically applied to a function class of models that are used for classification, with the goal of measuring their ability to classify points drawn from a probability space under arbitrary labellings. When the function class is rich enough, it contains functions that can appropriately adapt to each arrangement of labels, simulated by the random draw of $\sigma_i$.
1. $A$ contains a single vector, e.g., $A = \{(a, b)\} \subset \mathbb{R}^2$. Then:

$$\operatorname{Rad}(A) = \frac{1}{2} \cdot \left(\frac{1}{4} \cdot (a+b) + \frac{1}{4} \cdot (a-b) + \frac{1}{4} \cdot (-a+b) + \frac{1}{4} \cdot (-a-b)\right) = 0$$

The same computation shows that the Rademacher complexity of any singleton set is zero.
2. $A$ contains two vectors, e.g., $A = \{(1, 1), (1, 2)\} \subset \mathbb{R}^2$. Then:

$$\begin{align} \operatorname{Rad}(A) &= \frac{1}{2} \cdot \left(\frac{1}{4} \cdot \max(1+1, 1+2) + \frac{1}{4} \cdot \max(1-1, 1-2) + \frac{1}{4} \cdot \max(-1+1, -1+2) + \frac{1}{4} \cdot \max(-1-1, -1-2)\right)\\ &= \frac{1}{8}(3 + 0 + 1 - 2) = \frac{1}{4} \end{align}$$
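These small examples can also be checked exactly, without Monte Carlo error, by enumerating all $2^m$ sign vectors. A short Python check of example 2:

    from itertools import product

    import numpy as np

    A = np.array([[1, 1], [1, 2]])
    m = A.shape[1]
    # Each of the 2^m sign vectors has probability 2^-m.
    total = sum(max(np.dot(sigma, a) for a in A)
                for sigma in product([-1, 1], repeat=m))
    rad = total / 2 ** m / m
    print(rad)  # 0.25, matching the hand computation above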
The Rademacher complexity can be used to derive data-dependent upper bounds on the learnability of function classes. Intuitively, a function class with smaller Rademacher complexity is easier to learn.
In machine learning, it is desired to have a training set that represents the true distribution of some sample data $S$. This can be quantified using the notion of representativeness. Denote by $P$ the probability distribution from which the samples are drawn. Denote by $H$ the set of hypotheses (potential classifiers) and by $F$ the corresponding set of error functions, i.e., for every hypothesis $h \in H$, there is a function $f_h \in F$ that maps each training sample (features, label) to the error of the classifier $h$ on that sample. For example, if $h$ is a binary classifier, $f_h$ returns 0 on a sample that $h$ classifies correctly and 1 otherwise. We write $f$ instead of $f_h$ when the underlying hypothesis is irrelevant. Define:

$$L_P(f) := \mathbb{E}_{z \sim P}[f(z)]$$ – the expected error of an error function $f \in F$ on the real distribution $P$;

$$L_S(f) := \frac{1}{m} \sum_{i=1}^{m} f(z_i)$$ – the estimated error of an error function $f \in F$ on the sample $S$.

The representativeness of the sample $S$, with respect to $P$ and $F$, is defined as:

$$\operatorname{Rep}_P(F, S) := \sup_{f \in F} (L_P(f) - L_S(f))$$
Smaller representativeness is better, since it provides a way to avoid overfitting: it means that the true error of a classifier is not much higher than its estimated error, and so selecting a classifier that has low estimated error will ensure that the true error is also low. Note, however, that the concept of representativeness is relative and hence cannot be compared across distinct samples.
The expected representativeness of a sample can be bounded above by the Rademacher complexity of the function class:[2]

$$\mathbb{E}_{S \sim P^m}\left[\operatorname{Rep}_P(F, S)\right] \leq 2 \operatorname{Rad}_{P,m}(F)$$
When the Rademacher complexity is small, it is possible to learn the hypothesis class $H$ using empirical risk minimization.
For example (with binary error function),[2] for every $\delta > 0$, with probability at least $1 - \delta$, for every hypothesis $h \in H$:

$$L_P(h) - L_S(h) \leq 2 \operatorname{Rad}(F \circ S) + 4\sqrt{\frac{2\ln(4/\delta)}{m}}$$
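As an illustration of the scale of this bound, the following sketch evaluates its right-hand side for a given empirical Rademacher complexity, sample size, and confidence level (the helper name and the numeric values are ours, for illustration only):

    import math

    def generalization_gap_bound(rad_FS, m, delta):
        # Right-hand side of the inequality above: with probability >= 1 - delta,
        # L_P(h) - L_S(h) is at most this quantity for every h in H.
        return 2 * rad_FS + 4 * math.sqrt(2 * math.log(4 / delta) / m)

    # Illustrative numbers: Rad(F o S) = 0.05, m = 10_000 samples, delta = 0.05.
    print(generalization_gap_bound(0.05, 10_000, 0.05))  # about 0.218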
Since smaller Rademacher complexity is better, it is useful to have upper bounds on the Rademacher complexity of various function sets. The following rules can be used to upper-bound the Rademacher complexity of a set $A \subset \mathbb{R}^m$.
1. If all vectors in $A$ are translated by a constant vector $a_0 \in \mathbb{R}^m$, then $\operatorname{Rad}(A)$ does not change.

2. If all vectors in $A$ are multiplied by a scalar $c \in \mathbb{R}$, then $\operatorname{Rad}(A)$ is multiplied by $|c|$.

3. $\operatorname{Rad}(A + B) = \operatorname{Rad}(A) + \operatorname{Rad}(B)$, where $A + B := \{a + b \mid a \in A, b \in B\}$.

4. (Kakade & Tewari Lemma) If all vectors in $A$ are operated on by a Lipschitz function (applied coordinatewise), then $\operatorname{Rad}(A)$ is (at most) multiplied by the Lipschitz constant of the function.

5. The Rademacher complexity of the convex hull of $A$ equals $\operatorname{Rad}(A)$.
6. (Massart Lemma) The Rademacher complexity of a finite set grows logarithmically with the set size. Formally, let $A$ be a set of $N$ vectors in $\mathbb{R}^m$, and let $\bar{a}$ be the mean of the vectors in $A$. Then:

$$\operatorname{Rad}(A) \leq \max_{a \in A} \|a - \bar{a}\| \cdot \frac{\sqrt{2\log N}}{m}$$

In particular, if $A$ is a set of binary vectors, the norm of each vector is at most $\sqrt{m}$, so:

$$\operatorname{Rad}(A) \leq \sqrt{\frac{2\log N}{m}}$$
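Massart's lemma is easy to test numerically against the Monte Carlo estimator sketched earlier. The following Python snippet (assuming NumPy; the set $A$ and all sizes are illustrative choices) compares the bound with an estimate of $\operatorname{Rad}(A)$ for a random finite set:

    import numpy as np

    rng = np.random.default_rng(0)
    N, m = 50, 20
    A = rng.normal(size=(N, m))           # N random vectors in R^m
    a_bar = A.mean(axis=0)                # mean vector of the set

    # Massart's bound: max ||a - a_bar|| * sqrt(2 log N) / m
    massart = np.linalg.norm(A - a_bar, axis=1).max() * np.sqrt(2 * np.log(N)) / m

    # Monte Carlo estimate of Rad(A), as in the earlier sketch.
    sigma = rng.choice([-1.0, 1.0], size=(100_000, m))
    rad_estimate = (sigma @ A.T).max(axis=1).mean() / m

    print(rad_estimate, massart)          # the estimate stays below the bound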
Let $H$ be a set family whose VC dimension is $d$. It is known that the growth function of $H$ is bounded as:

for all $m > d + 1$: $\operatorname{Growth}(H, m) \leq (em/d)^d$

This means that, for every set $h$ with at most $m$ elements, $|H \cap h| \leq (em/d)^d$, where $H \cap h := \{g \cap h \mid g \in H\}$ is the restriction of $H$ to $h$. The family $H \cap h$ can be considered as a set of binary vectors in $\mathbb{R}^m$. Substituting this in Massart's lemma gives:

$$\operatorname{Rad}(H \cap h) \leq \sqrt{\frac{2d\log(em/d)}{m}}$$
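For concreteness, this bound is easy to evaluate numerically; with illustrative numbers $d = 10$ and $m = 1000$:

    import math

    # Massart-based VC bound: sqrt(2 d log(e m / d) / m)
    d, m = 10, 1000
    print(math.sqrt(2 * d * math.log(math.e * m / d) / m))  # about 0.33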
With more advanced techniques (Dudley's entropy bound and Haussler's upper bound[3]) one can show, for example, that there exists a constant $C$ such that any class of $\{0,1\}$-indicator functions with VC dimension $d$ has Rademacher complexity upper-bounded by $C\sqrt{\frac{d}{m}}$.
The following bounds are related to linear operations on $S$ – a constant set of $m$ vectors in $\mathbb{R}^n$.
1. Define $A_2 = \{(w \cdot x_1, \ldots, w \cdot x_m) \mid \|w\|_2 \leq 1\}$ – the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 2-norm. Then:

$$\operatorname{Rad}(A_2) \leq \frac{\max_i \|x_i\|_2}{\sqrt{m}}$$
2. Define $A_1 = \{(w \cdot x_1, \ldots, w \cdot x_m) \mid \|w\|_1 \leq 1\}$ – the set of dot-products of the vectors in $S$ with vectors in the unit ball of the 1-norm. Then:

$$\operatorname{Rad}(A_1) \leq \max_i \|x_i\|_\infty \cdot \sqrt{\frac{2\log(2n)}{m}}$$
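For bound 1 the supremum has a closed form: for fixed signs $\sigma$, $\sup_{\|w\|_2 \leq 1} w \cdot \sum_i \sigma_i x_i = \|\sum_i \sigma_i x_i\|_2$. This makes the bound easy to check numerically; a sketch assuming NumPy, with illustrative sizes:

    import numpy as np

    rng = np.random.default_rng(0)
    m, n = 100, 5
    X = rng.normal(size=(m, n))           # the sample S: m vectors x_i in R^n

    # The bound: max_i ||x_i||_2 / sqrt(m).
    bound = np.linalg.norm(X, axis=1).max() / np.sqrt(m)

    # Monte Carlo estimate of Rad(A_2): for fixed signs sigma, the supremum
    # over the unit ball equals the norm ||sum_i sigma_i x_i||_2.
    sigma = rng.choice([-1.0, 1.0], size=(100_000, m))
    rad_estimate = np.linalg.norm(sigma @ X, axis=1).mean() / m

    print(rad_estimate, bound)            # the estimate stays below the bound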
The following bound relates the Rademacher complexity of a set $A$ to its external covering number – the number of balls of a given radius $r$ whose union contains $A$.
Suppose $A \subset \mathbb{R}^m$ is a set of vectors whose length (norm) is at most $c$. Then, for every integer $M > 0$:

$$\operatorname{Rad}(A) \leq \frac{c \cdot 2^{-M}}{\sqrt{m}} + \frac{6c}{m} \cdot \sum_{i=1}^{M} 2^{-i} \sqrt{\log\left(N^{\text{ext}}_{c \cdot 2^{-i}}(A)\right)}$$

where $N^{\text{ext}}_r(A)$ is the external covering number of $A$ with radius $r$.
In particular, if $A$ lies in a $d$-dimensional subspace of $\mathbb{R}^m$, then:

$$\forall r > 0: \quad N^{\text{ext}}_r(A) \leq (2c\sqrt{d}/r)^d$$

Substituting this in the previous bound gives:

$$\operatorname{Rad}(A) \leq \frac{6c}{m} \cdot \left(\sqrt{d\log(2\sqrt{d})} + 2\sqrt{d}\right) = O\left(\frac{c\sqrt{d\log(d)}}{m}\right)$$
Gaussian complexity is a similar complexity with similar physical meanings, and can be obtained from the Rademacher complexity by using the random variables $g_i$ instead of $\sigma_i$, where the $g_i$ are i.i.d. Gaussian random variables with zero mean and variance 1, i.e. $g_i \sim \mathcal{N}(0, 1)$. Gaussian and Rademacher complexities are known to be equivalent up to logarithmic factors. Specifically, given a set $A \subseteq \mathbb{R}^n$:

$$\frac{G(A)}{2\sqrt{\log n}} \leq \operatorname{Rad}(A) \leq \sqrt{\frac{\pi}{2}} \cdot G(A)$$

where $G(A)$ is the Gaussian complexity of $A$. As an example, consider the Rademacher and Gaussian complexities of the $\ell_1$ unit ball: its Rademacher complexity is exactly $1/m$, whereas its Gaussian complexity is on the order of $\sqrt{\log d}/m$, where $d$ is the dimension.
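The Gaussian complexity of a finite set can be estimated with the same Monte Carlo scheme as the Rademacher complexity, replacing the random signs with standard normal draws; a minimal sketch assuming NumPy (the helper name gaussian_complexity is ours):

    import numpy as np

    def gaussian_complexity(A, num_draws=100_000, seed=0):
        # Same Monte Carlo scheme as for Rad(A), with the signs sigma_i
        # replaced by standard normal draws g_i ~ N(0, 1).
        A = np.asarray(A, dtype=float)
        m = A.shape[1]
        rng = np.random.default_rng(seed)
        g = rng.standard_normal(size=(num_draws, m))
        return (g @ A.T).max(axis=1).mean() / m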