Vapnik–Chervonenkis theory (also known as VC theory) was developed during 1960–1990 by Vladimir Vapnik and Alexey Chervonenkis. The theory is a form of computational learning theory, which attempts to explain the learning process from a statistical point of view.
VC theory covers at least four parts (as explained in The Nature of Statistical Learning Theory): the theory of consistency of learning processes, the nonasymptotic theory of the rate of convergence of learning processes, the theory of controlling the generalization ability of learning processes, and the theory of constructing learning machines.
VC theory is a major sub-branch of statistical learning theory. One of its main applications in statistical learning theory is to provide generalization conditions for learning algorithms. From this point of view, VC theory is related to stability, which is an alternative approach for characterizing generalization.
In addition, VC theory and the VC dimension are instrumental in the theory of empirical processes, in the case of processes indexed by VC classes. Arguably these are the most important applications of VC theory, and they are employed in proving generalization bounds. Several techniques will be introduced that are widely used in empirical process theory and VC theory. The discussion is mainly based on the book Weak Convergence and Empirical Processes: With Applications to Statistics.
Let $(\mathcal{X},\mathcal{A})$ be a measurable space. For any measure $Q$ on $(\mathcal{X},\mathcal{A})$ and any measurable function $f:\mathcal{X}\to\mathbb{R}$, define
$$Qf=\int f\,dQ.$$
Measurability issues will be ignored here; for more technical detail see the references. Let $\mathcal{F}$ be a class of measurable functions $f:\mathcal{X}\to\mathbb{R}$ and define
$$\|Q\|_{\mathcal{F}}=\sup\{|Qf|:f\in\mathcal{F}\}.$$
Let $X_1,\ldots,X_n$ be random elements of $(\mathcal{X},\mathcal{A})$. Define the empirical measure
$$\mathbb{P}_n=n^{-1}\sum_{i=1}^n\delta_{X_i},$$
where $\delta_x$ here stands for the Dirac measure at $x$. The empirical measure induces a map $\mathcal{F}\to\mathbb{R}$ given by
$$f\mapsto\mathbb{P}_n f=\frac{1}{n}\bigl(f(X_1)+\cdots+f(X_n)\bigr).$$
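For concreteness, here is a minimal Python sketch of this map (the standard-normal sample and the test function $f(x)=x^2$ are arbitrary choices made only for illustration, not part of the text):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal(1000)          # sample X_1, ..., X_n from some distribution P

def P_n(f, sample):
    """Empirical measure applied to f:  P_n f = (f(X_1) + ... + f(X_n)) / n."""
    return np.mean(f(sample))

print(P_n(np.square, X))   # P_n f for f(x) = x^2; close to Pf = E[X^2] = 1 here
```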
Now suppose $P$ is the underlying true distribution of the data, which is unknown. Empirical Processes theory aims at identifying classes $\mathcal{F}$ for which statements such as the following hold:

uniform law of large numbers:
$$\|\mathbb{P}_n-P\|_{\mathcal{F}}\to 0.$$
That is, as $n\to\infty$,
$$\left|\frac{1}{n}\bigl(f(X_1)+\cdots+f(X_n)\bigr)-\int f\,dP\right|\to 0$$
uniformly for all $f\in\mathcal{F}$;

uniform central limit theorem:
$$\mathbb{G}_n=\sqrt{n}(\mathbb{P}_n-P)\rightsquigarrow\mathbb{G},\quad\text{in }\ell^\infty(\mathcal{F}).$$

In the former case $\mathcal{F}$ is called a Glivenko–Cantelli class, and in the latter case (under the assumption $\forall x,\ \sup_{f\in\mathcal{F}}|f(x)-Pf|<\infty$) the class $\mathcal{F}$ is called Donsker, or $P$-Donsker.

These statements are true for a single $f$ by standard law-of-large-numbers and central-limit-theorem arguments under regularity conditions; the difficulty in empirical processes comes from the fact that joint statements are being made for all $f\in\mathcal{F}$. Intuitively, then, the set $\mathcal{F}$ cannot be too large, and it turns out that the geometry of $\mathcal{F}$ plays a very important role.
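A classical instance of a Glivenko–Cantelli class is the set of indicators of half-lines, $\mathcal{F}=\{1_{(-\infty,t]}:t\in\mathbb{R}\}$, for which $\|\mathbb{P}_n-P\|_{\mathcal{F}}=\sup_x|F_n(x)-F(x)|$ is the Kolmogorov–Smirnov statistic. The sketch below (an illustration only; the Uniform(0,1) distribution and the sample sizes are arbitrary assumptions) estimates this supremum and shows it shrinking as $n$ grows, in line with the uniform law of large numbers:

```python
import numpy as np

rng = np.random.default_rng(0)

def sup_deviation(n):
    """Kolmogorov-Smirnov-type statistic sup_x |F_n(x) - F(x)| for a Uniform(0,1)
    sample, where F(x) = x; the supremum is attained at the sample points."""
    x = np.sort(rng.uniform(0, 1, n))
    upper = np.arange(1, n + 1) / n - x   # F_n(x_i) - F(x_i)
    lower = x - np.arange(0, n) / n       # F(x_i) - F_n(x_i^-)
    return max(upper.max(), lower.max())

for n in [100, 1_000, 10_000, 100_000]:
    print(n, sup_deviation(n))
```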
One way of measuring how big the function set $\mathcal{F}$ is, is to use the so-called covering numbers. The covering number
$$N(\varepsilon,\mathcal{F},\|\cdot\|)$$
is the minimal number of balls $\{g:\|g-f\|<\varepsilon\}$ needed to cover the set $\mathcal{F}$ (here it is assumed that there is an underlying norm on $\mathcal{F}$). The entropy is the logarithm of the covering number.
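To make the definition concrete, here is a small brute-force sketch (assumptions for illustration only: a finite family of threshold indicators evaluated on a fixed sample, the empirical $L_1$ norm, ball centers restricted to the family itself, and a greedy cover, so the result is an upper bound on the covering number rather than its exact value):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 1, 200))      # sample points carrying the empirical measure

# Finite family F: threshold functions f_t(x) = 1{x <= t}, evaluated on the sample.
thresholds = np.linspace(0, 1, 101)
F = (x[None, :] <= thresholds[:, None]).astype(float)   # shape (|F|, n)

def covering_number(F, eps):
    """Greedy upper bound on N(eps, F, L1(P_n)): pick uncovered functions as centers."""
    remaining = list(range(len(F)))
    balls = 0
    while remaining:
        center = F[remaining[0]]
        dist = np.abs(F[remaining] - center).mean(axis=1)   # empirical L1 distances
        remaining = [r for r, d in zip(remaining, dist) if d >= eps]
        balls += 1
    return balls

for eps in [0.2, 0.1, 0.05]:
    print(eps, covering_number(F, eps))
```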
Two sufficient conditions are provided below, under which it can be proved that the set $\mathcal{F}$ is Glivenko–Cantelli or Donsker.

A class $\mathcal{F}$ is $P$-Glivenko–Cantelli if it is $P$-measurable with envelope $F$ such that $P^{*}F<\infty$ and satisfies
$$\forall\varepsilon>0\quad\sup_{Q}N\bigl(\varepsilon\|F\|_Q,\mathcal{F},L_1(Q)\bigr)<\infty.$$

The next condition is a version of the celebrated Dudley's theorem. If $\mathcal{F}$ is a class of functions such that
$$\int_0^\infty\sup_{Q}\sqrt{\log N\left(\varepsilon\|F\|_{Q,2},\mathcal{F},L_2(Q)\right)}\,d\varepsilon<\infty,$$
then $\mathcal{F}$ is $P$-Donsker for every probability measure $P$ such that $P^{*}F^{2}<\infty$. In the last integral, the notation means
$$\|f\|_{Q,2}=\left(\int|f|^{2}\,dQ\right)^{1/2}.$$
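As a worked check of this condition (not part of the original statement; the polynomial covering bound is an assumption that anticipates the VC results discussed below): suppose $\sup_Q N(\varepsilon\|F\|_{Q,2},\mathcal{F},L_2(Q))\leq C\varepsilon^{-V}$ for all $0<\varepsilon<1$ and some constants $C\geq 1$, $V>0$, while a single ball of radius $\|F\|_{Q,2}$ covers $\mathcal{F}$ for $\varepsilon\geq1$, so the integrand vanishes there. Then
$$\int_0^\infty\sup_{Q}\sqrt{\log N\left(\varepsilon\|F\|_{Q,2},\mathcal{F},L_2(Q)\right)}\,d\varepsilon\leq\int_0^1\sqrt{\log C+V\log(1/\varepsilon)}\,d\varepsilon\leq\sqrt{\log C}+\sqrt{V}\int_0^1\sqrt{\log(1/\varepsilon)}\,d\varepsilon<\infty,$$
using $\sqrt{a+b}\leq\sqrt{a}+\sqrt{b}$ and $\int_0^1\sqrt{\log(1/\varepsilon)}\,d\varepsilon=\sqrt{\pi}/2$; so classes with polynomially growing uniform covering numbers satisfy the entropy condition.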
The majority of the arguments of how to bound the empirical process rely on symmetrization, maximal and concentration inequalities, and chaining. Symmetrization is usually the first step of the proofs, and since it is used in many machine learning proofs on bounding empirical loss functions (including the proof of the VC inequality which is discussed in the next section) it is presented here.
Consider the empirical process
$$f\mapsto(\mathbb{P}_n-P)f=\dfrac{1}{n}\sum_{i=1}^n\bigl(f(X_i)-Pf\bigr).$$
It turns out that there is a connection between the empirical process and the following symmetrized process:
$$f\mapsto\mathbb{P}^0_n f=\dfrac{1}{n}\sum_{i=1}^n\varepsilon_i f(X_i),$$
where $\varepsilon_1,\ldots,\varepsilon_n$ are independent Rademacher random variables (taking the values $\pm1$ with probability $1/2$ each), independent of the data. The symmetrized process is a Rademacher process, conditionally on the data $X_i$.
Lemma (Symmetrization). For every nondecreasing, convex $\Phi:\mathbb{R}\to\mathbb{R}$ and class of measurable functions $\mathcal{F}$,
$$\mathbb{E}\,\Phi\bigl(\|\mathbb{P}_n-P\|_{\mathcal{F}}\bigr)\leq\mathbb{E}\,\Phi\bigl(2\bigl\|\mathbb{P}^0_n\bigr\|_{\mathcal{F}}\bigr).$$
The proof of the Symmetrization lemma relies on introducing independent copies of the original variables $X_i$ (sometimes referred to as a ghost sample) and replacing the inner expectation of the left-hand side by these copies. After an application of Jensen's inequality, different signs can be introduced (hence the name symmetrization) without changing the expectation.
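Here is a minimal Monte Carlo sketch of the lemma with $\Phi$ the identity (the Uniform(0,1) data and the finite class of threshold indicators $f_t=1\{x\le t\}$, for which $Pf_t=t$, are assumptions made only for illustration); the printed left-hand side should come out below the printed right-hand side:

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 200, 2000
thresholds = np.linspace(0, 1, 51)          # class F: f_t(x) = 1{x <= t}, so Pf_t = t

lhs, rhs = [], []
for _ in range(reps):
    x = rng.uniform(0, 1, n)
    eps = rng.choice([-1.0, 1.0], n)        # Rademacher signs
    fx = (x[None, :] <= thresholds[:, None])                    # f_t(X_i), shape (|F|, n)
    lhs.append(np.max(np.abs(fx.mean(axis=1) - thresholds)))    # ||P_n - P||_F
    rhs.append(np.max(np.abs((eps * fx).mean(axis=1))))         # ||P_n^0||_F

print("E ||P_n - P||_F   ~", np.mean(lhs))
print("2 E ||P_n^0||_F   ~", 2 * np.mean(rhs))
```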
A typical way of proving empirical CLTs first uses symmetrization to pass from the empirical process to $\mathbb{P}^0_n$, and then argues conditionally on the data, using the fact that Rademacher processes are simple processes with nice properties.
It turns out that there is a fascinating connection between certain combinatorial properties of the set $\mathcal{F}$ and the entropy numbers. Uniform covering numbers can be controlled by the notion of Vapnik–Chervonenkis classes of sets, or shortly VC sets.
Consider a collection $\mathcal{C}$ of subsets of the sample space $\mathcal{X}$. The collection $\mathcal{C}$ is said to pick out a certain subset $W$ of the finite set $S=\{x_1,\ldots,x_n\}\subset\mathcal{X}$ if $W=S\cap C$ for some $C\in\mathcal{C}$. $\mathcal{C}$ is said to shatter $S$ if it picks out each of its $2^n$ subsets. The VC-index $V(\mathcal{C})$ of $\mathcal{C}$ (similar to the VC dimension plus 1 for an appropriately chosen classifier set) is the smallest $n$ for which no set of size $n$ is shattered by $\mathcal{C}$.
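As a small illustration of these definitions, the sketch below (the class of closed intervals on the real line is an assumed example, not drawn from the text) checks by brute force which point sets intervals can shatter:

```python
import numpy as np

def picked_out_subsets(points):
    """All subsets of `points` picked out by closed intervals [a, b] on the real line."""
    pts = np.sort(np.asarray(points, dtype=float))
    # Candidate endpoints: one point below the sample, the midpoints, one point above.
    cuts = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    return {tuple((a <= pts) & (pts <= b)) for a in cuts for b in cuts}

def is_shattered(points):
    return len(picked_out_subsets(points)) == 2 ** len(points)

print(is_shattered([0.1, 0.7]))        # True: intervals shatter any 2 distinct points
print(is_shattered([0.1, 0.4, 0.7]))   # False: {x1, x3} without x2 cannot be picked out
# Hence the VC-index of the class of intervals is V(C) = 3 (VC dimension 2).
```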
Sauer's lemma then states that the number $\Delta_n(\mathcal{C},x_1,\ldots,x_n)$ of subsets picked out by a VC-class $\mathcal{C}$ satisfies
$$\max_{x_1,\ldots,x_n}\Delta_n(\mathcal{C},x_1,\ldots,x_n)\leq\sum_{j=0}^{V(\mathcal{C})-1}{n\choose j}\leq\left(\frac{ne}{V(\mathcal{C})-1}\right)^{V(\mathcal{C})-1}.$$
This is a polynomial number of subsets, $O\bigl(n^{V(\mathcal{C})-1}\bigr)$, rather than an exponential number. Intuitively this means that a finite VC-index implies that $\mathcal{C}$ has an apparently simple structure.
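A quick numerical check of Sauer's lemma (again a sketch using the assumed class of intervals, for which $V(\mathcal{C})=3$, and random points); on these examples the number of picked-out subsets happens to meet the binomial-sum bound exactly:

```python
import numpy as np
from math import comb

rng = np.random.default_rng(3)
V = 3                                    # VC-index of the class of intervals [a, b]

for n in [4, 6, 8, 10]:
    x = np.sort(rng.uniform(0, 1, n))
    cuts = np.concatenate(([x[0] - 1], (x[:-1] + x[1:]) / 2, [x[-1] + 1]))
    picked = {tuple((a <= x) & (x <= b)) for a in cuts for b in cuts}
    sauer = sum(comb(n, j) for j in range(V))
    print(f"n={n}: picked out {len(picked)} subsets, Sauer bound {sauer}")
```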
A similar bound can be shown (with a different constant, same rate) for the so-called VC subgraph classes. For a function $f:\mathcal{X}\to\mathbb{R}$, the subgraph is the subset of $\mathcal{X}\times\mathbb{R}$ given by $\{(x,t):t<f(x)\}$. A collection $\mathcal{F}$ is called a VC subgraph class if all subgraphs form a VC-class of sets.
Consider a set of indicator functions $\mathcal{I}_{\mathcal{C}}=\{1_C:C\in\mathcal{C}\}$ in $L_1(Q)$ on a discrete, empirical type of measure $Q$ (or equivalently, for any probability measure $Q$). It can then be shown that, quite remarkably, for $r\geq1$,
$$N\bigl(\varepsilon,\mathcal{I}_{\mathcal{C}},L_r(Q)\bigr)\leq KV(\mathcal{C})(4e)^{V(\mathcal{C})}\varepsilon^{-r(V(\mathcal{C})-1)}.$$
Further consider the symmetric convex hull of a set $\mathcal{F}$: $\operatorname{sconv}\mathcal{F}$, the collection of functions of the form $\sum_{i=1}^m\alpha_i f_i$ with $f_i\in\mathcal{F}$ and $\sum_{i=1}^m|\alpha_i|\leq1$. Then if
$$N\left(\varepsilon\|F\|_{Q,2},\mathcal{F},L_2(Q)\right)\leq C\varepsilon^{-V},$$
the following is valid for the convex hull of $\mathcal{F}$:
$$\log N\left(\varepsilon\|F\|_{Q,2},\operatorname{sconv}\mathcal{F},L_2(Q)\right)\leq K\varepsilon^{-\frac{2V}{V+2}}.$$
The important consequence of this fact is that the exponent satisfies
$$\frac{2V}{V+2}<2,$$
which is just enough so that the entropy integral is going to converge, and therefore the class $\operatorname{sconv}\mathcal{F}$ is going to be $P$-Donsker.
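Spelled out (a one-line check using the bound above, with the same constant $K$):
$$\int_0^1\sqrt{K\varepsilon^{-\frac{2V}{V+2}}}\,d\varepsilon=\sqrt{K}\int_0^1\varepsilon^{-\frac{V}{V+2}}\,d\varepsilon=\sqrt{K}\,\frac{V+2}{2}<\infty,$$
since the exponent $V/(V+2)$ is strictly less than 1; an exponent of 2 or larger in the entropy bound would instead make the integral diverge near 0.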
Finally, an example of a VC-subgraph class is considered. Any finite-dimensional vector space $\mathcal{F}$ of measurable functions $f:\mathcal{X}\to\mathbb{R}$ is a VC-subgraph class of index smaller than or equal to $\dim(\mathcal{F})+2$.

Proof: Take $n=\dim(\mathcal{F})+2$ points $(x_1,t_1),\ldots,(x_n,t_n)$. The vectors
$$(f(x_1),\ldots,f(x_n))-(t_1,\ldots,t_n),\qquad f\in\mathcal{F},$$
lie in a subspace of $\mathbb{R}^n$ of dimension at most $n-1$, so there exists a vector $a\neq0$ orthogonal to all of them (without loss of generality $a$ has at least one strictly positive coordinate; otherwise replace $a$ by $-a$). Splitting its coordinates by sign gives
$$\sum_{a_i>0}a_i\bigl(f(x_i)-t_i\bigr)=\sum_{a_i<0}(-a_i)\bigl(f(x_i)-t_i\bigr),\qquad\forall f\in\mathcal{F}.$$
Consider the set $S=\{(x_i,t_i):a_i>0\}$. This set cannot be picked out by a subgraph: if there were some $f$ with $S=\{(x_i,t_i):f(x_i)>t_i\}$, the left-hand side would be strictly positive while the right-hand side would be non-positive, a contradiction. Hence no set of $n$ points is shattered by the subgraphs, which proves the claim.
There are generalizations of the notion of a VC subgraph class, e.g. there is the notion of pseudo-dimension.
A similar setting is considered, which is more common to machine learning. Let $\mathcal{X}$ be a feature space and $\mathcal{Y}=\{0,1\}$. A function $f:\mathcal{X}\to\mathcal{Y}$ is called a classifier, and let $\mathcal{F}$ be a set of classifiers. Similarly to the previous section, define the shattering coefficient (also known as the growth function):
$$S(\mathcal{F},n)=\max_{x_1,\ldots,x_n}\bigl|\{(f(x_1),\ldots,f(x_n)):f\in\mathcal{F}\}\bigr|.$$
Note here that there is a one-to-one correspondence between each of the functions in $\mathcal{F}$ and the set on which that function equals 1. We can therefore define $\mathcal{C}$ to be the collection of subsets obtained from the above mapping for every $f\in\mathcal{F}$. Therefore, in terms of the previous section, the shattering coefficient is precisely
$$\max_{x_1,\ldots,x_n}\Delta_n(\mathcal{C},x_1,\ldots,x_n).$$
This equivalence, together with Sauer's lemma, implies that $S(\mathcal{F},n)$ is going to be polynomial in $n$ for sufficiently large $n$, provided that the collection $\mathcal{C}$ has a finite VC-index.
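For a concrete instance, the sketch below (assuming the class of one-dimensional threshold classifiers $f_t(x)=1\{x\ge t\}$, an illustrative choice) computes the shattering coefficient by brute force and finds the polynomial value $n+1$:

```python
import numpy as np

rng = np.random.default_rng(4)

def shattering_coefficient(n):
    """Brute-force S(F, n) for 1-D threshold classifiers f_t(x) = 1{x >= t}.
    For this class any set of n distinct points attains the maximum, so one random
    draw suffices; thresholds at the points and above the largest point realize
    every achievable labeling."""
    x = np.sort(rng.uniform(0, 1, n))
    thresholds = np.append(x, x[-1] + 1.0)
    labelings = {tuple((x >= t).astype(int)) for t in thresholds}
    return len(labelings)

for n in [3, 5, 10, 20]:
    print(n, shattering_coefficient(n))   # prints n + 1: polynomial growth, VC dimension 1
```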
Let $D_n=\{(X_1,Y_1),\ldots,(X_n,Y_n)\}$ be an observed data set, assumed to be drawn from an unknown probability distribution $P_{XY}$. Define the expected 0/1 risk of a classifier $f$ as $R(f)=P\bigl(f(X)\neq Y\bigr)$. Of course, since $P_{XY}$ is unknown in general, one has no access to $R(f)$. However, the empirical risk, given by
$$\hat{R}_n(f)=\dfrac{1}{n}\sum_{i=1}^n\mathbb{I}\bigl(f(X_i)\neq Y_i\bigr),$$
can certainly be evaluated. Then one has the following theorem:
Theorem (VC Inequality). For binary classification and the 0/1 loss function we have the following generalization bounds:
$$P\left(\sup_{f\in\mathcal{F}}\left|\hat{R}_n(f)-R(f)\right|>\varepsilon\right)\leq 8S(\mathcal{F},n)e^{-n\varepsilon^{2}/32}$$
and
$$P\left(R(\hat{f}_n)-\inf_{f\in\mathcal{F}}R(f)>\varepsilon\right)\leq 8S(\mathcal{F},n)e^{-n\varepsilon^{2}/128},$$
where $\hat{f}_n$ denotes a minimizer of the empirical risk $\hat{R}_n$ over $\mathcal{F}$.
In words, the VC inequality is saying that as the sample size increases, provided that $\mathcal{F}$ has a finite VC dimension, the empirical 0/1 risk becomes a good proxy for the expected 0/1 risk. Note that both right-hand sides of the two inequalities will converge to 0, provided that $S(\mathcal{F},n)$ grows polynomially in $n$.
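As an illustration (a sketch only; the threshold-classifier class with $S(\mathcal{F},n)=n+1$, the tolerance $\varepsilon=0.05$, the label-noise level 0.1 and the sample sizes are all assumptions made for the example), the first bound can be evaluated numerically and compared with the empirical behaviour of $\hat{R}_n$:

```python
import numpy as np

rng = np.random.default_rng(5)

def vc_bound(n, eps, growth):
    """First VC inequality bound: 8 * S(F, n) * exp(-n * eps^2 / 32)."""
    return 8 * growth(n) * np.exp(-n * eps**2 / 32)

# 1-D threshold classifiers, for which S(F, n) = n + 1.
growth = lambda n: n + 1
eps = 0.05
for n in [10**3, 10**4, 10**5, 10**6]:
    # Bound is vacuous (> 1) for small n and decays to 0 once n*eps^2 dominates log S(F, n).
    print(n, vc_bound(n, eps, growth))

# Sanity check of |R_hat_n(f) - R(f)| for one fixed classifier f(x) = 1{x >= 0.5},
# with labels Y = 1{X >= 0.5} flipped with probability 0.1, so R(f) = 0.1.
n = 10**5
X = rng.uniform(0, 1, n)
Y = np.where(rng.uniform(0, 1, n) < 0.1, 1 - (X >= 0.5), (X >= 0.5)).astype(int)
f = (X >= 0.5).astype(int)
print("empirical risk:", np.mean(f != Y), " true risk: 0.1")
```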
The connection between this framework and the empirical process framework is evident. Here one is dealing with a modified empirical process
$$\bigl\|\hat{R}_n-R\bigr\|_{\mathcal{F}},$$
but, not surprisingly, the ideas are the same. The proof of the first part of the VC inequality relies on symmetrization, and then argues conditionally on the data using concentration inequalities (in particular Hoeffding's inequality). The interested reader can check the book, Theorems 12.4 and 12.5.