In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power on a test set.
Occam learnability implies PAC learnability, and for a wide variety of concept classes, the converse is also true: PAC learnability implies Occam learnability.
Occam learning is named after Occam's razor, the principle that, all other things being equal, a shorter explanation for observed data should be favored over a lengthier explanation. The theory of Occam learning is a formal and mathematical justification for this principle. It was first shown by Blumer, et al. that Occam learning implies PAC learning, which is the standard model of learning in computational learning theory. In other words, parsimony (of the output hypothesis) implies predictive power.
The succinctness of a concept $c$ in concept class $\mathcal{C}$ can be expressed by the length $size(c)$ of the shortest bit string that can represent $c$ in $\mathcal{C}$. Occam learning connects the succinctness of a learning algorithm's output to its predictive power on unseen data.
Let $\mathcal{C}$ and $\mathcal{H}$ be concept classes containing target concepts and hypotheses respectively. Then, for constants $\alpha \ge 0$ and $0 \le \beta < 1$, a learning algorithm $L$ is an $(\alpha,\beta)$-Occam algorithm for $\mathcal{C}$ using $\mathcal{H}$ if, given a set $S = \{x_1, \dots, x_m\}$ of $m$ samples labeled according to a concept $c \in \mathcal{C}$, $L$ outputs a hypothesis $h \in \mathcal{H}$ such that
1. $h$ is consistent with $c$ on $S$, that is, $h(x) = c(x)$ for all $x \in S$, and
2. $size(h) \le (n \cdot size(c))^{\alpha} m^{\beta}$,
where $n$ is the maximum length of any sample $x \in S$. An Occam algorithm is called efficient if it runs in time polynomial in $n$, $m$, and $size(c)$. A concept class $\mathcal{C}$ is Occam learnable with respect to a hypothesis class $\mathcal{H}$ if there exists an efficient Occam algorithm for $\mathcal{C}$ using $\mathcal{H}$.
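For concreteness, the size condition in the definition can be phrased as a simple check. The helper below is a minimal sketch, not from the source; the function name and arguments are hypothetical, and sizes are assumed to be measured in bits.

```python
def satisfies_occam_bound(size_h: int, n: int, size_c: int, m: int,
                          alpha: float, beta: float) -> bool:
    """Check the Occam size condition size(h) <= (n * size(c))^alpha * m^beta.

    size_h      -- bit length of the hypothesis output by the learner
    n           -- maximum bit length of any sample in S
    size_c      -- bit length of the shortest representation of the target concept
    m           -- number of training samples
    alpha, beta -- Occam parameters, with alpha >= 0 and 0 <= beta < 1
    """
    return size_h <= (n * size_c) ** alpha * m ** beta
```

Note that an $(\alpha,\beta)$-Occam algorithm must also output a hypothesis consistent with all $m$ labeled samples; the check above captures only the succinctness requirement.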
Occam learnability implies PAC learnability, as the following theorem of Blumer, et al. shows:
Theorem (Occam learning implies PAC learning): Let $L$ be an efficient $(\alpha,\beta)$-Occam algorithm for $\mathcal{C}$ using $\mathcal{H}$. Then there exists a constant $a > 0$ such that for any $0 < \epsilon, \delta < 1$ and any distribution $\mathcal{D}$, given $m \ge a\left(\frac{1}{\epsilon}\log\frac{1}{\delta} + \left(\frac{(n \cdot size(c))^{\alpha}}{\epsilon}\right)^{\frac{1}{1-\beta}}\right)$ samples drawn from $\mathcal{D}$ and labeled according to a concept $c \in \mathcal{C}$ of length $n$ bits each, the algorithm $L$ will output a hypothesis $h \in \mathcal{H}$ such that $error(h) \le \epsilon$ with probability at least $1 - \delta$.
Here, $error(h)$ is the error of $h$ with respect to the concept $c$ and the distribution $\mathcal{D}$, that is, the probability that $h$ and $c$ disagree on an example drawn from $\mathcal{D}$. In other words, $L$ is also a PAC learning algorithm for the concept class $\mathcal{C}$ using hypothesis class $\mathcal{H}$.
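The sample bound in the theorem is easy to evaluate numerically. The sketch below is illustrative only: the constant $a$ is left unspecified by the theorem, so the default value here is a placeholder, and the function name is hypothetical.

```python
import math

def occam_sample_bound(epsilon: float, delta: float, n: int, size_c: int,
                       alpha: float, beta: float, a: float = 1.0) -> float:
    """Evaluate a * ( (1/eps) * log(1/delta) + ((n*size(c))^alpha / eps)^(1/(1-beta)) ),
    a sample size sufficient for PAC learning via an (alpha, beta)-Occam algorithm.
    The constant `a` comes from the theorem and is not specified there."""
    term = ((n * size_c) ** alpha / epsilon) ** (1.0 / (1.0 - beta))
    return a * ((1.0 / epsilon) * math.log(1.0 / delta) + term)

# Example (illustrative values): samples of 100 bits, size(c) = 20 bits
# print(occam_sample_bound(0.1, 0.05, n=100, size_c=20, alpha=1.0, beta=0.5))
```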
A slightly more general formulation, in terms of the cardinality of the hypothesis class, is the following:
Theorem (Occam learning implies PAC learning, cardinality version): Let $0 < \epsilon, \delta < 1$. Let $L$ be an algorithm such that, given $m$ samples drawn from a fixed but unknown distribution $\mathcal{D}$ and labeled according to a concept $c \in \mathcal{C}$ of length $n$ bits each, it outputs a hypothesis $h \in \mathcal{H}_{n,m}$ that is consistent with the labeled samples. Then there exists a constant $b$ such that if $\log|\mathcal{H}_{n,m}| \le b\epsilon m - \log\frac{1}{\delta}$, then $L$ is guaranteed to output a hypothesis $h \in \mathcal{H}_{n,m}$ such that $error(h) \le \epsilon$ with probability at least $1 - \delta$. (Here, $\mathcal{H}_{n,m}$ denotes the class of hypotheses that $L$ may output when the samples have length $n$ and the sample size is $m$.)
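The condition of the cardinality version can likewise be checked directly once $|\mathcal{H}_{n,m}|$ is known or bounded. The snippet below is a sketch under the assumptions that logarithms are natural and that the theorem's unspecified constant $b$ is supplied by the caller; the function name is hypothetical.

```python
import math

def cardinality_condition(log_H: float, epsilon: float, delta: float,
                          m: int, b: float = 1.0) -> bool:
    """Check log|H_{n,m}| <= b * epsilon * m - log(1/delta).

    log_H -- (an upper bound on) the log of the number of hypotheses the
             algorithm can output on length-n samples and sample size m
    b     -- the theorem's unspecified constant; 1.0 is a placeholder
    """
    return log_H <= b * epsilon * m - math.log(1.0 / delta)
```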
While the above theorems show that Occam learning is sufficient for PAC learning, they say nothing about necessity. Board and Pitt showed that, for a wide variety of concept classes, Occam learning is in fact necessary for PAC learning.[3] They proved that for any concept class that is polynomially closed under exception lists, PAC learnability implies the existence of an Occam algorithm for that concept class. Concept classes that are polynomially closed under exception lists include Boolean formulas, circuits, deterministic finite automata, decision lists, decision trees, and other geometrically defined concept classes.
A concept class $\mathcal{C}$ is polynomially closed under exception lists if there exists a polynomial-time algorithm $A$ such that, when given the representation of a concept $c \in \mathcal{C}$ and a finite list $E$ of exceptions, it outputs a representation of a concept $c' \in \mathcal{C}$ such that $c$ and $c'$ agree except on the set $E$.
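As a toy illustration of the exception-list operation (not from the source): concepts are modeled here as Boolean functions rather than as representations, so this sketch does not capture the representation-size and polynomial-time requirements of the actual closure property.

```python
def with_exceptions(c, E):
    """Return a function c' that agrees with the concept c on every input
    except those in the finite exception list E, where its label is flipped."""
    exceptions = frozenset(E)
    def c_prime(x):
        return (not c(x)) if x in exceptions else c(x)
    return c_prime

# Example: flip the label of the all-ones input for a parity concept
# parity = lambda x: sum(x) % 2 == 1
# parity_with_exception = with_exceptions(parity, [(1, 1, 1)])
```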
We now sketch the proof of the two theorems above, beginning with the cardinality version. Call a hypothesis $h \in \mathcal{H}$ bad if $error(h) \ge \epsilon$, where again $error(h)$ is measured with respect to the true concept $c$ and the underlying distribution $\mathcal{D}$. The probability that a set of $m$ independently drawn samples $S$ is consistent with a fixed bad hypothesis $h$ is at most $(1-\epsilon)^m$. By the union bound, the probability that some bad hypothesis in $\mathcal{H}_{n,m}$ is consistent with $S$ is at most $|\mathcal{H}_{n,m}|(1-\epsilon)^m$, which is less than $\delta$ if $\log|\mathcal{H}_{n,m}| \le O(\epsilon m) - \log\frac{1}{\delta}$. This proves the cardinality version.
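One way to make the hidden constant explicit (a sketch, not taken verbatim from the source) is to use natural logarithms together with the inequality $1 - \epsilon \le e^{-\epsilon}$, which yields the threshold with $b = 1$:

$$
\Pr[\text{some bad } h \in \mathcal{H}_{n,m} \text{ is consistent with } S]
\;\le\; |\mathcal{H}_{n,m}|\,(1-\epsilon)^m
\;\le\; |\mathcal{H}_{n,m}|\,e^{-\epsilon m}
\;\le\; \delta
\quad\text{whenever}\quad
\ln|\mathcal{H}_{n,m}| \;\le\; \epsilon m - \ln\frac{1}{\delta}.
$$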
Using the cardinality version, we can now prove the first theorem. Since $L$ is an $(\alpha,\beta)$-Occam algorithm, every hypothesis it outputs can be represented with at most $(n \cdot size(c))^{\alpha} m^{\beta}$ bits, so $\log|\mathcal{H}_{n,m}| \le (n \cdot size(c))^{\alpha} m^{\beta}$. This is at most $O(\epsilon m) - \log\frac{1}{\delta}$ whenever $m \ge a\left(\frac{1}{\epsilon}\log\frac{1}{\delta} + \left(\frac{(n \cdot size(c))^{\alpha}}{\epsilon}\right)^{\frac{1}{1-\beta}}\right)$ for some constant $a > 0$. By the cardinality version, $L$ then outputs a consistent hypothesis $h$ with $error(h) \le \epsilon$ with probability at least $1 - \delta$, which proves the first theorem.
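A quick numeric sanity check (with illustrative values and placeholder constants $a$ and $b$, none of which come from the source) shows the Occam length bound falling below the cardinality threshold once $m$ is chosen as in the first theorem:

```python
import math

# Illustrative parameters (not from the source)
n, size_c, alpha, beta = 100, 20, 1.0, 0.5
epsilon, delta = 0.1, 0.05
a, b = 4.0, 1.0          # placeholder values for the theorems' constants

m = a * ((1 / epsilon) * math.log(1 / delta)
         + ((n * size_c) ** alpha / epsilon) ** (1 / (1 - beta)))
occam_bits = (n * size_c) ** alpha * m ** beta      # bound on log|H_{n,m}|
threshold = b * epsilon * m - math.log(1 / delta)   # cardinality condition

print(f"m = {m:.0f}, Occam bound = {occam_bits:.0f}, threshold = {threshold:.0f}")
print("cardinality condition holds:", occam_bits <= threshold)
```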
Though Occam learnability and PAC learnability are equivalent for such concept classes, the Occam framework can be used to produce tighter bounds on the sample complexity of classical problems including conjunctions, conjunctions with few relevant variables,[4] and decision lists.[5]
Occam algorithms have also been shown to be successful for PAC learning in the presence of errors,[6] [7] probabilistic concepts,[8] function learning[9] and Markovian non-independent examples.[10]