Linear Programming Boosting (LPBoost) is a supervised classifier from the boosting family of classifiers. LPBoost maximizes a margin between training samples of different classes and hence also belongs to the class of margin-maximizing supervised classification algorithms. Consider a classification function
f\colon \mathcal{X} \to \{-1, 1\},
which classifies samples from a space \mathcal{X} into one of two classes, labelled 1 and -1, respectively. LPBoost is an algorithm for learning such a classification function from a set of training examples with known class labels.
As in all boosting classifiers, the final classification function is of the form
f(\boldsymbol{x}) = \sum_{j=1}^{J} \alpha_j h_j(\boldsymbol{x}),
where the \alpha_j are non-negative weights for weak classifiers h_j\colon \mathcal{X} \to \{-1, 1\}. Each individual weak classifier h_j may be only slightly better than random guessing, but the resulting linear combination of many weak classifiers can perform very well.
LPBoost constructs f by starting with an empty set of weak classifiers. Iteratively, a single weak classifier is selected and added to the set of considered weak classifiers, and all the weights \boldsymbol{\alpha} for the current set are adjusted. This is repeated until no weak classifiers to add remain.
The property that all classifier weights are adjusted in each iteration is known as the totally-corrective property. Early boosting methods, such as AdaBoost, do not have this property and converge more slowly.
More generally, let \mathcal{H} = \{h(\cdot\,;\omega) \mid \omega \in \Omega\} be the (possibly infinite) set of weak classifiers, also termed hypotheses. The primal linear program of LPBoost, optimizing over the non-negative weight vector \boldsymbol{\alpha}, the non-negative vector \boldsymbol{\xi} of slack variables and the margin \rho, is the following.
\begin{array}{cl}
\underset{\boldsymbol{\alpha},\boldsymbol{\xi},\rho}{\min} & -\rho + D \sum_{n=1}^{\ell} \xi_n \\
\text{s.t.} & \sum_{\omega} y_n \alpha_\omega h(\boldsymbol{x}_n;\omega) + \xi_n \geq \rho,\quad n=1,\dots,\ell,\\
& \sum_{\omega} \alpha_\omega = 1,\\
& \xi_n \geq 0,\quad n=1,\dots,\ell,\\
& \alpha_\omega \geq 0,\quad \omega \in \Omega,\\
& \rho \in \mathbb{R}.
\end{array}
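To make the primal concrete, here is a minimal sketch (not part of the original formulation) that solves the restricted primal for a finite set of J weak classifiers with scipy.optimize.linprog. The helper name solve_lpboost_primal, the hypothesis-output matrix H with H[n, j] = h_j(x_n) and the label vector y in {-1, +1} are illustrative assumptions.

# Minimal sketch: restricted LPBoost primal for a finite hypothesis set.
import numpy as np
from scipy.optimize import linprog

def solve_lpboost_primal(H, y, D):
    ell, J = H.shape
    # Variable vector: [alpha_1..alpha_J, xi_1..xi_ell, rho]
    c = np.concatenate([np.zeros(J), D * np.ones(ell), [-1.0]])
    # Margin constraints: -(y_n * H[n, :]) @ alpha - xi_n + rho <= 0
    A_ub = np.hstack([-(y[:, None] * H), -np.eye(ell), np.ones((ell, 1))])
    b_ub = np.zeros(ell)
    # Normalization: sum_j alpha_j = 1
    A_eq = np.concatenate([np.ones(J), np.zeros(ell), [0.0]])[None, :]
    b_eq = np.array([1.0])
    bounds = [(0, None)] * (J + ell) + [(None, None)]  # rho is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    alpha, xi, rho = res.x[:J], res.x[J:J + ell], res.x[-1]
    return alpha, xi, rho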
Note the effect of the slack variables \boldsymbol{\xi} \geq 0: their one-norm is penalized in the objective function by a constant factor D, which, if chosen small enough, always leads to a primal-feasible linear program.
Here we adopted the notation of a parameter space \Omega, such that for a choice \omega \in \Omega the weak classifier h(\cdot\,;\omega)\colon \mathcal{X} \to \{-1, 1\} is uniquely defined.
When the above linear program was first written down in early publications about boosting methods, it was disregarded as intractable due to the large number of variables \boldsymbol{\alpha}. Only later was it discovered that such linear programs can indeed be solved efficiently using the classic technique of column generation.
In a linear program a column corresponds to a primal variable. Column generation is a technique for solving large linear programs. It typically works on a restricted problem that deals only with a subset of the variables. By generating primal variables iteratively and on demand, the original unrestricted problem with all its variables is eventually recovered. By cleverly choosing the columns to generate, the problem can be solved in a way that guarantees the obtained solution is optimal for the original full problem while only a small fraction of the columns has to be created.
Columns in the primal linear program correspond to rows in the dual linear program. The equivalent dual of the LPBoost primal is the following linear program.
\begin{array}{cl}
\underset{\boldsymbol{\lambda},\gamma}{\max} & \gamma\\
\text{s.t.} & \sum_{n=1}^{\ell} y_n h(\boldsymbol{x}_n;\omega)\lambda_n + \gamma \leq 0,\quad \omega \in \Omega,\\
& 0 \leq \lambda_n \leq D,\quad n=1,\dots,\ell,\\
& \sum_{n=1}^{\ell} \lambda_n = 1,\\
& \gamma \in \mathbb{R}.
\end{array}
For linear programs the optimal values of the primal and dual problems are equal. For the above primal and dual problems, the optimal value is equal to the negative 'soft margin'. The soft margin is the size of the margin separating positive from negative training instances, minus the positive slack variables that carry penalties for margin-violating samples. Thus, the soft margin may be positive even though not all samples are linearly separated by the classification function. The latter is called the 'hard margin' or 'realized margin'.
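Analogously, here is a minimal sketch of the restricted dual for the same hypothetical finite setting; solve_lpboost_dual, H and y are again illustrative names.

# Minimal sketch: restricted LPBoost dual for a finite hypothesis set, H[n, j] = h_j(x_n).
import numpy as np
from scipy.optimize import linprog

def solve_lpboost_dual(H, y, D):
    ell, J = H.shape
    # Variable vector: [lambda_1..lambda_ell, gamma]; linprog minimizes,
    # so maximizing gamma means minimizing -gamma.
    c = np.concatenate([np.zeros(ell), [-1.0]])
    # One row per hypothesis: sum_n y_n h_j(x_n) lambda_n + gamma <= 0
    A_ub = np.hstack([(y[:, None] * H).T, np.ones((J, 1))])
    b_ub = np.zeros(J)
    A_eq = np.concatenate([np.ones(ell), [0.0]])[None, :]   # sum_n lambda_n = 1
    b_eq = np.array([1.0])
    bounds = [(0, D)] * ell + [(None, None)]                 # gamma is free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    lam, gamma = res.x[:ell], res.x[-1]
    return lam, gamma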
Consider any finite subset of the constraints in the dual problem. For such a subset we can solve the linear program and thus satisfy all of the selected constraints. If we could prove that, of all the constraints we did not add to the dual problem, no single constraint is violated, we would have proven that solving our restricted problem is equivalent to solving the original problem. More formally, let \gamma^* be the optimal objective function value of the restricted problem. Then we can formulate a search problem for the 'most violated constraint' in the original problem space, namely finding \omega^* \in \Omega as
\omega^* = \underset{\omega\in\Omega}{\operatorname{argmax}} \sum_{n=1}^{\ell} y_n h(\boldsymbol{x}_n;\omega)\lambda_n.
That is, we search the hypothesis space \mathcal{H} for a single weak classifier h(\cdot\,;\omega^*) maximizing the left-hand side of the dual constraint. If this constraint cannot be violated by any choice of weak classifier, none of the omitted constraints can be active, and the restricted problem is equivalent to the original one.
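For a finite hypothesis set this search reduces to picking the hypothesis with the largest weighted edge. The following tiny sketch uses the same illustrative H, y and lam conventions as above.

# Sketch: most violated dual constraint over a finite hypothesis set.
import numpy as np

def most_violated(H, y, lam):
    edges = (y[:, None] * H).T @ lam   # weighted edge sum_n y_n h_j(x_n) lam_n per hypothesis
    j_star = int(np.argmax(edges))
    return j_star, edges[j_star]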
The positive value of the penalization constant D has to be found using model selection techniques. However, if we choose D = \frac{1}{\ell\nu}, where \ell is the number of training samples and 0 < \nu < 1, then the parameter \nu has an intuitive interpretation: it is an upper bound on the fraction of training errors. That is, if k denotes the number of misclassified training samples, then \frac{k}{\ell} \leq \nu. The parameter \nu therefore directly controls the tolerated fraction of margin violations.
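As a quick numeric illustration (the values of \ell and \nu here are arbitrary): with \ell = 1000 training samples and \nu = 0.05 one obtains
D = \frac{1}{\ell\nu} = \frac{1}{1000 \cdot 0.05} = 0.02,
and at most 5% of the training samples may be misclassified.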
Putting the pieces together, the LPBoost algorithm can be stated as follows.

Input:
    Training set X = \{\boldsymbol{x}_1, \dots, \boldsymbol{x}_\ell\}, \boldsymbol{x}_i \in \mathcal{X}
    Training labels Y = \{y_1, \dots, y_\ell\}, y_i \in \{-1, 1\}
    Convergence threshold \theta \geq 0
Output:
    Classification function f\colon \mathcal{X} \to \{-1, 1\}
Initialization:
    \lambda_n \leftarrow \frac{1}{\ell}, \quad n = 1, \dots, \ell
    \gamma \leftarrow 0
    J \leftarrow 1
Iterate:
    1. \hat{h} \leftarrow \underset{\omega\in\Omega}{\operatorname{argmax}} \sum_{n=1}^{\ell} y_n h(\boldsymbol{x}_n;\omega)\lambda_n
    2. if \sum_{n=1}^{\ell} y_n \hat{h}(\boldsymbol{x}_n)\lambda_n + \gamma \leq \theta then break
    3. h_J \leftarrow \hat{h}
    4. J \leftarrow J + 1
    5. (\boldsymbol{\lambda}, \gamma) \leftarrow solution of the restricted LPBoost dual problem
    6. \boldsymbol{\alpha} \leftarrow Lagrangian multipliers of the solution to the restricted LPBoost dual problem
Final classifier:
    f(\boldsymbol{x}) := \operatorname{sign}\left(\sum_{j=1}^{J} \alpha_j h_j(\boldsymbol{x})\right)
Note that if the convergence threshold is set to \theta = 0, the solution obtained is the globally optimal solution of the above linear program. In practice, \theta is set to a small positive value in order to obtain a good solution quickly.
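The following is a rough end-to-end sketch of this column-generation loop, reusing the hypothetical helpers sketched above (solve_lpboost_primal, solve_lpboost_dual) together with a user-supplied weak learner best_hypothesis(X, y, lam) that returns a vectorized classifier maximizing the weighted edge; max_iter is an added safeguard not present in the pseudocode.

# Sketch of LPBoost column generation; all helper names are illustrative.
import numpy as np

def lpboost(X, y, best_hypothesis, D, theta=1e-4, max_iter=100):
    ell = len(y)
    lam = np.full(ell, 1.0 / ell)    # initial dual weights lambda_n = 1/ell
    gamma = 0.0
    hypotheses = []                  # selected weak classifiers h_1, ..., h_J
    for _ in range(max_iter):
        h = best_hypothesis(X, y, lam)          # most violated dual constraint
        edge = np.sum(y * h(X) * lam)
        if edge + gamma <= theta:               # convergence check
            break
        hypotheses.append(h)
        # Restricted hypothesis-output matrix H[n, j] = h_j(x_n)
        H = np.column_stack([hj(X) for hj in hypotheses])
        lam, gamma = solve_lpboost_dual(H, y, D)
        alpha, _, _ = solve_lpboost_primal(H, y, D)   # primal weights alpha_j
    # (sketch assumes at least one hypothesis is added before convergence)
    def f(Xnew):                                # final classifier
        Hnew = np.column_stack([hj(Xnew) for hj in hypotheses])
        return np.sign(Hnew @ alpha)
    return f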
The actual margin separating the training samples is termed the realized margin and is defined as
\rho(\boldsymbol{\alpha}) := \min_{n=1,\dots,\ell} y_n \sum_{\omega\in\Omega} \alpha_\omega h(\boldsymbol{x}_n;\omega).
The realized margin can and usually will be negative in the first iterations. For a hypothesis space that permits singling out any individual sample, as is commonly the case, the realized margin will eventually converge to some positive value.
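A small sketch of this quantity for the finite setting used above (H[n, j] = h_j(x_n), weights alpha):

# Sketch: realized margin rho(alpha) = min_n y_n * sum_j alpha_j h_j(x_n).
import numpy as np

def realized_margin(H, y, alpha):
    return np.min(y * (H @ alpha))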
While the above algorithm is proven to converge, there are, in contrast to other boosting formulations such as AdaBoost and TotalBoost, no known convergence bounds for LPBoost. In practice, however, LPBoost is known to converge quickly, often faster than other formulations.
LPBoost is an ensemble learning method and thus does not dictate the choice of base learners, that is, the space of hypotheses \mathcal{H}. If the base learners are particularly simple, they are often referred to as decision stumps.
The number of base learners commonly used with boosting in the literature is large. For example, if \mathcal{X} \subseteq \mathbb{R}^n, a base learner could be a linear soft-margin support vector machine. Or, even simpler, a decision stump of the form
h(\boldsymbol{x};\omega \in \{1,-1\}, p \in \{1,\dots,n\}, t \in \mathbb{R}) := \left\{\begin{array}{cl}\omega & \text{if } x_p \leq t\\ -\omega & \text{otherwise}\end{array}\right.
The above decision stump looks only along a single dimension p of the input space and simply thresholds the respective column of the sample using a constant threshold t. It can then decide in either direction, depending on \omega, for the positive or negative class.
Given weights for the training samples, constructing the optimal decision stump of the above form simply involves searching along all sample columns and determining p, t and \omega so as to optimize the gain function.
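A rough sketch of this search (names and the brute-force threshold scan are illustrative, not taken from the original literature); it can serve as the best_hypothesis weak learner in the column-generation sketch above.

# Sketch: exhaustive search for the best stump under dual weights lam; X is (ell x n).
import numpy as np

def best_stump(X, y, lam):
    ell, n_features = X.shape
    best_gain, best = -np.inf, None
    for p in range(n_features):
        # Candidate thresholds: the observed values of feature p
        for t in np.unique(X[:, p]):
            pred = np.where(X[:, p] <= t, 1.0, -1.0)   # stump with omega = +1
            gain = np.sum(y * pred * lam)
            for omega, g in ((1, gain), (-1, -gain)):  # omega = -1 flips the sign
                if g > best_gain:
                    best_gain, best = g, (omega, p, t)
    omega, p, t = best
    # Return a vectorized classifier h(x) in {-1, +1}
    return lambda Xnew: omega * np.where(Xnew[:, p] <= t, 1.0, -1.0)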