The structured support-vector machine is a machine learning algorithm that generalizes the Support-Vector Machine (SVM) classifier. Whereas the SVM classifier supports binary classification, multiclass classification and regression, the structured SVM allows training of a classifier for general structured output labels.
As an example, a sample instance might be a natural language sentence, and the output label is an annotated parse tree. Training a classifier consists of showing it pairs of sample instances and their correct output labels. After training, the structured SVM model allows one to predict the corresponding output label for new sample instances; that is, given a natural language sentence, the classifier can produce the most likely parse tree.
For a set of n training instances (\boldsymbol{x}_i, y_i) \in \mathcal{X} \times \mathcal{Y}, i = 1,\dots,n, drawn from a sample space \mathcal{X} and a label space \mathcal{Y}, the structured SVM minimizes the following regularized risk function:

\underset{\boldsymbol{w}}{\min}\ \|\boldsymbol{w}\|^2 + C \sum_{i=1}^{n} \underset{y \in \mathcal{Y}}{\max} \left( 0, \Delta(y_i, y) + \langle \boldsymbol{w}, \Psi(\boldsymbol{x}_i, y) \rangle - \langle \boldsymbol{w}, \Psi(\boldsymbol{x}_i, y_i) \rangle \right)

The function is convex in \boldsymbol{w} because the maximum of a set of affine functions is convex. The function \Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+ measures a distance in label space; it is an arbitrary function (not necessarily a metric) satisfying \Delta(y, z) \geq 0 and \Delta(y, y) = 0 for all y, z \in \mathcal{Y}. The function \Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d is a joint feature function that extracts a feature vector from a given sample and label; its design depends strongly on the application.
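To make the objective concrete, the following minimal sketch evaluates this regularized risk when the label space is small enough to enumerate explicitly. It is an illustration only; the names structured_risk, psi, delta and labels are hypothetical stand-ins for the joint feature map \Psi, the label loss \Delta and the label space \mathcal{Y}.

import numpy as np

def structured_risk(w, X, Y, psi, delta, labels, C=1.0):
    # Regularized structured-SVM risk for an enumerable label space `labels`.
    # psi(x, y) returns the joint feature vector Psi(x, y) in R^d;
    # delta(y, z) returns the nonnegative label loss Delta(y, z).
    risk = np.dot(w, w)  # ||w||^2 regularizer
    for x_i, y_i in zip(X, Y):
        score_true = np.dot(w, psi(x_i, y_i))
        # margin-rescaled hinge term: max over y of Delta plus score difference
        hinge = max(delta(y_i, y) + np.dot(w, psi(x_i, y)) - score_true
                    for y in labels)
        risk += C * max(0.0, hinge)
    return risk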
Because the regularized risk function above is non-differentiable, it is often reformulated as a quadratic program by introducing one slack variable \xi_i per sample, each representing the value of the maximum. The standard structured SVM primal formulation is the following:

\begin{array}{cl} \underset{\boldsymbol{w}, \boldsymbol{\xi}}{\min} & \|\boldsymbol{w}\|^2 + C \sum_{i=1}^{n} \xi_i \\ \mathrm{s.t.} & \langle \boldsymbol{w}, \Psi(\boldsymbol{x}_i, y_i) \rangle - \langle \boldsymbol{w}, \Psi(\boldsymbol{x}_i, y) \rangle + \xi_i \geq \Delta(y_i, y), \quad i = 1,\dots,n, \ \forall y \in \mathcal{Y} \end{array}
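When the label space is small enough to enumerate, this primal can be written out constraint by constraint and handed to an off-the-shelf QP solver. The sketch below does so with the CVXPY modelling library as an illustrative assumption; it is not the method used in practice, since realistic label spaces are far too large for explicit constraints, and psi, delta and labels are again hypothetical helpers.

import cvxpy as cp

def solve_structured_svm_qp(X, Y, psi, delta, labels, C=1.0):
    # Write out every margin constraint explicitly and solve the primal QP.
    # Only feasible when `labels` is tiny; larger problems require the
    # delayed constraint generation described below.
    n, d = len(X), len(psi(X[0], Y[0]))
    w = cp.Variable(d)
    xi = cp.Variable(n, nonneg=True)
    constraints = []
    for i, (x_i, y_i) in enumerate(zip(X, Y)):
        phi_true = psi(x_i, y_i)
        for y in labels:
            # <w, Psi(x_i, y_i)> - <w, Psi(x_i, y)> + xi_i >= Delta(y_i, y)
            constraints.append(w @ (phi_true - psi(x_i, y)) + xi[i] >= delta(y_i, y))
    problem = cp.Problem(cp.Minimize(cp.sum_squares(w) + C * cp.sum(xi)), constraints)
    problem.solve()
    return w.value, xi.value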
At test time, only a sample \boldsymbol{x} \in \mathcal{X} is known, and a prediction function f : \mathcal{X} \to \mathcal{Y} maps it to a predicted label from the label space \mathcal{Y}. For structured SVMs, given the vector \boldsymbol{w} obtained from training, the prediction function is the following:

f(\boldsymbol{x}) = \underset{y \in \mathcal{Y}}{\mathrm{argmax}}\ \langle \boldsymbol{w}, \Psi(\boldsymbol{x}, y) \rangle
Therefore, the maximizer over the label space is the predicted label. Solving for this maximizer is the so-called inference problem and is similar to making a maximum a-posteriori (MAP) prediction in probabilistic models. Depending on the structure of the function \Psi, solving for the maximizer can be a hard problem.
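For a label space that can be enumerated, the argmax can be found by brute force, as in the sketch below; for genuinely structured outputs such as parse trees or label sequences it is instead computed by a problem-specific dynamic program. The names predict and psi continue the hypothetical helpers of the earlier sketches.

import numpy as np

def predict(w, x, psi, labels):
    # f(x) = argmax_y <w, Psi(x, y)>, found here by exhaustive enumeration
    return max(labels, key=lambda y: np.dot(w, psi(x, y)))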
The above quadratic program involves a very large, possibly infinite number of linear inequality constraints. In general, the number of inequalities is too large to be optimized over explicitly. Instead, the problem is solved by delayed constraint generation, in which only a finite and small subset of the constraints is used. Optimizing over a subset of the constraints enlarges the feasible set and yields a solution that provides a lower bound on the objective. To test whether the solution \boldsymbol{w} violates constraints of the complete set of inequalities, a separation problem needs to be solved. Because the inequalities decompose over the samples, for each sample (\boldsymbol{x}_i, y_i) the following problem is solved:

y_i^* = \underset{y \in \mathcal{Y}}{\mathrm{argmax}} \left( \Delta(y_i, y) + \langle \boldsymbol{w}, \Psi(\boldsymbol{x}_i, y) \rangle - \langle \boldsymbol{w}, \Psi(\boldsymbol{x}_i, y_i) \rangle - \xi_i \right)

The right-hand side objective to be maximized is composed of the constant -\langle \boldsymbol{w}, \Psi(\boldsymbol{x}_i, y_i) \rangle - \xi_i and a term that depends on the variable being optimized over, namely \Delta(y_i, y) + \langle \boldsymbol{w}, \Psi(\boldsymbol{x}_i, y) \rangle. If the maximized objective is smaller than or equal to zero, no constraints for this sample are violated; if it is strictly larger than zero, the most violated constraint for this sample has been identified, is added to the working subset, and the quadratic program is re-solved. The process continues until no violated inequalities can be found.
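A sketch of this separation check, in the brute-force setting of the earlier examples, follows. most_violated_constraint is a hypothetical helper: it returns the maximizing label together with the violation value, where a strictly positive value means the corresponding constraint is violated.

import numpy as np

def most_violated_constraint(w, x_i, y_i, xi_i, psi, delta, labels):
    # Separation oracle for one training pair: maximize
    # Delta(y_i, y) + <w, Psi(x_i, y)> - <w, Psi(x_i, y_i)> - xi_i over y.
    y_star = max(labels, key=lambda y: delta(y_i, y) + np.dot(w, psi(x_i, y)))
    violation = (delta(y_i, y_star) + np.dot(w, psi(x_i, y_star))
                 - np.dot(w, psi(x_i, y_i)) - xi_i)
    # violation > 0 means the constraint for y_star is violated and should
    # be added to the working set of the quadratic program.
    return y_star, violation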
If the constants are dropped from the above problem, we obtain the following problem to be solved.
y_i^* = \underset{y \in \mathcal{Y}}{\mathrm{argmax}} \left( \Delta(y_i, y) + \langle \boldsymbol{w}, \Psi(\boldsymbol{x}_i, y) \rangle \right)
This problem looks very similar to the inference problem. The only difference is the addition of the label loss term \Delta(y_i, y); for this reason the maximization is often called loss-augmented inference. Most often, \Delta is chosen such that it decomposes naturally over the label space. In that case, the influence of \Delta can be encoded into the inference problem, and solving for the most violating constraint is equivalent to solving the inference problem.
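As an illustration of such a decomposition, suppose the labels are sequences, \Delta is the Hamming loss, and, as a simplifying assumption, the joint feature map contributes an independent score for each sequence position with no transition features. The label loss then merely shifts each per-position score, so loss-augmented decoding is no harder than ordinary decoding. The names loss_augmented_decode and unary_scores below are hypothetical.

import numpy as np

def loss_augmented_decode(unary_scores, y_true):
    # unary_scores[t, k]: score of assigning state k at position t.
    # With a Hamming label loss, every wrong state gains +1, so the
    # loss-augmented argmax remains a position-wise maximization.
    T, _ = unary_scores.shape
    augmented = unary_scores + 1.0            # +1 for every state...
    augmented[np.arange(T), y_true] -= 1.0    # ...except the true one
    return augmented.argmax(axis=1)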