Hinge loss explained
In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs).[1]
For an intended output t = ±1 and a classifier score y, the hinge loss of the prediction y is defined as

\ell(y) = \max(0, 1 - t \cdot y)

Note that y should be the "raw" output of the classifier's decision function, not the predicted class label. For instance, in linear SVMs, y = w · x + b, where (w, b) are the parameters of the hyperplane and x is the input variable(s).
When t and y have the same sign (meaning y predicts the right class) and |y| ≥ 1, the hinge loss \ell(y) = 0. When they have opposite signs, \ell(y) increases linearly with y, and similarly if |y| < 1, even if it has the same sign (correct prediction, but not by enough margin).
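The piecewise behavior above can be sketched directly; this is a minimal illustration (the function name is our own, not from any particular library):

```python
def hinge_loss(t, y):
    """Hinge loss max(0, 1 - t*y) for intended output t in {-1, +1}
    and raw classifier score y."""
    return max(0.0, 1.0 - t * y)

# Correct with margin: zero loss; inside the margin: small positive loss;
# wrong side: loss grows linearly with the score.
print(hinge_loss(1, 2.0))   # classified correctly, outside the margin
print(hinge_loss(1, 0.5))   # correct sign, but not by enough margin
print(hinge_loss(-1, 0.5))  # wrong sign
```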
Extensions
While binary SVMs are commonly extended to multiclass classification in a one-vs.-all or one-vs.-one fashion,[2] it is also possible to extend the hinge loss itself for such an end. Several different variations of multiclass hinge loss have been proposed.[3] For example, Crammer and Singer[4] defined it for a linear classifier as[5]
\ell(y) = \max(0, 1 + \max_{y \ne t} \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x}),

where t is the target label, and w_t and w_y are the model parameters.
Weston and Watkins provided a similar definition, but with a sum rather than a max:[6]
\ell(y) = \sum_{y \ne t} \max(0, 1 + \mathbf{w}_y \mathbf{x} - \mathbf{w}_t \mathbf{x}).
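The two multiclass variants differ only in whether competing classes contribute their worst margin violation or the sum of all violations. A sketch of both, assuming per-class weight rows in a matrix W (our own toy setup):

```python
import numpy as np

def crammer_singer_hinge(W, x, t):
    """Crammer-Singer multiclass hinge: max over competing classes."""
    scores = W @ x                       # one score per class
    margins = 1.0 + scores - scores[t]   # margin violation per class
    margins[t] = 0.0                     # exclude the target class itself
    return max(0.0, margins.max())

def weston_watkins_hinge(W, x, t):
    """Weston-Watkins multiclass hinge: sum over competing classes."""
    scores = W @ x
    margins = np.maximum(0.0, 1.0 + scores - scores[t])
    margins[t] = 0.0
    return margins.sum()
```

With a single competing class violating the margin, the two definitions coincide; they diverge as soon as several classes violate it at once.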
In structured prediction, the hinge loss can be further extended to structured output spaces. Structured SVMs with margin rescaling use the following variant, where w denotes the SVM's parameters, y the SVM's predictions, \phi the joint feature function, and \Delta the Hamming loss:

\begin{align}
\ell(\mathbf{y}) &= \max(0, \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle) \\
&= \max(0, \max_{\mathbf{y} \in \mathcal{Y}} \left( \Delta(\mathbf{y}, \mathbf{t}) + \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{y}) \rangle \right) - \langle \mathbf{w}, \phi(\mathbf{x}, \mathbf{t}) \rangle)
\end{align}
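When the output space is small enough to enumerate, the margin-rescaled variant can be computed by brute force. A sketch under that assumption; phi, delta, and the label space are placeholders supplied by the caller, not part of any library API:

```python
import numpy as np

def structured_hinge(w, phi, delta, x, t, label_space):
    """Margin-rescaled structured hinge for an enumerable label space.

    w: parameter vector; phi(x, y): joint feature map; delta(y, t): task loss.
    In practice the inner max is solved by loss-augmented inference, not
    enumeration."""
    best = max(delta(y, t) + w @ phi(x, y) for y in label_space)
    return max(0.0, best - w @ phi(x, t))
```

Note that the loss is zero exactly when the true output t beats every other output by at least its task-loss margin delta(y, t).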
Optimization
The hinge loss is a convex function, so many of the usual convex optimizers used in machine learning can work with it. It is not differentiable, but has a subgradient with respect to the model parameters w of a linear SVM with score function y = w · x that is given by

\frac{\partial \ell}{\partial w_i} = \begin{cases}
-t \cdot x_i & \text{if } t \cdot y < 1, \\
0 & \text{otherwise.}
\end{cases}
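This subgradient gives the familiar perceptron-like update used in (sub)gradient descent for linear SVMs. A minimal sketch of one unregularized step, with an illustrative learning rate:

```python
import numpy as np

def hinge_subgradient_step(w, x, t, lr=0.1):
    """One subgradient-descent step on the hinge loss for a linear SVM
    with score y = w @ x (no bias, no regularization)."""
    if t * (w @ x) < 1:        # margin violated: subgradient is -t * x
        w = w + lr * t * x     # step in the negative subgradient direction
    return w                   # margin satisfied: subgradient is 0
```

Examples that already satisfy the margin leave w unchanged, which is what makes the resulting solution depend only on the support vectors.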
However, since the derivative of the hinge loss at ty = 1 is undefined, smoothed versions may be preferred for optimization, such as Rennie and Srebro's[7]

\ell(y) = \begin{cases}
\frac{1}{2} - ty & \text{if } ty \le 0, \\
\frac{1}{2}(1 - ty)^2 & \text{if } 0 < ty < 1, \\
0 & \text{if } 1 \le ty
\end{cases}
or the quadratically smoothed

\ell_\gamma(y) = \begin{cases}
\frac{1}{2\gamma} \max(0, 1 - ty)^2 & \text{if } ty \ge 1 - \gamma, \\
1 - \frac{\gamma}{2} - ty & \text{otherwise}
\end{cases}

suggested by Zhang.[8] The modified Huber loss is a special case of this loss function with \gamma = 2, specifically L(y) = 4\ell_2(y).
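Both smoothed losses are straightforward to transcribe from their piecewise definitions; a sketch (function names are our own):

```python
def smooth_hinge(t, y):
    """Rennie & Srebro's smooth hinge: linear, then quadratic, then zero."""
    z = t * y
    if z <= 0:
        return 0.5 - z
    if z < 1:
        return 0.5 * (1.0 - z) ** 2
    return 0.0

def quad_smoothed_hinge(t, y, gamma=2.0):
    """Zhang's quadratically smoothed hinge with smoothing parameter gamma."""
    z = t * y
    if z >= 1.0 - gamma:
        return max(0.0, 1.0 - z) ** 2 / (2.0 * gamma)
    return 1.0 - gamma / 2.0 - z
```

Both pieces of each definition meet with matching value and slope at the breakpoints, which is what makes these losses continuously differentiable where the plain hinge is not.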
Notes and References
- Rosasco, L.; De Vito, E. D.; Caponnetto, A.; Piana, M.; Verri, A. (2004). "Are Loss Functions All the Same?". Neural Computation. 16 (5): 1063–1076. doi:10.1162/089976604773135104. PMID 15070510.
- Duan, K. B.; Keerthi, S. S. (2005). "Which Is the Best Multiclass SVM Method? An Empirical Study". Multiple Classifier Systems. LNCS 3541: 278–285. doi:10.1007/11494683_28. ISBN 978-3-540-26306-7. http://www.keerthis.com/multiclass_mcs_kaibo_05.pdf
- Doğan, Ürün; Glasmachers, Tobias; Igel, Christian (2016). "A Unified View on Multi-class Support Vector Classification". 17: 1–32.
- Crammer, Koby; Singer, Yoram (2001). "On the algorithmic implementation of multiclass kernel-based vector machines". 2: 265–292.
- Moore, Robert C.; DeNero, John (2011). "L1 and L2 regularization for multiclass hinge loss models". Proc. Symp. on Machine Learning in Speech and Language Processing. Archived: https://web.archive.org/web/20170828233715/http://www.ttic.edu/sigml/symposium2011/papers/Moore+DeNero_Regularization.pdf
- Weston, Jason; Watkins, Chris (1999). "Support Vector Machines for Multi-Class Pattern Recognition". European Symposium on Artificial Neural Networks. Archived: https://web.archive.org/web/20180505024710/https://www.elen.ucl.ac.be/Proceedings/esann/esannpdf/es1999-461.pdf
- Rennie, Jason D. M.; Srebro, Nathan (2005). "Loss Functions for Preference Levels: Regression with Discrete Ordered Labels". Proc. IJCAI Multidisciplinary Workshop on Advances in Preference Handling.
- Zhang, Tong (2004). "Solving large scale linear prediction problems using stochastic gradient descent algorithms". ICML.