In machine learning and mathematical optimization, loss functions for classification are computationally feasible loss functions representing the price paid for inaccuracy of predictions in classification problems (problems of identifying which category a particular observation belongs to).[1] Given
Given $\mathcal{X}$ as the space of all possible inputs (usually $\mathcal{X}\subset\mathbb{R}^d$), and $\mathcal{Y}=\{-1,1\}$ as the set of labels (possible outputs), a typical goal of classification algorithms is to find a function $f:\mathcal{X}\to\mathcal{Y}$ which best predicts a label $y$ for a given input $\vec{x}$. However, because of incomplete information, noise in the measurement, or probabilistic components in the underlying process, it is possible for the same $\vec{x}$ to generate different $y$. As a result, the goal of the learning problem is to minimize expected loss (also known as the risk), defined as
\[
I[f]=\int_{\mathcal{X}\times\mathcal{Y}} V(f(\vec{x}),y)\,p(\vec{x},y)\,d\vec{x}\,dy,
\]
where $V(f(\vec{x}),y)$ is a given loss function, and $p(\vec{x},y)$ is the probability density function of the process that generated the data, which can equivalently be written as
\[
p(\vec{x},y)=p(y\mid\vec{x})\,p(\vec{x}).
\]
Within classification, several commonly used loss functions are written solely in terms of the product of the true label $y$ and the predicted label $f(\vec{x})$. Therefore, they can be defined as functions of only one variable $\upsilon=yf(\vec{x})$, so that $V(f(\vec{x}),y)=\phi(yf(\vec{x}))=\phi(\upsilon)$ for a suitably chosen function $\phi:\mathbb{R}\to\mathbb{R}$. These are called margin-based loss functions, and choosing a margin loss function amounts to choosing $\phi$. Selection of a loss function within this framework impacts the optimal $f^{*}_{\phi}$ which minimizes the expected risk.
In the case of binary classification, it is possible to simplify the calculation of expected risk from the integral specified above. Specifically,
\begin{align}
I[f] &= \int_{\mathcal{X}\times\mathcal{Y}} V(f(\vec{x}),y)\,p(\vec{x},y)\,d\vec{x}\,dy \\[6pt]
&= \int_{\mathcal{X}}\int_{\mathcal{Y}} \phi(yf(\vec{x}))\,p(y\mid\vec{x})\,p(\vec{x})\,dy\,d\vec{x} \\[6pt]
&= \int_{\mathcal{X}} \bigl[\phi(f(\vec{x}))\,p(1\mid\vec{x})+\phi(-f(\vec{x}))\,p(-1\mid\vec{x})\bigr]\,p(\vec{x})\,d\vec{x} \\[6pt]
&= \int_{\mathcal{X}} \bigl[\phi(f(\vec{x}))\,p(1\mid\vec{x})+\phi(-f(\vec{x}))\,(1-p(1\mid\vec{x}))\bigr]\,p(\vec{x})\,d\vec{x}
\end{align}
The second equality follows from the properties described above. The third equality follows from the fact that 1 and −1 are the only possible values for $y$, and the fourth because $p(-1\mid\vec{x})=1-p(1\mid\vec{x})$. The term within brackets, $\bigl[\phi(f(\vec{x}))\,p(1\mid\vec{x})+\phi(-f(\vec{x}))\,(1-p(1\mid\vec{x}))\bigr]$, is known as the conditional risk.
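The conditional risk can be evaluated pointwise. The following sketch (illustrative, not part of the original formulation) computes it for the hinge loss $\phi(\upsilon)=\max(0,1-\upsilon)$ at a point whose posterior is $\eta=p(1\mid\vec{x})=0.8$, showing that predicting the more likely class gives the smaller conditional risk.

```python
def conditional_risk(phi, f, eta):
    """Conditional risk phi(f)*eta + phi(-f)*(1-eta) at a single point x,
    where eta = p(1|x) is the posterior probability of the positive class."""
    return phi(f) * eta + phi(-f) * (1 - eta)

# Hinge loss as the margin loss phi
hinge = lambda v: max(0.0, 1.0 - v)

eta = 0.8                                      # positive class is more likely
risk_pos = conditional_risk(hinge, 1.0, eta)   # predict f = +1
risk_neg = conditional_risk(hinge, -1.0, eta)  # predict f = -1
```

Here `risk_pos = 0.4` and `risk_neg = 1.6`, so the prediction agreeing with the dominant posterior is preferred.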
One can solve for the minimizer of $I[f]$ by taking the functional derivative of the last equality with respect to $f$ and setting the derivative equal to 0. This results in the following equation
\[
\frac{\partial\phi(f)}{\partial f}\,\eta + \frac{\partial\phi(-f)}{\partial f}\,(1-\eta) = 0, \qquad (1)
\]
where $\eta=p(y=1\mid\vec{x})$, which is also equivalent to setting the derivative of the conditional risk equal to zero.
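Equation (1) can be checked numerically. As an illustrative sketch, for the exponential loss $\phi(v)=e^{-v}$ the stationarity condition $-e^{-f}\eta + e^{f}(1-\eta)=0$ gives $f^{*}=\tfrac{1}{2}\log\bigl(\eta/(1-\eta)\bigr)$, and this value indeed beats nearby candidates on the conditional risk:

```python
import math

def conditional_risk(f, eta):
    # Conditional risk for the exponential loss phi(v) = exp(-v)
    return math.exp(-f) * eta + math.exp(f) * (1 - eta)

eta = 0.8
# Solving equation (1) for the exponential loss:
# -exp(-f)*eta + exp(f)*(1-eta) = 0  =>  f* = (1/2) * log(eta / (1-eta))
f_star = 0.5 * math.log(eta / (1 - eta))

# Nearby candidates for a numerical comparison
grid = [f_star + d for d in (-0.1, -0.01, 0.01, 0.1)]
```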
Given the binary nature of classification, a natural selection for a loss function (assuming equal cost for false positives and false negatives) would be the 0-1 loss function (0–1 indicator function), which takes the value 0 if the predicted classification equals the true class and 1 if it does not. This selection is modeled by
\[
V(f(\vec{x}),y)=H(-yf(\vec{x})),
\]
where $H$ is the Heaviside step function.
In practice, the probability distribution $p(\vec{x},y)$ is unknown. Consequently, utilizing a training set of $n$ independently and identically distributed sample points
\[
S=\{(\vec{x}_1,y_1),\dots,(\vec{x}_n,y_n)\}
\]
drawn from the data sample space, one seeks to minimize empirical risk
\[
I_S[f]=\frac{1}{n}\sum_{i=1}^{n} V(f(\vec{x}_i),y_i)
\]
as a proxy for expected risk. (See statistical learning theory for a more detailed description.)
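As a small illustrative sketch (the sample and classifier below are invented for the example), the empirical risk of a fixed classifier under the 0-1 loss is just its error rate on the sample:

```python
# Toy sample of (x_i, y_i) pairs, labels in {-1, 1}
sample = [(-2.0, -1), (-0.5, -1), (0.3, 1), (1.5, 1), (-0.1, 1)]

f = lambda x: 1 if x > 0 else -1          # a simple threshold classifier
zero_one = lambda fx, y: 0 if fx == y else 1  # 0-1 loss on the prediction

# Empirical risk I_S[f] = (1/n) * sum of losses
emp_risk = sum(zero_one(f(x), y) for x, y in sample) / len(sample)
```

One of the five points is misclassified, so the empirical risk is 0.2.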
Utilizing Bayes' theorem, it can be shown that the optimal $f^{*}_{0/1}$, i.e., the one that minimizes the expected risk associated with the 0-1 loss, implements the Bayes optimal decision rule for a binary classification problem and is given by
\[
f^{*}_{0/1}(\vec{x}) = \begin{cases} 1 & \text{if } p(1\mid\vec{x}) > p(-1\mid\vec{x}) \\ 0 & \text{if } p(1\mid\vec{x}) = p(-1\mid\vec{x}) \\ -1 & \text{if } p(1\mid\vec{x}) < p(-1\mid\vec{x}) \end{cases}
\]
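The decision rule above translates directly into code (an illustrative sketch; since $p(-1\mid\vec{x})=1-p(1\mid\vec{x})$, comparing the two posteriors reduces to comparing $p(1\mid\vec{x})$ with $0.5$):

```python
def bayes_predict(p_pos):
    """Bayes optimal decision rule. p_pos = p(1|x), so p(-1|x) = 1 - p_pos."""
    if p_pos > 0.5:
        return 1
    if p_pos < 0.5:
        return -1
    return 0  # posteriors are equal: the rule is indifferent
```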
A loss function is said to be classification-calibrated or Bayes consistent if its optimal $f^{*}_{\phi}$ is such that $f^{*}_{0/1}(\vec{x})=\operatorname{sgn}(f^{*}_{\phi}(\vec{x}))$ and is thus optimal under the Bayes decision rule. A Bayes consistent loss function allows us to find the Bayes optimal decision function $f^{*}_{\phi}$ by directly minimizing the expected risk, without having to explicitly model the probability density functions.
For a convex margin loss $\phi(\upsilon)$, it can be shown that $\phi(\upsilon)$ is Bayes consistent if and only if it is differentiable at 0 and $\phi'(0)<0$. Yet, this result does not exclude the existence of non-convex Bayes consistent loss functions. A more general result states that Bayes consistent loss functions can be generated using the following formulation
\[
\phi(v)=C[f^{-1}(v)]+\bigl(1-f^{-1}(v)\bigr)C'[f^{-1}(v)], \qquad (2)
\]
where $f(\eta)$, $0\leq\eta\leq 1$, is any invertible function such that $f^{-1}(-v)=1-f^{-1}(v)$, and $C(\eta)$ is any differentiable strictly concave function such that $C(\eta)=C(1-\eta)$. Table-I shows the generated Bayes consistent loss functions for some example choices of $C(\eta)$ and $f^{-1}(v)$. Note that the Savage and Tangent losses are not convex; such non-convex loss functions have been shown to be useful in dealing with outliers in classification. For all loss functions generated from (2), the posterior probability $p(y=1\mid\vec{x})$ can be found using the invertible link function as $p(y=1\mid\vec{x})=\eta=f^{-1}(v)$. Such loss functions, where the posterior probability can be recovered using the invertible link, are called proper loss functions.
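Formula (2) can be exercised numerically. As an illustrative sketch, plugging in the exponential-loss entries $C(\eta)=2\sqrt{\eta(1-\eta)}$ and $f^{-1}(v)=e^{2v}/(1+e^{2v})$ should reproduce $\phi(v)=e^{-v}$:

```python
import math

# Generating a Bayes consistent loss via formula (2), using the
# exponential-loss entries: C(eta) = 2*sqrt(eta*(1-eta)),
# C'(eta) = (1-2*eta)/sqrt(eta*(1-eta)), f^{-1}(v) = e^{2v}/(1+e^{2v}).
def C(eta):
    return 2 * math.sqrt(eta * (1 - eta))

def C_prime(eta):
    return (1 - 2 * eta) / math.sqrt(eta * (1 - eta))

def f_inv(v):
    return math.exp(2 * v) / (1 + math.exp(2 * v))

def phi(v):
    eta = f_inv(v)
    return C(eta) + (1 - eta) * C_prime(eta)   # formula (2)
```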
Table-I

| Loss name | $\phi(v)$ | $C(\eta)$ | $f^{-1}(v)$ | $f(\eta)$ |
| Exponential | $e^{-v}$ | $2\sqrt{\eta(1-\eta)}$ | $\frac{e^{2v}}{1+e^{2v}}$ | $\frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)$ |
| Logistic | $\frac{1}{\log(2)}\log(1+e^{-v})$ | $\frac{1}{\log(2)}\left[-\eta\log(\eta)-(1-\eta)\log(1-\eta)\right]$ | $\frac{e^{v}}{1+e^{v}}$ | $\log\left(\frac{\eta}{1-\eta}\right)$ |
| Square | $(1-v)^{2}$ | $4\eta(1-\eta)$ | $\frac{1}{2}(v+1)$ | $2\eta-1$ |
| Savage | $\frac{1}{(1+e^{v})^{2}}$ | $\eta(1-\eta)$ | $\frac{e^{v}}{1+e^{v}}$ | $\log\left(\frac{\eta}{1-\eta}\right)$ |
| Tangent | $(2\arctan(v)-1)^{2}$ | $4\eta(1-\eta)$ | $\arctan(v)+\frac{1}{2}$ | $\tan\left(\eta-\frac{1}{2}\right)$ |

The sole minimizer of the expected risk, $f^{*}_{\phi}$, associated with the above generated loss functions can be directly found from equation (1) and shown to be equal to the corresponding $f(\eta)$. This holds even for the non-convex loss functions, which means that gradient descent based algorithms such as gradient boosting can be used to construct the minimizer.
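This minimizer property can be verified by brute force even for a non-convex entry. As an illustrative sketch, for the Savage loss $\phi(v)=1/(1+e^{v})^{2}$ the conditional risk should be minimized at $f(\eta)=\log\bigl(\eta/(1-\eta)\bigr)$:

```python
import math

# Savage loss (non-convex): phi(v) = 1 / (1 + e^v)^2
phi = lambda v: 1.0 / (1.0 + math.exp(v)) ** 2

def cond_risk(f, eta):
    # Conditional risk phi(f)*eta + phi(-f)*(1-eta)
    return phi(f) * eta + phi(-f) * (1 - eta)

eta = 0.7
f_eta = math.log(eta / (1 - eta))           # Table-I minimizer f(eta)
grid = [-3 + 0.01 * k for k in range(601)]  # brute-force search over f
f_best = min(grid, key=lambda f: cond_risk(f, eta))
```

The grid minimum lands on the grid point nearest $\log(0.7/0.3)\approx 0.847$.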
For proper loss functions, the loss margin can be defined as $\mu_{\phi}=-\frac{\phi'(0)}{\phi''(0)}$ and shown to be directly related to the regularization properties of the classifier. Specifically, a loss function of larger margin increases regularization and produces better estimates of the posterior probability. For example, the loss margin can be increased for the logistic loss by introducing a $\gamma$ parameter and writing the logistic loss as $\frac{1}{\gamma}\log(1+e^{-\gamma v})$, where smaller $0<\gamma<1$ increases the margin of the loss. This is directly equivalent to decreasing the learning rate in gradient boosting, $F_m(x)=F_{m-1}(x)+\gamma h_m(x)$, where decreasing $\gamma$ improves the regularization of the boosted classifier. The theory makes clear that when a learning rate $\gamma$ is used, the correct formula for retrieving the posterior probability is now $\eta=f^{-1}(\gamma F(x))$.

In conclusion, by choosing a loss function with larger margin (smaller $\gamma$) we increase regularization and improve our estimates of the posterior probability, which in turn improves the ROC curve of the final classifier.
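The relation between $\gamma$ and the loss margin can be checked with finite differences. As an illustrative sketch, for $\phi_\gamma(v)=\frac{1}{\gamma}\log(1+e^{-\gamma v})$ one has $\phi_\gamma'(0)=-\tfrac{1}{2}$ and $\phi_\gamma''(0)=\tfrac{\gamma}{4}$, so $\mu_{\phi}=2/\gamma$ and smaller $\gamma$ yields a larger margin:

```python
import math

def phi_gamma(v, gamma):
    # Logistic loss with margin parameter gamma
    return math.log(1 + math.exp(-gamma * v)) / gamma

def loss_margin(phi, h=1e-4):
    # mu_phi = -phi'(0)/phi''(0), via central finite differences at 0
    d1 = (phi(h) - phi(-h)) / (2 * h)
    d2 = (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2
    return -d1 / d2

mu_small = loss_margin(lambda v: phi_gamma(v, 0.25))  # expect ~ 2/0.25 = 8
mu_large = loss_margin(lambda v: phi_gamma(v, 1.0))   # expect ~ 2/1 = 2
```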
While more commonly used in regression, the square loss function can be re-written as a function $\phi(yf(\vec{x}))$ and utilized for classification. It can be generated using (2) and Table-I as follows
\[
\phi(v)=C[f^{-1}(v)]+\bigl(1-f^{-1}(v)\bigr)C'[f^{-1}(v)]=4\left(\tfrac{1}{2}(v+1)\right)\left(1-\tfrac{1}{2}(v+1)\right)+\left(1-\tfrac{1}{2}(v+1)\right)\left(4-8\left(\tfrac{1}{2}(v+1)\right)\right)=(1-v)^{2}.
\]
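The algebra above can be double-checked numerically (an illustrative sketch using the Table-I entries $C(\eta)=4\eta(1-\eta)$, $C'(\eta)=4-8\eta$, $f^{-1}(v)=\tfrac{1}{2}(v+1)$):

```python
# Square loss generated from formula (2) with the Table-I entries
def phi(v):
    eta = (v + 1) / 2.0                          # f^{-1}(v)
    return 4 * eta * (1 - eta) + (1 - eta) * (4 - 8 * eta)

# phi(v) should coincide with (1 - v)**2 for every v
```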
The square loss function is both convex and smooth. However, the square loss function tends to penalize outliers excessively, leading to slower convergence rates (with regard to sample complexity) than for the logistic loss or hinge loss functions. In addition, functions which yield high values of $f(\vec{x})$ for some $\vec{x}\in\mathcal{X}$ will perform poorly with the square loss function, since high values of $yf(\vec{x})$ will be penalized severely, regardless of whether the signs of $y$ and $f(\vec{x})$ match.
A benefit of the square loss function is that its structure lends itself to easy cross validation of regularization parameters. Specifically for Tikhonov regularization, one can solve for the regularization parameter using leave-one-out cross-validation in the same time as it would take to solve a single problem.
The minimizer of $I[f]$ for the square loss function can be directly found from equation (1) as
\[
f^{*}_{\text{Square}}=2\eta-1=2p(1\mid\vec{x})-1.
\]
The logistic loss function can be generated using (2) and Table-I as follows
\begin{align}
\phi(v) &= C[f^{-1}(v)]+\left(1-f^{-1}(v)\right)C'\left[f^{-1}(v)\right] \\
&= \frac{1}{\log(2)}\left[\frac{-e^{v}}{1+e^{v}}\log\frac{e^{v}}{1+e^{v}}-\left(1-\frac{e^{v}}{1+e^{v}}\right)\log\left(1-\frac{e^{v}}{1+e^{v}}\right)\right]+\left(1-\frac{e^{v}}{1+e^{v}}\right)\left[\frac{-1}{\log(2)}\log\left(\frac{e^{v}/(1+e^{v})}{1-e^{v}/(1+e^{v})}\right)\right] \\
&= \frac{1}{\log(2)}\log(1+e^{-v}).
\end{align}
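This derivation admits the same kind of numerical check (an illustrative sketch using the Table-I entries for the logistic loss):

```python
import math

# Logistic loss generated from formula (2):
# C(eta) = [-eta*log(eta) - (1-eta)*log(1-eta)] / log(2)
# C'(eta) = log((1-eta)/eta) / log(2),  f^{-1}(v) = e^v / (1 + e^v)
def phi(v):
    eta = math.exp(v) / (1 + math.exp(v))
    C = (-eta * math.log(eta) - (1 - eta) * math.log(1 - eta)) / math.log(2)
    C_prime = math.log((1 - eta) / eta) / math.log(2)
    return C + (1 - eta) * C_prime

# phi(v) should coincide with log(1 + e^{-v}) / log(2)
```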
The logistic loss is convex and grows linearly for negative values, which makes it less sensitive to outliers. The logistic loss is used in the LogitBoost algorithm.
The minimizer of $I[f]$ for the logistic loss function can be directly found from equation (1) as
\[
f^{*}_{\text{Logistic}}=\log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid\vec{x})}{1-p(1\mid\vec{x})}\right).
\]
This function is undefined when $p(1\mid\vec{x})=1$ or $p(1\mid\vec{x})=0$ (tending towards $\infty$ and $-\infty$ respectively), but it predicts a smooth curve which grows as $p(1\mid\vec{x})$ increases and equals 0 when $p(1\mid\vec{x})=0.5$.

It is easy to check that the logistic loss and the binary cross-entropy loss (log loss) are in fact the same, up to a multiplicative constant $\frac{1}{\log(2)}$.
The exponential loss function can be generated using (2) and Table-I as follows
\[
\phi(v)=C[f^{-1}(v)]+\bigl(1-f^{-1}(v)\bigr)C'[f^{-1}(v)]=2\sqrt{\frac{e^{2v}}{1+e^{2v}}\left(1-\frac{e^{2v}}{1+e^{2v}}\right)}+\left(1-\frac{e^{2v}}{1+e^{2v}}\right)\frac{1-\frac{2e^{2v}}{1+e^{2v}}}{\sqrt{\frac{e^{2v}}{1+e^{2v}}\left(1-\frac{e^{2v}}{1+e^{2v}}\right)}}=e^{-v}.
\]
The exponential loss is convex and grows exponentially for negative values, which makes it more sensitive to outliers. The exponentially weighted 0-1 loss is used in the AdaBoost algorithm, implicitly giving rise to the exponential loss.
The minimizer of $I[f]$ for the exponential loss function can be directly found from equation (1) as
\[
f^{*}_{\text{Exp}}=\frac{1}{2}\log\left(\frac{\eta}{1-\eta}\right)=\frac{1}{2}\log\left(\frac{p(1\mid\vec{x})}{1-p(1\mid\vec{x})}\right).
\]
The Savage loss can be generated using (2) and Table-I as follows
\[
\phi(v)=C[f^{-1}(v)]+\bigl(1-f^{-1}(v)\bigr)C'[f^{-1}(v)]=\frac{e^{v}}{1+e^{v}}\left(1-\frac{e^{v}}{1+e^{v}}\right)+\left(1-\frac{e^{v}}{1+e^{v}}\right)\left(1-\frac{2e^{v}}{1+e^{v}}\right)=\frac{1}{(1+e^{v})^{2}}.
\]
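The outlier behavior contrasts sharply with the exponential loss. As an illustrative sketch, at a badly misclassified point ($v=-10$) the Savage loss saturates just below 1 while the exponential loss explodes:

```python
import math

savage = lambda v: 1.0 / (1.0 + math.exp(v)) ** 2  # bounded as v -> -inf
exp_loss = lambda v: math.exp(-v)                  # unbounded as v -> -inf

# Compare the penalties assigned to a strongly misclassified point
penalty_savage = savage(-10.0)    # close to, but below, 1
penalty_exp = exp_loss(-10.0)     # roughly e^10, about 22026
```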
The Savage loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. The Savage loss has been used in gradient boosting and the SavageBoost algorithm.
The minimizer of $I[f]$ for the Savage loss function can be directly found from equation (1) as
\[
f^{*}_{\text{Savage}}=\log\left(\frac{\eta}{1-\eta}\right)=\log\left(\frac{p(1\mid\vec{x})}{1-p(1\mid\vec{x})}\right).
\]
The Tangent loss[6] can be generated using (2) and Table-I as follows
\begin{align}
\phi(v) &= C[f^{-1}(v)]+\left(1-f^{-1}(v)\right)C'[f^{-1}(v)] \\
&= 4\left(\arctan(v)+\frac{1}{2}\right)\left(1-\left(\arctan(v)+\frac{1}{2}\right)\right)+\left(1-\left(\arctan(v)+\frac{1}{2}\right)\right)\left(4-8\left(\arctan(v)+\frac{1}{2}\right)\right) \\
&= \left(2\arctan(v)-1\right)^{2}.
\end{align}
The Tangent loss is quasi-convex and is bounded for large negative values which makes it less sensitive to outliers. Interestingly, the Tangent loss also assigns a bounded penalty to data points that have been classified "too correctly". This can help prevent over-training on the data set. The Tangent loss has been used in gradient boosting, the TangentBoost algorithm and Alternating Decision Forests.[7]
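Both bounds are easy to see numerically (an illustrative sketch): as $v\to-\infty$ the penalty approaches $(\pi+1)^{2}$, as $v\to+\infty$ ("too correct") it approaches $(\pi-1)^{2}$ rather than 0, and the loss attains its minimum of 0 at $v=\tan(1/2)$:

```python
import math

tangent = lambda v: (2 * math.atan(v) - 1) ** 2

# Bounded penalty on both sides of the margin axis
very_wrong = tangent(-100.0)          # near (pi + 1)^2 ~ 17.15
too_correct = tangent(100.0)          # near (pi - 1)^2 ~ 4.59, not 0
at_minimum = tangent(math.tan(0.5))   # exactly where 2*arctan(v) = 1
```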
The minimizer of $I[f]$ for the Tangent loss function can be directly found from equation (1) as
\[
f^{*}_{\text{Tangent}}=\tan\left(\eta-\frac{1}{2}\right)=\tan\left(p(1\mid\vec{x})-\frac{1}{2}\right).
\]
See main article: Hinge loss. The hinge loss function is defined with
\[
\phi(\upsilon)=\max(0,1-\upsilon)=[1-\upsilon]_{+},
\]
where $[a]_{+}=\max(0,a)$. Hence, for classification,
\[
V(f(\vec{x}),y)=\max(0,1-yf(\vec{x}))=[1-yf(\vec{x})]_{+}.
\]
The hinge loss provides a relatively tight, convex upper bound on the 0–1 indicator function. Specifically, the hinge loss equals the 0–1 indicator function when $\operatorname{sgn}(f(\vec{x}))=y$ and $|yf(\vec{x})|\geq 1$.
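The bound can be verified over a grid of margins (an illustrative sketch; the 0-1 loss is written as $H(-v)$ with the convention $H(0)=1$):

```python
# Hinge loss as a convex upper bound on the 0-1 loss, both as
# functions of the margin v = y * f(x).
hinge = lambda v: max(0.0, 1.0 - v)
zero_one = lambda v: 1.0 if v <= 0 else 0.0  # H(-v), convention H(0) = 1

margins = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
# hinge(v) >= zero_one(v) everywhere; they coincide once v >= 1
```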
While the hinge loss function is both convex and continuous, it is not smooth (not differentiable) at $yf(\vec{x})=1$. Consequently, the hinge loss function cannot be used with gradient descent methods which rely on differentiability over the entire domain. However, the hinge loss does have a subgradient at $yf(\vec{x})=1$, which allows for the utilization of subgradient descent methods.
The minimizer of $I[f]$ for the hinge loss function is
\[
f^{*}_{\text{Hinge}}(\vec{x}) = \begin{cases} 1 & \text{if } p(1\mid\vec{x})>p(-1\mid\vec{x}) \\ -1 & \text{if } p(1\mid\vec{x})<p(-1\mid\vec{x}) \end{cases}
\]
when $p(1\mid\vec{x})\neq 0.5$, which matches that of the 0–1 indicator function. Note that the hinge loss cannot be derived from (2), since $f^{*}_{\text{Hinge}}$ is not invertible.
The generalized smooth hinge loss function with parameter $\alpha$ is defined as
\[
f^{*}_{\alpha}(z) = \begin{cases} \frac{\alpha}{\alpha+1}-z & \text{if } z\leq 0 \\ \frac{1}{\alpha+1}z^{\alpha+1}-z+\frac{\alpha}{\alpha+1} & \text{if } 0<z<1 \\ 0 & \text{if } z\geq 1 \end{cases},
\]
where $z=yf(\vec{x})$. It is monotonically decreasing and reaches 0 when $z=1$.
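The piecewise definition above translates directly into code (an illustrative sketch with $\alpha=2$ as the default):

```python
def smooth_hinge(z, alpha=2.0):
    """Generalized smooth hinge loss with parameter alpha, z = y*f(x)."""
    if z <= 0:
        return alpha / (alpha + 1) - z
    if z < 1:
        return z ** (alpha + 1) / (alpha + 1) - z + alpha / (alpha + 1)
    return 0.0
```

The three pieces join continuously, the loss decreases monotonically, and it vanishes for all $z\geq 1$.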