In statistics and statistical modeling, least-squares support-vector machines (LS-SVM) are least-squares versions of support-vector machines (SVM), a family of related supervised learning methods that analyze data and recognize patterns and that are used for classification and regression analysis. In this version the solution is found by solving a set of linear equations instead of the convex quadratic programming (QP) problem required for classical SVMs. Least-squares SVM classifiers were proposed by Johan Suykens and Joos Vandewalle.[1] LS-SVMs are a class of kernel-based learning methods.
Given a training set $\{x_i, y_i\}_{i=1}^{N}$ with input data $x_i \in \mathbb{R}^{n}$ and corresponding binary class labels $y_i \in \{-1, +1\}$, the SVM classifier, according to Vapnik's original formulation, satisfies the following conditions:

$\begin{cases} w^{T} \phi(x_i) + b \ge 1, & \text{if } y_i = +1, \\ w^{T} \phi(x_i) + b \le -1, & \text{if } y_i = -1, \end{cases}$
which is equivalent to
$y_i \left[ w^{T} \phi(x_i) + b \right] \ge 1, \quad i = 1, \ldots, N,$

where $\phi(x)$ is the nonlinear map from the original space to a high- or infinite-dimensional feature space.
In case such a separating hyperplane does not exist, we introduce so-called slack variables
$\xi_i$ such that

$\begin{cases} y_i \left[ w^{T} \phi(x_i) + b \right] \ge 1 - \xi_i, & i = 1, \ldots, N, \\ \xi_i \ge 0, & i = 1, \ldots, N. \end{cases}$
According to the structural risk minimization principle, the risk bound is minimized by the following minimization problem:
$\min J_1(w, \xi) = \frac{1}{2} w^{T} w + c \sum\limits_{i=1}^{N} \xi_i,$

subject to

$\begin{cases} y_i \left[ w^{T} \phi(x_i) + b \right] \ge 1 - \xi_i, & i = 1, \ldots, N, \\ \xi_i \ge 0, & i = 1, \ldots, N. \end{cases}$
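For concreteness, this primal problem can be set up directly with a generic convex solver. The following is a minimal sketch, assuming a linear feature map $\phi(x) = x$ and the CVXPY library; the data and the value of $c$ are illustrative placeholders.

```python
import numpy as np
import cvxpy as cp

# Toy data (illustrative): two Gaussian blobs with labels -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
N, n = X.shape
c = 1.0  # regularization constant

# Primal soft-margin SVM:
#   min 1/2 w^T w + c * sum(xi)   s.t.   y_i (w^T x_i + b) >= 1 - xi_i,  xi_i >= 0.
w = cp.Variable(n)
b = cp.Variable()
xi = cp.Variable(N, nonneg=True)
objective = cp.Minimize(0.5 * cp.sum_squares(w) + c * cp.sum(xi))
constraints = [cp.multiply(y, X @ w + b) >= 1 - xi]
cp.Problem(objective, constraints).solve()

print("w =", w.value, "b =", b.value)
```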
To solve this problem, we could construct the Lagrangian function:
$L_1(w, b, \xi; \alpha, \beta) = \frac{1}{2} w^{T} w + c \sum\limits_{i=1}^{N} \xi_i - \sum\limits_{i=1}^{N} \alpha_i \left\{ y_i \left[ w^{T} \phi(x_i) + b \right] - 1 + \xi_i \right\} - \sum\limits_{i=1}^{N} \beta_i \xi_i,$

where $\alpha_i \ge 0,\ \beta_i \ge 0\ (i = 1, \ldots, N)$ are the Lagrange multipliers.
By substituting $w$ by its expression obtained from the optimality conditions of the Lagrangian, we obtain the following dual quadratic programming problem:

$\max Q_1(\alpha) = -\frac{1}{2} \sum\limits_{i,j=1}^{N} \alpha_i \alpha_j y_i y_j K(x_i, x_j) + \sum\limits_{i=1}^{N} \alpha_i,$
where $K(x_i, x_j) = \left\langle \phi(x_i), \phi(x_j) \right\rangle$ is the kernel function. Solving this QP problem subject to the constraints $0 \le \alpha_i \le c$ and $\sum_{i=1}^{N} \alpha_i y_i = 0$ yields the separating hyperplane in the high-dimensional feature space and hence the classifier in the original space.
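As an illustration, the dual problem can be handed to a generic QP solver. The sketch below is a minimal example assuming an RBF kernel and the CVXPY library; the data, $c$ and $\sigma$ are placeholders.

```python
import numpy as np
import cvxpy as cp

# Toy data (illustrative): two Gaussian blobs with labels -1 and +1.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
N = len(y)
c = 1.0       # box constraint from the primal regularization constant
sigma = 1.0   # assumed RBF bandwidth

# Kernel matrix K(x_i, x_j) = exp(-||x_i - x_j||^2 / sigma^2).
sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
K = np.exp(-sq_dists / sigma**2)

# Dual: max -1/2 sum_ij a_i a_j y_i y_j K_ij + sum_i a_i,  s.t. 0 <= a <= c, sum_i a_i y_i = 0.
Q = np.outer(y, y) * K + 1e-8 * np.eye(N)  # small jitter keeps Q numerically PSD
a = cp.Variable(N)
objective = cp.Maximize(-0.5 * cp.quad_form(a, Q) + cp.sum(a))
constraints = [a >= 0, a <= c, y @ a == 0]
cp.Problem(objective, constraints).solve()

alpha = a.value
print("number of support vectors:", int(np.sum(alpha > 1e-6)))
```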
The least-squares version of the SVM classifier is obtained by reformulating the minimization problem as
$\min J_2(w, b, e) = \frac{\mu}{2} w^{T} w + \frac{\zeta}{2} \sum\limits_{i=1}^{N} e_i^{2},$

subject to the equality constraints

$y_i \left[ w^{T} \phi(x_i) + b \right] = 1 - e_i, \quad i = 1, \ldots, N.$
The least-squares SVM (LS-SVM) classifier formulation above implicitly corresponds to a regression interpretation with binary targets
$y_i = \pm 1$.
Using $y_i^{2} = 1$, we have

$\sum\limits_{i=1}^{N} e_i^{2} = \sum\limits_{i=1}^{N} \left( y_i e_i \right)^{2} = \sum\limits_{i=1}^{N} \left( y_i - \left( w^{T} \phi(x_i) + b \right) \right)^{2},$

with $e_i = y_i - \left( w^{T} \phi(x_i) + b \right)$.
Hence the LS-SVM classifier formulation is equivalent to
$J_2(w, b, e) = \mu E_W + \zeta E_D,$

with

$E_W = \frac{1}{2} w^{T} w, \qquad E_D = \frac{1}{2} \sum\limits_{i=1}^{N} e_i^{2} = \frac{1}{2} \sum\limits_{i=1}^{N} \left( y_i - \left( w^{T} \phi(x_i) + b \right) \right)^{2}.$
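To make the regression interpretation concrete, the sketch below minimizes $\mu E_W + \zeta E_D$ directly for a linear feature map $\phi(x) = x$, in which case the problem reduces to ridge regression on the $\pm 1$ targets (a minimal illustration; the data and the values of $\mu$ and $\zeta$ are placeholders).

```python
import numpy as np

# Toy data (illustrative): two Gaussian blobs with +/-1 targets.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
N, n = X.shape
mu, zeta = 1.0, 10.0  # placeholder hyperparameters

# Minimize (mu/2) w^T w + (zeta/2) sum_i (y_i - w^T x_i - b)^2   (linear phi).
# Stack [w; b], solve the normal equations; the bias b is not regularized.
A = np.hstack([X, np.ones((N, 1))])            # augmented design matrix
R = np.diag(np.append(mu * np.ones(n), 0.0))   # regularize w only, not b
wb = np.linalg.solve(zeta * A.T @ A + R, zeta * A.T @ y)
w, b = wb[:n], wb[n]

pred = np.sign(X @ w + b)
print("training accuracy:", np.mean(pred == y))
```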
Both $\mu$ and $\zeta$ should be considered as hyperparameters that tune the amount of regularization versus the sum of squared errors. The solution depends only on the ratio $\gamma = \zeta / \mu$; therefore the original formulation uses only $\gamma$ as a tuning parameter. We keep both $\mu$ and $\zeta$ as parameters in order to provide a Bayesian interpretation of LS-SVM.
The solution of the LS-SVM regressor is obtained by constructing the Lagrangian function:
$L_2(w, b, e; \alpha) = J_2(w, e) - \sum\limits_{i=1}^{N} \alpha_i \left\{ \left[ w^{T} \phi(x_i) + b \right] + e_i - y_i \right\} = \frac{1}{2} w^{T} w + \frac{\gamma}{2} \sum\limits_{i=1}^{N} e_i^{2} - \sum\limits_{i=1}^{N} \alpha_i \left\{ \left[ w^{T} \phi(x_i) + b \right] + e_i - y_i \right\},$

where $\alpha_i \in \mathbb{R}$ are the Lagrange multipliers. The conditions for optimality are
$\begin{cases} \dfrac{\partial L_2}{\partial w} = 0 \quad \to \quad w = \sum\limits_{i=1}^{N} \alpha_i \phi(x_i), \\ \dfrac{\partial L_2}{\partial b} = 0 \quad \to \quad \sum\limits_{i=1}^{N} \alpha_i = 0, \\ \dfrac{\partial L_2}{\partial e_i} = 0 \quad \to \quad \alpha_i = \gamma e_i, \quad i = 1, \ldots, N, \\ \dfrac{\partial L_2}{\partial \alpha_i} = 0 \quad \to \quad y_i = w^{T} \phi(x_i) + b + e_i, \quad i = 1, \ldots, N. \end{cases}$
Elimination of $w$ and $e$ yields a linear system instead of a quadratic programming problem:

$\left[ \begin{matrix} 0 & 1_N^{T} \\ 1_N & \Omega + \gamma^{-1} I_N \end{matrix} \right] \left[ \begin{matrix} b \\ \alpha \end{matrix} \right] = \left[ \begin{matrix} 0 \\ Y \end{matrix} \right],$
with $Y = [y_1, \ldots, y_N]^{T}$, $1_N = [1, \ldots, 1]^{T}$, $\alpha = [\alpha_1, \ldots, \alpha_N]^{T}$, and $I_N$ the $N \times N$ identity matrix. Here $\Omega \in \mathbb{R}^{N \times N}$ is the kernel matrix defined by $\Omega_{ij} = \phi(x_i)^{T} \phi(x_j) = K(x_i, x_j)$.
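This linear system can be solved directly with standard dense linear algebra. The following is a minimal NumPy sketch, assuming an RBF kernel; the data and the values of $\gamma$ and $\sigma$ are placeholders.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """K(x, z) = exp(-||x - z||^2 / sigma^2)."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / sigma**2)

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """Solve [[0, 1^T], [1, Omega + I/gamma]] [b; alpha] = [0; Y]."""
    N = len(y)
    Omega = rbf_kernel(X, X, sigma)
    A = np.zeros((N + 1, N + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = Omega + np.eye(N) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]  # b, alpha

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    """Classifier: sign( sum_i alpha_i K(x, x_i) + b ), using w = sum_i alpha_i phi(x_i)."""
    return np.sign(rbf_kernel(X_new, X_train, sigma) @ alpha + b)

# Toy usage (illustrative data).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 1, (20, 2)), rng.normal(+1, 1, (20, 2))])
y = np.hstack([-np.ones(20), np.ones(20)])
b, alpha = lssvm_train(X, y)
print("training accuracy:", np.mean(lssvm_predict(X, alpha, b, X) == y))
```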
For the kernel function K(•, •) one typically has the following choices:
Linear kernel: $K(x, x_i) = x_i^{T} x,$
Polynomial kernel of degree $d$: $K(x, x_i) = \left( 1 + x_i^{T} x / c \right)^{d},$
Radial basis function (RBF) kernel: $K(x, x_i) = \exp\left( -\left\| x - x_i \right\|^{2} / \sigma^{2} \right),$
MLP kernel: $K(x, x_i) = \tanh\left( k\, x_i^{T} x + \theta \right),$
where $d \in \mathbb{N}$ is the polynomial degree and $c, \sigma \in \mathbb{R}^{+}$, $k$ and $\theta$ are kernel parameters. The Mercer condition holds for all $c, \sigma \in \mathbb{R}^{+}$ and $d \in \mathbb{N}$ in the polynomial and RBF case, but not for all possible choices of $k$ and $\theta$ in the MLP case. The scale parameters $c$, $\sigma$ and $k$ determine the scaling of the inputs in the polynomial, RBF and MLP kernel functions, respectively. This scaling is related to the bandwidth of the kernel, which is an important parameter for the generalization behavior of a kernel method.
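A minimal sketch of these four kernels in NumPy (the inputs are 1-D arrays and the parameter values are illustrative):

```python
import numpy as np

def linear_kernel(x, xi):
    return xi @ x

def polynomial_kernel(x, xi, c=1.0, d=3):
    return (1.0 + xi @ x / c) ** d

def rbf_kernel(x, xi, sigma=1.0):
    return np.exp(-np.sum((x - xi) ** 2) / sigma**2)

def mlp_kernel(x, xi, k=1.0, theta=0.0):
    # Positive definite only for certain choices of (k, theta).
    return np.tanh(k * (xi @ x) + theta)

x, xi = np.array([1.0, 2.0]), np.array([0.5, -1.0])
print(linear_kernel(x, xi), polynomial_kernel(x, xi), rbf_kernel(x, xi), mlp_kernel(x, xi))
```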
A Bayesian interpretation of the SVM has been proposed by Smola et al. They showed that the use of different kernels in SVM can be regarded as defining different prior probability distributions on the functional space, as $P[f] \propto \exp\left( -\beta \left\| \hat{P} f \right\|^{2} \right)$, where $\beta > 0$ is a constant and $\hat{P}$ is the regularization operator corresponding to the selected kernel.
A general Bayesian evidence framework was developed by MacKay,[3] [4] [5] who applied it to regression, feed-forward neural networks and classification networks. Given a data set $D$ and a model $M$ with parameter vector $w$ and a so-called hyperparameter or regularization parameter $\lambda$, Bayesian inference is constructed on three distinct levels of inference:

In level 1, for a given value of $\lambda$, the first level of inference infers the posterior distribution of $w$ by Bayes' rule:
$p(w | D, \lambda, M) \propto p(D | w, M)\, p(w | \lambda, M).$
The second level of inference determines the value of $\lambda$ by maximizing
$p(\lambda | D, M) \propto p(D | \lambda, M)\, p(\lambda | M).$
The third level of inference ranks different models by examining their posterior probabilities
$p(M | D) \propto p(D | M)\, p(M).$
The Bayesian evidence framework thus provides a unified theory for learning the model and for model selection. Kwok used the Bayesian evidence framework to interpret the formulation of the SVM and to perform model selection, and he also applied it to support vector regression.
Now, given the data points $\{x_i, y_i\}_{i=1}^{N}$ and the hyperparameters $\mu$ and $\zeta$ of the model $M$, the model parameters $w$ and $b$ are estimated by maximizing the posterior $p(w, b | D, \log\mu, \log\zeta, M)$. Applying Bayes' rule, we obtain

$p(w, b | D, \log\mu, \log\zeta, M) = \frac{p(D | w, b, \log\mu, \log\zeta, M)\, p(w, b | \log\mu, \log\zeta, M)}{p(D | \log\mu, \log\zeta, M)},$
where $p(D | \log\mu, \log\zeta, M)$ is a normalizing constant that does not depend on $w$ and $b$. We assume that $w$ and $b$ are independent of the hyperparameter $\zeta$ and are conditionally independent of each other, i.e., we assume

$p(w, b | \log\mu, \log\zeta, M) = p(w | \log\mu, M)\, p(b | \log\sigma_b, M).$
When $\sigma_b \to \infty$, the prior on $b$ approaches a uniform distribution. Furthermore, we assume that $w$ and $b$ are Gaussian distributed, so the prior distribution of $w$ and $b$ with $\sigma_b \to \infty$ becomes

$p(w, b | \log\mu) = \left( \frac{\mu}{2\pi} \right)^{n_f / 2} \exp\left( -\frac{\mu}{2} w^{T} w \right) \frac{1}{\sqrt{2\pi}\, \sigma_b} \exp\left( -\frac{b^{2}}{2\sigma_b^{2}} \right) \approx \left( \frac{\mu}{2\pi} \right)^{n_f / 2} \exp\left( -\frac{\mu}{2} w^{T} w \right).$

Here $n_f$ is the dimension of the feature space, i.e., the dimension of $w$.
The likelihood $p(D | w, b, \log\mu, \log\zeta, M)$ depends only on $w$, $b$, $\zeta$ and $M$. Assuming that the data points are independent of each other, it factorizes as

$p(D | w, b, \log\zeta, M) = \prod\limits_{i=1}^{N} p(x_i, y_i | w, b, \log\zeta, M).$
In order to obtain the least-squares cost function, it is assumed that the probability of a data point is proportional to

$p(x_i, y_i | w, b, \log\zeta, M) \propto p(e_i | w, b, \log\zeta, M).$
A Gaussian distribution is taken for the errors $e_i = y_i - \left( w^{T} \phi(x_i) + b \right)$:

$p(e_i | w, b, \log\zeta, M) = \sqrt{\frac{\zeta}{2\pi}} \exp\left( -\frac{\zeta e_i^{2}}{2} \right).$
It is assumed that $w$ and $b$ are determined in such a way that the class centers $\hat{m}_-$ and $\hat{m}_+$ are mapped onto the targets $-1$ and $+1$, respectively. The projections $w^{T} \phi(x) + b$ of the class elements $\phi(x)$ then follow a multivariate Gaussian distribution with variance $1/\zeta$.
Combining the preceding expressions, and neglecting all constants, Bayes’ rule becomes
$p(w, b | D, \log\mu, \log\zeta, M) \propto \exp\left( -\frac{\mu}{2} w^{T} w - \frac{\zeta}{2} \sum\limits_{i=1}^{N} e_i^{2} \right) = \exp\left( -J_2(w, b) \right).$
The maximum posterior density estimates $w_{MP}$ and $b_{MP}$ are therefore obtained by minimizing the negative logarithm of this posterior, which is exactly the LS-SVM cost function $J_2(w, b)$.
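Spelled out, the negative log-posterior is, up to an additive constant independent of $w$ and $b$,

$-\log p(w, b | D, \log\mu, \log\zeta, M) = \frac{\mu}{2} w^{T} w + \frac{\zeta}{2} \sum\limits_{i=1}^{N} e_i^{2} + \text{const} = J_2(w, b) + \text{const},$

so maximizing the posterior over $(w, b)$ is equivalent to minimizing $J_2(w, b)$, and $(w_{MP}, b_{MP})$ coincides with the LS-SVM solution.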