In statistics, kernel Fisher discriminant analysis (KFD),[1] also known as generalized discriminant analysis[2] and kernel discriminant analysis,[3] is a kernelized version of linear discriminant analysis (LDA). It is named after Ronald Fisher.
Intuitively, the idea of LDA is to find a projection where class separation is maximized. Given two sets of labeled data, $\mathbf{C}_1$ and $\mathbf{C}_2$, we can calculate the mean value of each class, $\mathbf{m}_1$ and $\mathbf{m}_2$, as

$$\mathbf{m}_i = \frac{1}{l_i} \sum_{n=1}^{l_i} \mathbf{x}_n^i,$$

where $l_i$ is the number of examples of class $\mathbf{C}_i$. The goal of linear discriminant analysis is to give a large separation of the class means while keeping the in-class variance small. This is formulated as maximizing, with respect to $\mathbf{w}$, the following ratio:
$$J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}} \mathbf{S}_B \mathbf{w}}{\mathbf{w}^{\text{T}} \mathbf{S}_W \mathbf{w}},$$
where $\mathbf{S}_B$ is the between-class covariance matrix and $\mathbf{S}_W$ is the total within-class covariance matrix:

$$\begin{align} \mathbf{S}_B &= (\mathbf{m}_2 - \mathbf{m}_1)(\mathbf{m}_2 - \mathbf{m}_1)^{\text{T}} \\ \mathbf{S}_W &= \sum_{i=1,2} \sum_{n=1}^{l_i} (\mathbf{x}_n^i - \mathbf{m}_i)(\mathbf{x}_n^i - \mathbf{m}_i)^{\text{T}}. \end{align}$$
The maximum of the above ratio is attained at

$$\mathbf{w} \propto \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1),$$
as can be shown by the Lagrange multiplier method (sketch of proof): maximizing

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}} \mathbf{S}_B \mathbf{w}}{\mathbf{w}^{\text{T}} \mathbf{S}_W \mathbf{w}}$$

is equivalent to maximizing $\mathbf{w}^{\text{T}} \mathbf{S}_B \mathbf{w}$ subject to

$$\mathbf{w}^{\text{T}} \mathbf{S}_W \mathbf{w} = 1,$$

since $J$ is invariant to rescaling of $\mathbf{w}$. This, in turn, is equivalent to maximizing

$$I(\mathbf{w},\lambda) = \mathbf{w}^{\text{T}} \mathbf{S}_B \mathbf{w} - \lambda (\mathbf{w}^{\text{T}} \mathbf{S}_W \mathbf{w} - 1),$$

where $\lambda$ is the Lagrange multiplier.
At the maximum, the derivatives of $I(\mathbf{w},\lambda)$ with respect to $\mathbf{w}$ and $\lambda$ must be zero. Taking $\frac{dI}{d\mathbf{w}} = 0$ gives

$$\mathbf{S}_B \mathbf{w} - \lambda \mathbf{S}_W \mathbf{w} = 0,$$
which is trivially satisfied by

$$\mathbf{w} = c\,\mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1) \quad \text{and} \quad \lambda = (\mathbf{m}_2 - \mathbf{m}_1)^{\text{T}} \mathbf{S}_W^{-1} (\mathbf{m}_2 - \mathbf{m}_1).$$
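As a concrete illustration, the closed-form direction above can be computed directly; the following is a minimal sketch in Python with NumPy, where the function name and the use of `np.linalg.solve` (rather than an explicit matrix inverse) are our own choices:

```python
import numpy as np

def lda_direction(X1, X2):
    """Fisher's discriminant direction w ∝ S_W^{-1} (m2 - m1).

    X1, X2: arrays of shape (l_1, d) and (l_2, d) holding the
    examples of classes C_1 and C_2 as rows.
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter: S_W = sum_i sum_n (x_n^i - m_i)(x_n^i - m_i)^T
    S_W = sum((X - m).T @ (X - m) for X, m in ((X1, m1), (X2, m2)))
    # Solving S_W w = m2 - m1 avoids forming S_W^{-1} explicitly
    w = np.linalg.solve(S_W, m2 - m1)
    return w / np.linalg.norm(w)  # only the direction matters
```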
To extend LDA to non-linear mappings, the data, given as the $\ell$ points $\mathbf{x}_i$, can be mapped to a new feature space, $F$, via some function $\phi$. In this new feature space, the function that needs to be maximized is

$$J(\mathbf{w}) = \frac{\mathbf{w}^{\text{T}} \mathbf{S}_B^\phi \mathbf{w}}{\mathbf{w}^{\text{T}} \mathbf{S}_W^\phi \mathbf{w}},$$
where

$$\begin{align} \mathbf{S}_B^\phi &= \left(\mathbf{m}_2^\phi - \mathbf{m}_1^\phi\right)\left(\mathbf{m}_2^\phi - \mathbf{m}_1^\phi\right)^{\text{T}} \\ \mathbf{S}_W^\phi &= \sum_{i=1,2} \sum_{n=1}^{l_i} \left(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^\phi\right)\left(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^\phi\right)^{\text{T}}, \end{align}$$

and

$$\mathbf{m}_i^\phi = \frac{1}{l_i} \sum_{j=1}^{l_i} \phi(\mathbf{x}_j^i).$$
Further, note that $\mathbf{w} \in F$. Explicitly computing the mappings $\phi(\mathbf{x}_i)$ and then performing LDA can be computationally expensive, and in many cases intractable; for example, $F$ may be infinite-dimensional. Thus, rather than explicitly mapping the data to $F$, the data can be implicitly embedded by rewriting the algorithm in terms of dot products and using the kernel trick, in which the dot product in the new feature space is replaced by a kernel function, $k(\mathbf{x},\mathbf{y}) = \phi(\mathbf{x}) \cdot \phi(\mathbf{y})$.
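For instance, the Gaussian (RBF) kernel, a common choice (our example, not one prescribed by the text), corresponds to an infinite-dimensional $F$ yet is cheap to evaluate; a sketch of the kernel and the resulting Gram matrix:

```python
import numpy as np

def rbf_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2)).

    Its feature map phi is infinite-dimensional, but k itself is
    evaluated directly, so phi never has to be computed.
    """
    return np.exp(-np.sum((np.asarray(x) - np.asarray(y)) ** 2) / (2.0 * sigma**2))

def gram_matrix(X, Y, kernel=rbf_kernel):
    """Matrix of pairwise kernel values k(x_n, y_m)."""
    return np.array([[kernel(x, y) for y in Y] for x in X])
```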
LDA can be reformulated in terms of dot products by first noting that $\mathbf{w}$ will have an expansion of the form

$$\mathbf{w} = \sum_{i=1}^{\ell} \alpha_i \phi(\mathbf{x}_i).$$
Then note that

$$\mathbf{w}^{\text{T}} \mathbf{m}_i^\phi = \frac{1}{l_i} \sum_{j=1}^{\ell} \sum_{k=1}^{l_i} \alpha_j k(\mathbf{x}_j, \mathbf{x}_k^i) = \boldsymbol{\alpha}^{\text{T}} \mathbf{M}_i,$$

where

$$(\mathbf{M}_i)_j = \frac{1}{l_i} \sum_{k=1}^{l_i} k(\mathbf{x}_j, \mathbf{x}_k^i).$$
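In code, each $\mathbf{M}_i$ is simply the mean over the class-$i$ columns of the full Gram matrix; a one-function sketch (the names `K` and `class_idx` are our own):

```python
import numpy as np

def kernel_class_mean(K, class_idx):
    """(M_i)_j = (1 / l_i) * sum_k k(x_j, x_k^i).

    K: (l, l) Gram matrix over all training points.
    class_idx: indices of the training points in class i.
    """
    return K[:, class_idx].mean(axis=1)
```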
The numerator of $J(\mathbf{w})$ can then be written as

$$\mathbf{w}^{\text{T}} \mathbf{S}_B^\phi \mathbf{w} = \mathbf{w}^{\text{T}} \left(\mathbf{m}_2^\phi - \mathbf{m}_1^\phi\right)\left(\mathbf{m}_2^\phi - \mathbf{m}_1^\phi\right)^{\text{T}} \mathbf{w} = \boldsymbol{\alpha}^{\text{T}} \mathbf{M} \boldsymbol{\alpha}, \quad \text{where } \mathbf{M} = (\mathbf{M}_2 - \mathbf{M}_1)(\mathbf{M}_2 - \mathbf{M}_1)^{\text{T}}.$$
Similarly, the denominator can be written as

$$\mathbf{w}^{\text{T}} \mathbf{S}_W^\phi \mathbf{w} = \boldsymbol{\alpha}^{\text{T}} \mathbf{N} \boldsymbol{\alpha}, \quad \text{where } \mathbf{N} = \sum_{j=1,2} \mathbf{K}_j (\mathbf{I} - \mathbf{1}_{l_j}) \mathbf{K}_j^{\text{T}},$$

with the $n^{\text{th}},m^{\text{th}}$ component of $\mathbf{K}_j$ defined as $k(\mathbf{x}_n, \mathbf{x}_m^j)$, $\mathbf{I}$ the identity matrix, and $\mathbf{1}_{l_j}$ the matrix with all entries equal to $1/l_j$. This identity can be derived by starting out with the expression for $\mathbf{w}^{\text{T}} \mathbf{S}_W^\phi \mathbf{w}$ and using the expansion of $\mathbf{w}$ and the definitions of $\mathbf{S}_W^\phi$ and $\mathbf{m}_i^\phi$:
$$\begin{align}
\mathbf{w}^{\text{T}} \mathbf{S}_W^\phi \mathbf{w} &= \left(\sum_{i=1}^{\ell} \alpha_i \phi^{\text{T}}(\mathbf{x}_i)\right) \left(\sum_{j=1,2} \sum_{n=1}^{l_j} \left(\phi(\mathbf{x}_n^j) - \mathbf{m}_j^\phi\right) \left(\phi(\mathbf{x}_n^j) - \mathbf{m}_j^\phi\right)^{\text{T}}\right) \left(\sum_{k=1}^{\ell} \alpha_k \phi(\mathbf{x}_k)\right) \\
&= \sum_{j=1,2} \sum_{n=1}^{l_j} \sum_{i=1}^{\ell} \sum_{k=1}^{\ell} \left(\alpha_i \phi^{\text{T}}(\mathbf{x}_i) \left(\phi(\mathbf{x}_n^j) - \mathbf{m}_j^\phi\right) \left(\phi(\mathbf{x}_n^j) - \mathbf{m}_j^\phi\right)^{\text{T}} \alpha_k \phi(\mathbf{x}_k)\right) \\
&= \sum_{j=1,2} \sum_{n=1}^{l_j} \sum_{i=1}^{\ell} \sum_{k=1}^{\ell} \left(\alpha_i k(\mathbf{x}_i, \mathbf{x}_n^j) - \frac{1}{l_j} \sum_{p=1}^{l_j} \alpha_i k(\mathbf{x}_i, \mathbf{x}_p^j)\right) \left(\alpha_k k(\mathbf{x}_k, \mathbf{x}_n^j) - \frac{1}{l_j} \sum_{q=1}^{l_j} \alpha_k k(\mathbf{x}_k, \mathbf{x}_q^j)\right) \\
&= \sum_{j=1,2} \left(\sum_{n=1}^{l_j} \sum_{i=1}^{\ell} \sum_{k=1}^{\ell} \left(\alpha_i \alpha_k k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_n^j) - \frac{2 \alpha_i \alpha_k}{l_j} \sum_{p=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_p^j) + \frac{\alpha_i \alpha_k}{l_j^2} \sum_{p=1}^{l_j} \sum_{q=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_p^j) k(\mathbf{x}_k, \mathbf{x}_q^j)\right)\right) \\
&= \sum_{j=1,2} \left(\sum_{n=1}^{l_j} \sum_{i=1}^{\ell} \sum_{k=1}^{\ell} \left(\alpha_i \alpha_k k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_n^j) - \frac{\alpha_i \alpha_k}{l_j} \sum_{p=1}^{l_j} k(\mathbf{x}_i, \mathbf{x}_n^j) k(\mathbf{x}_k, \mathbf{x}_p^j)\right)\right) \\[6pt]
&= \sum_{j=1,2} \boldsymbol{\alpha}^{\text{T}} \mathbf{K}_j \mathbf{K}_j^{\text{T}} \boldsymbol{\alpha} - \boldsymbol{\alpha}^{\text{T}} \mathbf{K}_j \mathbf{1}_{l_j} \mathbf{K}_j^{\text{T}} \boldsymbol{\alpha} \\[4pt]
&= \boldsymbol{\alpha}^{\text{T}} \mathbf{N} \boldsymbol{\alpha}.
\end{align}$$
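The identity can also be checked numerically. The sketch below uses the polynomial kernel $k(\mathbf{x},\mathbf{y}) = (\mathbf{x} \cdot \mathbf{y})^2$, whose feature map is known in closed form, and compares both sides of $\mathbf{w}^{\text{T}} \mathbf{S}_W^\phi \mathbf{w} = \boldsymbol{\alpha}^{\text{T}} \mathbf{N} \boldsymbol{\alpha}$ on random data (all data and variable names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=(5, 2)), rng.normal(size=(7, 2))
X = np.vstack([X1, X2])
alpha = rng.normal(size=len(X))

def phi(x):
    """Explicit feature map of k(x, y) = (x . y)^2 for 2-D inputs."""
    return np.array([x[0]**2, np.sqrt(2) * x[0] * x[1], x[1]**2])

# Left side: w^T S_W^phi w computed explicitly in feature space
w = np.array([phi(x) for x in X]).T @ alpha
S_W = np.zeros((3, 3))
for Xj in (X1, X2):
    C = np.array([phi(x) for x in Xj])
    C -= C.mean(axis=0)
    S_W += C.T @ C
lhs = w @ S_W @ w

# Right side: alpha^T N alpha with N = sum_j K_j (I - 1_{l_j}) K_j^T
K = (X @ X.T) ** 2
N, start = np.zeros((len(X), len(X))), 0
for Xj in (X1, X2):
    l_j = len(Xj)
    K_j = K[:, start:start + l_j]
    N += K_j @ (np.eye(l_j) - np.full((l_j, l_j), 1 / l_j)) @ K_j.T
    start += l_j
rhs = alpha @ N @ alpha

assert np.isclose(lhs, rhs)  # both sides agree to numerical precision
```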
With these equations for the numerator and denominator of $J(\mathbf{w})$, the objective can be rewritten as

$$J(\boldsymbol{\alpha}) = \frac{\boldsymbol{\alpha}^{\text{T}} \mathbf{M} \boldsymbol{\alpha}}{\boldsymbol{\alpha}^{\text{T}} \mathbf{N} \boldsymbol{\alpha}}.$$
Then, differentiating and setting equal to zero gives
$$(\boldsymbol{\alpha}^{\text{T}} \mathbf{M} \boldsymbol{\alpha})\, \mathbf{N} \boldsymbol{\alpha} = (\boldsymbol{\alpha}^{\text{T}} \mathbf{N} \boldsymbol{\alpha})\, \mathbf{M} \boldsymbol{\alpha}.$$
Since only the direction of $\mathbf{w}$, and hence the direction of $\boldsymbol{\alpha}$, matters, the above can be solved for $\boldsymbol{\alpha}$ as

$$\boldsymbol{\alpha} = \mathbf{N}^{-1} (\mathbf{M}_2 - \mathbf{M}_1).$$
Note that in practice, $\mathbf{N}$ is usually singular, so a multiple of the identity is added to it:

$$\mathbf{N}_\epsilon = \mathbf{N} + \epsilon \mathbf{I}.$$
Given the solution for $\boldsymbol{\alpha}$, the projection of a new data point is given by

$$y(\mathbf{x}) = (\mathbf{w} \cdot \phi(\mathbf{x})) = \sum_{i=1}^{\ell} \alpha_i k(\mathbf{x}_i, \mathbf{x}).$$
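Putting the pieces together, a compact two-class KFD sketch (the function names, the index-array interface, and the default $\epsilon$ are assumptions of this example):

```python
import numpy as np

def kfd_two_class(K, idx1, idx2, eps=1e-3):
    """Two-class kernel Fisher discriminant: alpha = N_eps^{-1} (M_2 - M_1).

    K: (l, l) Gram matrix over all training points.
    idx1, idx2: index arrays of the points in classes C_1 and C_2.
    """
    M1, M2 = K[:, idx1].mean(axis=1), K[:, idx2].mean(axis=1)
    N = eps * np.eye(K.shape[0])            # regularization N_eps = N + eps*I
    for idx in (idx1, idx2):
        K_j, l_j = K[:, idx], len(idx)
        N += K_j @ (np.eye(l_j) - np.full((l_j, l_j), 1 / l_j)) @ K_j.T
    return np.linalg.solve(N, M2 - M1)

def project(alpha, k_vec):
    """y(x) = sum_i alpha_i k(x_i, x); k_vec holds the values k(x_i, x)."""
    return alpha @ k_vec
```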
The extension to cases where there are more than two classes is relatively straightforward.[6][7] Let $c$ be the number of classes. Then multi-class KFD involves projecting the data into a $(c-1)$-dimensional space using $(c-1)$ discriminant functions

$$y_i = \mathbf{w}_i^{\text{T}} \phi(\mathbf{x}), \qquad i = 1, \ldots, c-1.$$
This can be written in matrix notation as

$$\mathbf{y} = \mathbf{W}^{\text{T}} \phi(\mathbf{x}),$$

where the $\mathbf{w}_i$ are the columns of $\mathbf{W}$. Further, the between-class covariance matrix is now

$$\mathbf{S}_B^\phi = \sum_{i=1}^{c} l_i (\mathbf{m}_i^\phi - \mathbf{m}^\phi)(\mathbf{m}_i^\phi - \mathbf{m}^\phi)^{\text{T}},$$
where $\mathbf{m}^\phi$ is the mean of all the data in the new feature space, and the within-class covariance matrix is

$$\mathbf{S}_W^\phi = \sum_{i=1}^{c} \sum_{n=1}^{l_i} (\phi(\mathbf{x}_n^i) - \mathbf{m}_i^\phi)(\phi(\mathbf{x}_n^i) - \mathbf{m}_i^\phi)^{\text{T}}.$$
The solution is now obtained by maximizing

$$J(\mathbf{W}) = \frac{\left|\mathbf{W}^{\text{T}} \mathbf{S}_B^\phi \mathbf{W}\right|}{\left|\mathbf{W}^{\text{T}} \mathbf{S}_W^\phi \mathbf{W}\right|}.$$
The kernel trick can again be used, and the goal of multi-class KFD becomes

$$\mathbf{A}^* = \underset{\mathbf{A}}{\operatorname{argmax}} \frac{\left|\mathbf{A}^{\text{T}} \mathbf{M} \mathbf{A}\right|}{\left|\mathbf{A}^{\text{T}} \mathbf{N} \mathbf{A}\right|},$$

where $\mathbf{A} = [\boldsymbol{\alpha}_1, \ldots, \boldsymbol{\alpha}_{c-1}]$ and

$$\begin{align} \mathbf{M} &= \sum_{j=1}^{c} l_j (\mathbf{M}_j - \mathbf{M}_*)(\mathbf{M}_j - \mathbf{M}_*)^{\text{T}} \\ \mathbf{N} &= \sum_{j=1}^{c} \mathbf{K}_j (\mathbf{I} - \mathbf{1}_{l_j}) \mathbf{K}_j^{\text{T}}. \end{align}$$
The $\mathbf{M}_i$ are defined as in the two-class case above, and $\mathbf{M}_*$ is defined as

$$(\mathbf{M}_*)_j = \frac{1}{l} \sum_{k=1}^{l} k(\mathbf{x}_j, \mathbf{x}_k).$$
$\mathbf{A}^*$ can then be computed by finding the $(c-1)$ leading eigenvectors of $\mathbf{N}^{-1}\mathbf{M}$. Furthermore, the projection of a new input, $\mathbf{x}_t$, is given by

$$\mathbf{y}(\mathbf{x}_t) = \left(\mathbf{A}^*\right)^{\text{T}} \mathbf{K}_t,$$

where the $i^{\text{th}}$ component of $\mathbf{K}_t$ is given by $k(\mathbf{x}_i, \mathbf{x}_t)$.
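A sketch of the multi-class computation under the same assumed interface as the two-class example (Gram matrix plus per-class index arrays; the regularization $\epsilon$ is again our own addition):

```python
import numpy as np

def kfd_multiclass(K, class_indices, eps=1e-3):
    """Multi-class KFD: A* = the (c-1) leading eigenvectors of N^{-1} M.

    K: (l, l) Gram matrix; class_indices: one index array per class.
    Returns A* with columns alpha_1, ..., alpha_{c-1}.
    """
    l = K.shape[0]
    M_star = K.mean(axis=1)                 # (M_*)_j = (1/l) sum_k k(x_j, x_k)
    M, N = np.zeros((l, l)), eps * np.eye(l)
    for idx in class_indices:
        l_j = len(idx)
        d = K[:, idx].mean(axis=1) - M_star # M_j - M_*
        M += l_j * np.outer(d, d)
        K_j = K[:, idx]
        N += K_j @ (np.eye(l_j) - np.full((l_j, l_j), 1 / l_j)) @ K_j.T
    vals, vecs = np.linalg.eig(np.linalg.solve(N, M))
    order = np.argsort(vals.real)[::-1]     # sort eigenvalues, largest first
    return vecs[:, order[:len(class_indices) - 1]].real

def project_multiclass(A_star, K_t):
    """y(x_t) = (A*)^T K_t, with (K_t)_i = k(x_i, x_t)."""
    return A_star.T @ K_t
```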
In both two-class and multi-class KFD, the class label of a new input can be assigned as

$$f(\mathbf{x}) = \arg\min_j D(\mathbf{y}(\mathbf{x}), \bar{\mathbf{y}}_j),$$

where $\bar{\mathbf{y}}_j$ is the projected mean for class $j$ and $D(\cdot\,,\cdot)$ is a distance function.
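A minimal sketch of this assignment rule, taking Euclidean distance as an illustrative choice of $D$:

```python
import numpy as np

def classify(y_new, class_means):
    """f(x) = argmin_j D(y(x), y_bar_j) with D = Euclidean distance.

    y_new: projection of the new input; class_means: list of the
    projected class means y_bar_j, computed from the training data.
    """
    return int(np.argmin([np.linalg.norm(y_new - y_bar) for y_bar in class_means]))
```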
Kernel discriminant analysis has been used in a variety of applications. These include: