In functional analysis (a branch of mathematics), a reproducing kernel Hilbert space (RKHS) is a Hilbert space of functions in which point evaluation is a continuous linear functional. Roughly speaking, this means that if two functions f and g in the RKHS are close in norm, i.e., \|f-g\| is small, then f and g are also pointwise close, i.e., |f(x)-g(x)| is small for every x. The converse does not need to be true. For example, the sequence of functions \sin^{2n}(x) converges pointwise but not uniformly (i.e., not with respect to the supremum norm); note, however, that the supremum norm does not arise from any inner product, so this sequence does not give a counterexample within a Hilbert space.
It is not entirely straightforward to construct a Hilbert space of functions which is not an RKHS.[1] Some examples, however, have been found.[2][3]
L2 spaces are not Hilbert spaces of functions (and hence not RKHSs), but rather Hilbert spaces of equivalence classes of functions (for example, the functions f and g defined by f(x) = 0 and g(x) = 1_{\mathbb{Q}}(x) are equivalent in L2, since they differ only on a set of measure zero).
An RKHS is associated with a kernel that reproduces every function in the space in the sense that for every x in the set on which the functions are defined, "evaluation at x" can be performed by taking an inner product with a function determined by the kernel.
The reproducing kernel was first introduced in the 1907 work of Stanisław Zaremba concerning boundary value problems for harmonic and biharmonic functions. James Mercer simultaneously examined functions which satisfy the reproducing property in the theory of integral equations. The idea of the reproducing kernel remained untouched for nearly twenty years until it appeared in the dissertations of Gábor Szegő, Stefan Bergman, and Salomon Bochner. The subject was eventually systematically developed in the early 1950s by Nachman Aronszajn and Stefan Bergman.[4]
These spaces have wide applications, including complex analysis, harmonic analysis, and quantum mechanics. Reproducing kernel Hilbert spaces are particularly important in the field of statistical learning theory because of the celebrated representer theorem which states that every function in an RKHS that minimises an empirical risk functional can be written as a linear combination of the kernel function evaluated at the training points. This is a practically useful result as it effectively simplifies the empirical risk minimization problem from an infinite dimensional to a finite dimensional optimization problem.
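For concreteness, one standard finite-sample formulation of the representer theorem (stated here for illustration; the loss V, the regularizer g, and the data (x_i, y_i) are generic placeholders rather than notation used elsewhere in this article) is the following: if g is a strictly increasing function and f^* minimizes

\sum_{i=1}^{n} V(y_i, f(x_i)) + g(\|f\|_H)

over f \in H, then f^* admits a representation

f^*(\cdot) = \sum_{i=1}^{n} \alpha_i \, K(\cdot, x_i)

for some coefficients \alpha_1, \ldots, \alpha_n \in \mathbb{R}.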
For ease of understanding, we provide the framework for real-valued Hilbert spaces. The theory can be easily extended to spaces of complex-valued functions and hence include the many important examples of reproducing kernel Hilbert spaces that are spaces of analytic functions.[5]
Let X be an arbitrary set and H a Hilbert space of real-valued functions on X. The evaluation functional over the Hilbert space of functions H is a linear functional that evaluates each function at a point x,

L_x : f \mapsto f(x) \quad \forall f \in H.

We say that H is a reproducing kernel Hilbert space if, for all x in X, L_x is continuous at every f in H or, equivalently, if L_x is a bounded operator on H, i.e. there exists some M_x > 0 such that

|L_x(f)| := |f(x)| \le M_x \, \|f\|_H \quad \forall f \in H. \qquad (1)

Although M_x < \infty is assumed for all x \in X, it might still be the case that \sup_x M_x = \infty.
While property (1) is the weakest condition that ensures both the existence of an inner product and the evaluation of every function in H at every point in the domain, it does not lend itself to easy application in practice. A more intuitive definition of the RKHS can be obtained by observing that this property guarantees that the evaluation functional can be represented by taking the inner product of f with a function K_x in H. This function is the so-called reproducing kernel for the Hilbert space H, from which the RKHS takes its name. More formally, the Riesz representation theorem implies that for all x in X there exists a unique element K_x of H with the reproducing property,

f(x) = L_x(f) = \langle f, K_x \rangle_H \quad \forall f \in H. \qquad (2)
Since K_x is itself a function defined on X with values in the field \mathbb{R} (or \mathbb{C} in the case of complex Hilbert spaces) and since K_x is in H, we have that

K_x(y) = L_y(K_x) = \langle K_x, K_y \rangle_H,

where K_y \in H is the element of H associated to the evaluation functional L_y.
This allows us to define the reproducing kernel of H as a function K : X \times X \to \mathbb{R} (or \mathbb{C}) by

K(x,y) = \langle K_x, K_y \rangle_H.
From this definition it is easy to see that K : X \times X \to \mathbb{R} (or \mathbb{C}) is both symmetric (resp. conjugate symmetric) and positive definite, i.e.

\sum_{i,j=1}^{n} c_i c_j K(x_i, x_j) = \sum_{i=1}^{n} c_i \left\langle K_{x_i}, \sum_{j=1}^{n} c_j K_{x_j} \right\rangle_H = \left\langle \sum_{i=1}^{n} c_i K_{x_i}, \sum_{j=1}^{n} c_j K_{x_j} \right\rangle_H = \left\| \sum_{i=1}^{n} c_i K_{x_i} \right\|_H^2 \ge 0

for every n \in \mathbb{N}, x_1, \ldots, x_n \in X, and c_1, \ldots, c_n \in \mathbb{R}.
The Moore–Aronszajn theorem (see below) is a sort of converse to this: if a function K satisfies these conditions then there is a Hilbert space of functions on X for which it is a reproducing kernel.
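As a quick numerical illustration of this positive-definiteness condition (an added sketch, not part of the standard exposition; the Gaussian kernel and the sample points below are arbitrary choices), one can form the Gram matrix (K(x_i, x_j))_{i,j} for finitely many points and confirm that its eigenvalues are nonnegative:

```python
import numpy as np

# An arbitrary positive definite kernel (Gaussian/RBF) chosen for illustration.
def k(x, y, sigma=1.0):
    return np.exp(-np.abs(x - y) ** 2 / (2 * sigma ** 2))

x = np.array([-1.3, 0.0, 0.4, 2.1, 3.7])          # arbitrary points x_1, ..., x_n
gram = np.array([[k(xi, xj) for xj in x] for xi in x])

# sum_{i,j} c_i c_j K(x_i, x_j) = c^T gram c >= 0 for all c
# is equivalent to the Gram matrix being positive semi-definite.
eigvals = np.linalg.eigvalsh(gram)
print(eigvals)                 # all eigenvalues are >= 0 (up to round-off)
c = np.random.randn(len(x))
print(c @ gram @ c >= -1e-12)  # the quadratic form is nonnegative
```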
The space of bandlimited continuous functions H is an RKHS, as we now show. Formally, fix some cutoff frequency 0 < a < \infty and define the Hilbert space

H = \{ f \in C(\mathbb{R}) \mid \operatorname{supp}(F) \subset [-a, a] \}

where C(\mathbb{R}) is the set of continuous and square integrable functions and F is the Fourier transform of f. As the inner product we use

\langle f, g \rangle_{L^2} = \int_{-\infty}^{\infty} f(x) \cdot \overline{g(x)} \, dx.
From the Fourier inversion theorem, we have
f(x) = \frac{1}{2\pi} \int_{-a}^{a} F(\omega) \, e^{i x \omega} \, d\omega.
It then follows by the Cauchy–Schwarz inequality and Plancherel's theorem that, for all x,

|f(x)| \le \frac{1}{2\pi} \sqrt{ 2a \int_{-a}^{a} |F(\omega)|^2 \, d\omega } = \frac{\sqrt{2a}}{2\pi} \sqrt{ \int_{-\infty}^{\infty} |F(\omega)|^2 \, d\omega } = \sqrt{\frac{a}{\pi}} \, \|f\|_{L^2}.
This inequality shows that the evaluation functional is bounded, proving that H is indeed an RKHS.
The kernel function K_x in this case is given by

K_x(y) = \frac{a}{\pi} \operatorname{sinc}\left( \frac{a}{\pi} (y - x) \right) = \frac{\sin(a(y - x))}{\pi (y - x)}.
The Fourier transform of K_x(y) defined above is given by

\int_{-\infty}^{\infty} K_x(y) \, e^{-i\omega y} \, dy = \begin{cases} e^{-i\omega x} & \text{if } \omega \in [-a, a], \\ 0 & \text{otherwise}, \end{cases}
which is a consequence of the time-shifting property of the Fourier transform. Consequently, using Plancherel's theorem, we have
\langle f, K_x \rangle_{L^2} = \int_{-\infty}^{\infty} f(y) \cdot \overline{K_x(y)} \, dy = \frac{1}{2\pi} \int_{-a}^{a} F(\omega) \cdot e^{i\omega x} \, d\omega = f(x).
Thus we obtain the reproducing property of the kernel.
Note that K_x in this case is the "bandlimited version" of the Dirac delta function \delta(y - x), and that K_x(y) converges to \delta(y - x) in the weak sense as the cutoff frequency a tends to infinity.
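The reproducing property can also be checked numerically (an illustration added here, not part of the original derivation): take a band-limited function f, for instance f = K_0 itself, and compare the integral \langle f, K_x \rangle_{L^2} with f(x). The cutoff a and the quadrature grid below are arbitrary choices.

```python
import numpy as np

a = 2.0  # cutoff frequency (arbitrary)

def K(x, y):
    # sin(a(y-x)) / (pi (y-x)), with the removable singularity handled by sinc
    return (a / np.pi) * np.sinc(a * (y - x) / np.pi)

f = lambda y: K(0.0, y)               # f = K_0 is itself band-limited to [-a, a]
y = np.linspace(-200, 200, 400001)    # wide grid; the kernel decays like 1/|y|

x0 = 1.3
inner = np.trapz(f(y) * K(x0, y), y)  # <f, K_{x0}>_{L^2}
print(inner, f(x0))                   # the values agree closely; the small residual
                                      # comes from truncating the integration domain
```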
We have seen how a reproducing kernel Hilbert space defines a reproducing kernel function that is both symmetric and positive definite. The Moore–Aronszajn theorem goes in the other direction; it states that every symmetric, positive definite kernel defines a unique reproducing kernel Hilbert space. The theorem first appeared in Aronszajn's Theory of Reproducing Kernels, although he attributes it to E. H. Moore.
Theorem. Suppose K is a symmetric, positive definite kernel on a set X. Then there is a unique Hilbert space of functions on X for which K is a reproducing kernel.
Proof. For all x in X, define Kx = K(x, ⋅). Let H0 be the linear span of \{K_x : x \in X\}. Define an inner product on H0 by
\left\langle \sum_{j=1}^{n} b_j K_{y_j}, \sum_{i=1}^{m} a_i K_{x_i} \right\rangle_{H_0} = \sum_{i=1}^{m} \sum_{j=1}^{n} a_i b_j K(y_j, x_i),
which implies

K(x,y) = \left\langle K_x, K_y \right\rangle_{H_0}.
Let H be the completion of H0 with respect to this inner product. Then H consists of functions of the form
f(x) = \sum_{i=1}^{\infty} a_i K_{x_i}(x) \quad \text{where} \quad \lim_{n \to \infty} \sup_{p \ge 0} \left\| \sum_{i=n}^{n+p} a_i K_{x_i} \right\|_{H_0} = 0.
Now we can check the reproducing property (2):

\langle f, K_x \rangle_H = \sum_{i=1}^{\infty} a_i \left\langle K_{x_i}, K_x \right\rangle_{H_0} = \sum_{i=1}^{\infty} a_i K(x_i, x) = f(x).
To prove uniqueness, let G be another Hilbert space of functions for which K is a reproducing kernel. For every x and y in X, (2) implies that

\langle K_x, K_y \rangle_H = K(x,y) = \langle K_x, K_y \rangle_G.

By linearity, \langle \cdot, \cdot \rangle_H = \langle \cdot, \cdot \rangle_G on the span of \{K_x : x \in X\}. Then H \subset G because G is complete and contains H_0 and hence contains its completion.
Now we need to prove that every element of G is in H. Let f be an element of G. Since H is a closed subspace of G, we can write f = f_H + f_{H^\bot} where f_H \in H and f_{H^\bot} \in H^\bot. Now if x \in X then, using the reproducing property of K in G,

f(x) = \langle K_x, f \rangle_G = \langle K_x, f_H \rangle_G + \langle K_x, f_{H^\bot} \rangle_G = \langle K_x, f_H \rangle_G = \langle K_x, f_H \rangle_H = f_H(x),

where we have used the fact that K_x belongs to H, so that its inner product with f_{H^\bot} in G is zero. This shows that f = f_H in G and concludes the proof.
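The construction of H_0 above is easy to mirror numerically (an illustrative sketch; the Gaussian kernel and the points below are arbitrary stand-ins for K and the x_i): a finite combination f = \sum_i a_i K_{x_i} is represented by its coefficients, inner products reduce to Gram-matrix bilinear forms, and the reproducing property can be checked directly.

```python
import numpy as np

def k(x, y, sigma=1.0):
    # arbitrary positive definite kernel (Gaussian), standing in for K
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))

xs = np.array([0.0, 0.5, 2.0])     # centers x_i
a = np.array([1.0, -0.3, 0.7])     # coefficients a_i

# f = sum_i a_i K_{x_i}, an element of H_0
f = lambda t: sum(ai * k(xi, t) for ai, xi in zip(a, xs))

# <sum_i a_i K_{x_i}, sum_j b_j K_{y_j}>_{H_0} = sum_{i,j} a_i b_j K(x_i, y_j)
def inner(a, xs, b, ys):
    G = np.array([[k(xi, yj) for yj in ys] for xi in xs])
    return a @ G @ b

# Reproducing property: <f, K_x>_{H_0} = f(x)
x = 1.2
print(inner(a, xs, np.array([1.0]), np.array([x])), f(x))  # the two values coincide
```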
We may characterize a symmetric positive definite kernel K via the integral operator using Mercer's theorem and obtain an additional view of the RKHS. Let X be a compact space equipped with a strictly positive finite Borel measure \mu and K : X \times X \to \mathbb{R} a continuous, symmetric, and positive definite function. Define the integral operator T_K : L_2(X) \to L_2(X) as

[T_K f](\cdot) = \int_X K(\cdot, t) \, f(t) \, d\mu(t)

where L_2(X) is the space of square integrable functions with respect to \mu.
Mercer's theorem states that the spectral decomposition of the integral operator T_K of K yields a series representation of K in terms of the eigenvalues and eigenfunctions of T_K. This then implies that K is a reproducing kernel and that the corresponding RKHS can be defined in terms of these eigenvalues and eigenfunctions. We provide the details below.
Under these assumptions T_K is a compact, continuous, self-adjoint, and positive operator. The spectral theorem for self-adjoint compact operators implies that there is an at most countable decreasing sequence (\sigma_i)_i \ge 0 such that \lim_{i \to \infty} \sigma_i = 0 and T_K \varphi_i(x) = \sigma_i \varphi_i(x), where the \{\varphi_i\} form an orthonormal basis of L_2(X). By the positivity of T_K, \sigma_i > 0 for all i. One can also show that T_K maps continuously into the space of continuous functions C(X), and therefore we may choose continuous functions as the eigenvectors, that is, \varphi_i \in C(X) for all i. Then by Mercer's theorem K may be written in terms of the eigenvalues and continuous eigenfunctions as

K(x,y) = \sum_{j=1}^{\infty} \sigma_j \, \varphi_j(x) \, \varphi_j(y)

for all x, y \in X such that

\lim_{n \to \infty} \sup_{u,v} \left| K(u,v) - \sum_{j=1}^{n} \sigma_j \, \varphi_j(u) \, \varphi_j(v) \right| = 0.
The above series representation is referred to as a Mercer kernel or Mercer representation of K.
Furthermore, it can be shown that the RKHS H of K is given by

H = \left\{ f \in L_2(X) \;\middle|\; \sum_{i=1}^{\infty} \frac{\langle f, \varphi_i \rangle_{L_2}^2}{\sigma_i} < \infty \right\}

where the inner product of H is given by

\left\langle f, g \right\rangle_H = \sum_{i=1}^{\infty} \frac{\langle f, \varphi_i \rangle_{L_2} \, \langle g, \varphi_i \rangle_{L_2}}{\sigma_i}.
This representation of the RKHS has application in probability and statistics, for example to the Karhunen-Loève representation for stochastic processes and kernel PCA.
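A common way to approximate the Mercer representation numerically is the Nyström method (an added sketch, not part of this article's exposition; the choice X = [0, 1] with the uniform measure and a Gaussian kernel is arbitrary): discretize the integral operator T_K on a grid, take the eigendecomposition of the weighted Gram matrix, and rebuild K from a truncated eigen-sum.

```python
import numpy as np

n = 400
x = np.linspace(0.0, 1.0, n)       # grid on X = [0, 1]
w = 1.0 / n                        # quadrature weight for the uniform measure

def k(s, t, sigma=0.3):
    return np.exp(-(s - t) ** 2 / (2 * sigma ** 2))

K = k(x[:, None], x[None, :])

# Discretized integral operator: (T_K f)(x_i) ~ sum_j K(x_i, x_j) f(x_j) w.
# Its eigendecomposition approximates the Mercer eigenvalues/eigenfunctions.
evals, evecs = np.linalg.eigh(K * w)
idx = np.argsort(evals)[::-1]
sigma_hat = evals[idx]                      # approximations of sigma_i
phi_hat = evecs[:, idx] / np.sqrt(w)        # approximations of phi_i on the grid

# Truncated Mercer sum: K(x, y) ~ sum_{j<m} sigma_j phi_j(x) phi_j(y)
m = 10
K_approx = (phi_hat[:, :m] * sigma_hat[:m]) @ phi_hat[:, :m].T
print(np.max(np.abs(K - K_approx)))         # small uniform error
```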
A feature map is a map \varphi \colon X \to F, where F is a Hilbert space which we will call the feature space. The first sections presented the connection between bounded/continuous evaluation functionals, positive definite functions, and integral operators, and in this section we provide another representation of the RKHS in terms of feature maps.

Every feature map defines a kernel via

K(x,y) = \langle \varphi(x), \varphi(y) \rangle_F. \qquad (3)

Clearly K is symmetric, and positive definiteness follows from the properties of the inner product in F. Conversely, every positive definite function, and hence every reproducing kernel, admits infinitely many associated feature maps such that (3) holds.

For example, we can trivially take F = H and \varphi(x) = K_x for all x \in X; then (3) is satisfied by the reproducing property. Another classical example of a feature map relates to the previous section on integral operators by taking F = \ell^2 and \varphi(x) = (\sqrt{\sigma_i}\,\varphi_i(x))_i.

This connection between kernels and feature maps provides us with a new way to understand positive definite functions, and hence reproducing kernels, as inner products in H.
Lastly, feature maps allow us to construct function spaces that reveal another perspective on the RKHS. Consider the linear space

H_{\varphi} = \{ f : X \to \mathbb{R} \mid \exists w \in F,\; f(x) = \langle w, \varphi(x) \rangle_F \;\; \forall x \in X \}.

We can define a norm on H_{\varphi} by

\|f\|_{\varphi} = \inf\{ \|w\|_F : w \in F,\; f(x) = \langle w, \varphi(x) \rangle_F \;\; \forall x \in X \}.

It can be shown that H_{\varphi} is an RKHS with kernel defined by K(x,y) = \langle \varphi(x), \varphi(y) \rangle_F.
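To make (3) concrete, here is a small finite-dimensional illustration (added here; the quadratic kernel and the sample points are arbitrary choices): for the homogeneous quadratic kernel K(x,y) = \langle x, y \rangle^2 on \mathbb{R}^2, an explicit feature map into F = \mathbb{R}^3 reproduces the kernel as an inner product.

```python
import numpy as np

def phi(x):
    # feature map for K(x, y) = <x, y>^2 on R^2, mapping into F = R^3
    return np.array([x[0] ** 2, np.sqrt(2) * x[0] * x[1], x[1] ** 2])

def K(x, y):
    return np.dot(x, y) ** 2

x = np.array([1.0, -2.0])
y = np.array([0.5, 3.0])
print(np.dot(phi(x), phi(y)), K(x, y))   # both equal <x, y>^2 = 30.25
```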
Useful properties of RKHSs:

- Let (X_i)_{i=1}^{p} be a sequence of sets and (K_i)_{i=1}^{p} be a sequence of corresponding positive definite functions on (X_i)_{i=1}^{p}. It then follows that

  K((x_1, \ldots, x_p), (y_1, \ldots, y_p)) = K_1(x_1, y_1) \cdots K_p(x_p, y_p)

  is a kernel on X = X_1 \times \cdots \times X_p. (A numerical illustration follows this list.)

- Let X_0 \subset X; then the restriction of K to X_0 \times X_0 is also a reproducing kernel.

- Consider a normalized kernel K such that K(x,x) = 1 for all x \in X. Define a pseudo-metric on X as

  d_K(x,y) = \|K_x - K_y\|_H^2 = 2(1 - K(x,y)) \quad \forall x, y \in X.

  By the Cauchy–Schwarz inequality,

  K(x,y)^2 \le K(x,x) \, K(y,y) = 1 \quad \forall x, y \in X.

  This inequality allows us to view K as a measure of similarity between inputs: if x, y \in X are similar then K(x,y) will be closer to 1, while if x, y \in X are dissimilar then K(x,y) will be closer to 0.

- The closure of the span of \{K_x \mid x \in X\} coincides with H.
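The product-kernel property in the first item above can be illustrated numerically (an added sketch; the kernels and points are arbitrary): for p = 2, the Gram matrix of the product kernel is the entrywise (Hadamard) product of the individual Gram matrices, which is again positive semi-definite by the Schur product theorem.

```python
import numpy as np

x1 = np.random.randn(6)          # points in X_1
x2 = np.random.randn(6)          # points in X_2 (paired with x1)

K1 = np.exp(-np.abs(x1[:, None] - x1[None, :]))     # Laplacian kernel on X_1
K2 = (1.0 + x2[:, None] * x2[None, :]) ** 2         # polynomial kernel on X_2

# Gram matrix of K((x1, x2), (y1, y2)) = K1(x1, y1) * K2(x2, y2)
K = K1 * K2
print(np.linalg.eigvalsh(K1).min(), np.linalg.eigvalsh(K2).min(),
      np.linalg.eigvalsh(K).min())   # all >= 0 (up to round-off)
```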
Common examples of kernels include:

- Linear kernel: K(x,y) = \langle x, y \rangle. The RKHS H corresponding to this kernel is the dual space, consisting of functions f(x) = \langle x, \beta \rangle satisfying \|f\|_H^2 = \|\beta\|^2.

- Polynomial kernel: K(x,y) = (\alpha \langle x, y \rangle + 1)^d, \quad \alpha \in \mathbb{R},\; d \in \mathbb{N}.

- Radial basis function kernels: these are another common class of kernels which satisfy K(x,y) = K(\|x - y\|). Examples include:

  - Gaussian or squared exponential kernel (used in the regression sketch below): K(x,y) = e^{-\frac{\|x - y\|^2}{2\sigma^2}}, \quad \sigma > 0.

  - Laplacian kernel: K(x,y) = e^{-\frac{\|x - y\|}{\sigma}}, \quad \sigma > 0.
The squared norm of a function f in the RKHS H with this kernel is

\|f\|_H^2 = \int_{\mathbb{R}} \left( \frac{1}{\sigma} f(x)^2 + \sigma f'(x)^2 \right) dx.
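As an illustration of how RKHS functions of the form f = \sum_i \alpha_i K(\cdot, x_i) built from the Gaussian kernel are used in practice, here is a kernel ridge regression sketch in the spirit of the representer theorem mentioned in the introduction (added here; the data, the parameter choices, and the use of ridge regression itself are illustrative assumptions rather than part of this article).

```python
import numpy as np

def gauss_kernel(x, y, sigma=0.5):
    return np.exp(-(x[:, None] - y[None, :]) ** 2 / (2 * sigma ** 2))

rng = np.random.default_rng(0)
x_train = np.linspace(0, 2 * np.pi, 20)
y_train = np.sin(x_train) + 0.1 * rng.standard_normal(20)   # noisy samples

lam = 1e-2
K = gauss_kernel(x_train, x_train)
alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)

# The fitted function lives in the RKHS: f(x) = sum_i alpha_i K(x, x_i)
x_test = np.linspace(0, 2 * np.pi, 5)
f_test = gauss_kernel(x_test, x_train) @ alpha
print(np.round(f_test, 3))           # roughly tracks sin(x_test)
print(np.round(np.sin(x_test), 3))
```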
We also provide examples of Bergman kernels. Let X be finite and let H consist of all complex-valued functions on X. Then an element of H can be represented as an array of complex numbers. If the usual inner product is used, then Kx is the function whose value is 1 at x and 0 everywhere else, and
K(x,y) resembles an identity matrix, that is,

K(x,y) = \begin{cases} 1 & x = y \\ 0 & x \neq y. \end{cases}

In this case, H is isomorphic to \Complex^n.
The case of X = D (where D denotes the unit disc) is more sophisticated. Here the Bergman space H^2(D) is the space of square-integrable holomorphic functions on D. It can be shown that the reproducing kernel for H^2(D) is

K(x,y) = \frac{1}{\pi} \, \frac{1}{(1 - x\overline{y})^2}.
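The reproducing property of this Bergman kernel can be checked numerically for a simple holomorphic function (an added sketch; the test function f(z) = z^2, the evaluation point, and the quadrature grid are arbitrary choices):

```python
import numpy as np

def bergman_kernel(z, w):
    # K(z, w) = 1 / (pi (1 - z conj(w))^2) on the unit disc
    return 1.0 / (np.pi * (1.0 - z * np.conj(w)) ** 2)

f = lambda z: z ** 2            # a square-integrable holomorphic function on D
z0 = 0.3 + 0.2j                 # evaluation point inside the disc

# <f, K_{z0}> = integral of f(w) * conj(K_{z0}(w)) over D, and
# conj(K_{z0}(w)) = K(z0, w); use polar coordinates with area element r dr dtheta.
r = np.linspace(0.0, 1.0, 600)
t = np.linspace(0.0, 2.0 * np.pi, 600)
R, T = np.meshgrid(r, t)
W = R * np.exp(1j * T)
integrand = f(W) * bergman_kernel(z0, W) * R
inner = np.trapz(np.trapz(integrand, r, axis=1), t)
print(inner, f(z0))             # both approximately 0.05 + 0.12i
```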
Lastly, the space of band limited functions in L^2(\mathbb{R}) with bandwidth 2a is an RKHS with reproducing kernel

K(x,y) = \frac{\sin a(x - y)}{\pi (x - y)}.
In this section we extend the definition of the RKHS to spaces of vector-valued functions, as this extension is particularly important in multi-task learning and manifold regularization. The main difference is that the reproducing kernel \Gamma is a symmetric function that is now a positive semi-definite matrix for every x, y in X. More formally, we define a vector-valued RKHS (vvRKHS) as a Hilbert space H of functions f : X \to \mathbb{R}^T such that for all c \in \mathbb{R}^T and x \in X

\Gamma_x c(y) = \Gamma(x,y) c \in H \quad \text{for } y \in X

and

\langle f, \Gamma_x c \rangle_H = f(x)^{\intercal} c.
This second property parallels the reproducing property for the scalar-valued case. This definition can also be connected to integral operators, bounded evaluation functionals, and feature maps as we saw for the scalar-valued RKHS. We can equivalently define the vvRKHS as a vector-valued Hilbert space with a bounded evaluation functional and show that this implies the existence of a unique reproducing kernel by the Riesz representation theorem. Mercer's theorem can also be extended to address the vector-valued setting, and we can therefore obtain a feature map view of the vvRKHS. Lastly, it can also be shown that the closure of the span of \{\Gamma_x c : x \in X, c \in \mathbb{R}^T\} coincides with H, the analogue of the result in the scalar-valued case.
We can gain intuition for the vvRKHS by taking a component-wise perspective on these spaces. In particular, we find that every vvRKHS is isometrically isomorphic to a scalar-valued RKHS on a particular input space. Let \Lambda = \{1, \ldots, T\}. Consider the space X \times \Lambda and the corresponding scalar-valued reproducing kernel

\gamma : (X \times \Lambda) \times (X \times \Lambda) \to \mathbb{R}. \qquad (4)

As noted above, the RKHS associated to this reproducing kernel is given by the closure of the span of \{\gamma_{(x,t)} : x \in X, t \in \Lambda\}, where \gamma_{(x,t)}(y,s) = \gamma((x,t),(y,s)) for every pair (y,s) \in X \times \Lambda.

The connection to the scalar-valued RKHS can then be made by the fact that every matrix-valued kernel can be identified with a kernel of the form of (4) via

\Gamma(x,y)_{(t,s)} = \gamma((x,t),(y,s)).
Moreover, every kernel of the form of (4) defines a matrix-valued kernel with the above expression. Now letting the map D : H_\Gamma \to H_\gamma be defined as

(Df)(x,t) = \langle f(x), e_t \rangle_{\mathbb{R}^T}

where e_t is the t-th vector of the canonical basis of \mathbb{R}^T, one can show that D is well-defined, bijective, and norm-preserving; in other words, D is an isometry between H_\Gamma and H_\gamma.
While this view of the vvRKHS can be useful in multi-task learning, this isometry does not reduce the study of the vector-valued case to that of the scalar-valued case. In fact, this isometry procedure can make both the scalar-valued kernel and the input space too difficult to work with in practice as properties of the original kernels are often lost.[11] [12] [13]
An important class of matrix-valued reproducing kernels are separable kernels, which can be factorized as the product of a scalar-valued kernel and a T \times T symmetric positive semi-definite matrix. In light of our previous discussion these kernels are of the form

\gamma((x,t),(y,s)) = K(x,y) \, K_T(t,s)

for all x, y in X and t, s in \{1, \ldots, T\}.
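A separable kernel is straightforward to realize numerically (an illustrative sketch; the scalar kernel, the output matrix B playing the role of K_T, and the points are arbitrary choices): the block Gram matrix over n inputs and T output components is the Kronecker product of the scalar Gram matrix with B, and it is positive semi-definite whenever both factors are.

```python
import numpy as np

T = 3
B = np.array([[2.0, 0.5, 0.0],      # symmetric PSD matrix K_T encoding output structure
              [0.5, 1.0, 0.2],
              [0.0, 0.2, 1.5]])

def k(x, y, sigma=1.0):
    return np.exp(-(x - y) ** 2 / (2 * sigma ** 2))   # scalar kernel K

x = np.array([0.0, 0.7, 1.5, 3.0])
Kx = k(x[:, None], x[None, :])

def Gamma(xi, yj):
    # separable matrix-valued kernel: Gamma(x, y) = K(x, y) * B  (a T x T matrix)
    return k(xi, yj) * B

# Block Gram matrix over all (input, output-component) pairs = Kronecker product
G = np.kron(Kx, B)
print(np.linalg.eigvalsh(G).min() >= -1e-10)   # True: positive semi-definite
```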
We lastly remark that the above theory can be further extended to spaces of functions with values in function spaces but obtaining kernels for these spaces is a more difficult task.[14]
The ReLU function is commonly defined as f(x) = \max\{0, x\} and is a staple in the architecture of neural networks, where it is used as an activation function. One can construct a ReLU-like nonlinear function using the theory of reproducing kernel Hilbert spaces.
We will work with the Hilbert space

\mathcal{H} = L^1_2(0)[0, \infty)

of absolutely continuous functions with f(0) = 0 and square integrable (i.e. L_2) derivative. It has the inner product

\langle f, g \rangle_{\mathcal{H}} = \int_0^{\infty} f'(x) \, g'(x) \, dx.
To construct the reproducing kernel it suffices to consider a dense subspace, so let f \in C^1[0,\infty) with f(0) = 0. The fundamental theorem of calculus then gives
f(y) = \int_0^{y} f'(x) \, dx = \int_0^{\infty} G(x,y) \, f'(x) \, dx = \langle K_y, f \rangle
where

G(x,y) = \begin{cases} 1, & x < y \\ 0, & \text{otherwise} \end{cases}

and K_y'(x) = G(x,y), \; K_y(0) = 0, i.e.,

K(x,y) = K_y(x) = \int_0^{x} G(z,y) \, dz = \begin{cases} x, & 0 \le x < y \\ y, & \text{otherwise} \end{cases} = \min(x,y).
This implies that K_y = K(\cdot, y) reproduces f.
Moreover, the minimum function on X \times X = [0,\infty) \times [0,\infty) has the following representations in terms of the ReLU function:
min(x,y)=x-\operatorname{ReLU}(x-y)=y-\operatorname{ReLU}(y-x).
Using this formulation, one can apply the representer theorem to this RKHS, which allows one to prove the optimality of using ReLU activations in neural network settings.
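A brief numerical check of these identities and of the kernel K(x,y) = \min(x,y) (an added sketch; the sample points are arbitrary choices):

```python
import numpy as np

relu = lambda t: np.maximum(0.0, t)

x = np.array([0.2, 1.0, 2.5, 4.0])

# min(x, y) = x - ReLU(x - y) = y - ReLU(y - x)
X, Y = np.meshgrid(x, x, indexing="ij")
K_min = np.minimum(X, Y)
print(np.allclose(K_min, X - relu(X - Y)))   # True
print(np.allclose(K_min, Y - relu(Y - X)))   # True

# The Gram matrix of K(x, y) = min(x, y) on points of [0, infinity) is PSD
print(np.linalg.eigvalsh(K_min).min() >= -1e-12)   # True
```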