Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function. Kernels encapsulate the properties of functions in a computationally efficient way and allow algorithms to easily swap functions of varying complexity.
In typical machine learning algorithms, these functions produce a scalar output. Recent development of kernel methods for functions with vector-valued output is due, at least in part, to interest in simultaneously solving related problems. Kernels which capture the relationship between the problems allow them to borrow strength from each other. Algorithms of this type include multi-task learning (also called multi-output learning or vector-valued learning), transfer learning, and co-kriging. Multi-label classification can be interpreted as mapping inputs to (binary) coding vectors with length equal to the number of classes.
In Gaussian processes, kernels are called covariance functions. Multiple-output functions correspond to considering multiple processes. See Bayesian interpretation of regularization for the connection between the two perspectives.
The history of learning vector-valued functions is closely linked to transfer learning: storing knowledge gained while solving one problem and applying it to a different but related problem. The fundamental motivation for transfer learning in the field of machine learning was discussed in a NIPS-95 workshop on “Learning to Learn”, which focused on the need for lifelong machine learning methods that retain and reuse previously learned knowledge. Research on transfer learning has attracted much attention since 1995 under different names: learning to learn, lifelong learning, knowledge transfer, inductive transfer, multitask learning, knowledge consolidation, context-sensitive learning, knowledge-based inductive bias, metalearning, and incremental/cumulative learning.[1] Interest in learning vector-valued functions was particularly sparked by multitask learning, a framework which tries to learn multiple, possibly different tasks simultaneously.
Much of the initial research in multitask learning in the machine learning community was algorithmic in nature, and applied to methods such as neural networks, decision trees and k-nearest neighbors in the 1990s.[2] The use of probabilistic models and Gaussian processes was pioneered and largely developed in the context of geostatistics, where prediction over vector-valued output data is known as cokriging.[3][4][5] Geostatistical approaches to multivariate modeling are mostly formulated around the linear model of coregionalization (LMC), a generative approach for developing valid covariance functions that has been used for multivariate regression and in statistics for computer emulation of expensive multivariate computer codes. The regularization and kernel theory literature for vector-valued functions followed in the 2000s.[6][7] While the Bayesian and regularization perspectives were developed independently, they are in fact closely related.[8]
In this context, the supervised learning problem is to learn the function $f$ which best predicts vector-valued outputs $y_i$ given inputs (data) $x_i$:

$f(x_i) = y_i$ for $i = 1, \ldots, N$

where $x_i \in \mathcal{X}$, an input space (e.g. $\mathcal{X} = \mathbb{R}^p$), and $y_i \in \mathbb{R}^D$.

In general, each component of $y_i$ could have its own input data $x_{d,i}$, with a different cardinality $p$ and even a different input space $\mathcal{X}$ for each output.
Here, for simplicity in the notation, we assume the number and sample space of the data for each output are the same.
From the regularization perspective, the problem is to learn $f_*$ belonging to a reproducing kernel Hilbert space of vector-valued functions $\mathcal{H}$. The setup parallels the scalar case:
| | Vector-valued case | Scalar case |
|---|---|---|
| Reproducing kernel | $\mathbf{K}\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}^{D \times D}$ | $k\colon \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ |
| Learning problem | $f_* = \operatorname{argmin}_{f \in \mathcal{H}} \sum_{j=1}^{D}\sum_{i=1}^{N} (f_j(x_i) - y_{j,i})^2 + \lambda \Vert f \Vert_{\mathcal{H}}^2$ | $f_* = \operatorname{argmin}_{f \in \mathcal{H}} \sum_{i=1}^{N} (f(x_i) - y_i)^2 + \lambda \Vert f \Vert_{\mathcal{H}}^2$ |
| Solution (derived via the representer theorem$^\dagger$) | $f_*(x) = \sum_{i=1}^{N} \mathbf{K}(x, x_i)\, \bar{c}_i$ with $\bar{c} = (\mathbf{K}(X,X) + \lambda \mathbf{I})^{-1}\bar{y}$, where $\bar{c}$ and $\bar{y}$ are vectors of length $ND$ obtained by concatenating the coefficient and output vectors, and $\mathbf{K}(X,X)$ is the $ND \times ND$ block matrix whose $(d,d')$ block is the $N \times N$ matrix with entries $(\mathbf{K}(x_i,x_j))_{d,d'}$ | $f_*(x) = \sum_{i=1}^{N} k(x, x_i)\, c_i = \mathbf{k}(x)^\top \mathbf{c}$. Solve for $\mathbf{c}$: $\mathbf{c} = (\mathbf{K} + \lambda \mathbf{I})^{-1}\mathbf{y}$, where $\mathbf{K}_{ij} = k(x_i, x_j)$ and $\mathbf{y}$ is the vector whose $i$th entry is the output $y_i$ |
$^\dagger$ Note, the matrix-valued kernel $\mathbf{K}$ can equivalently be defined by a scalar kernel $R$ on the space $\mathcal{X} \times \{1,\ldots,D\}$ via

$(\mathbf{K}(x,x'))_{d,d'} = R((x,d),(x',d')).$
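The closed-form solution above can be made concrete with a short numerical sketch. The snippet below is an illustration only, not taken from the cited literature: it assumes a matrix-valued kernel of the separable form $\mathbf{K}(x,x') = k(x,x')\mathbf{B}$ discussed later (with a Gaussian input kernel and a hand-picked output matrix $\mathbf{B}$), assembles the $ND \times ND$ block matrix, solves for $\bar{c}$, and evaluates $f_*$ at a new point.

```python
import numpy as np

# Illustration only (not from the cited literature): a matrix-valued kernel of the
# separable form K(x, x') = k(x, x') * B, with k a Gaussian (RBF) kernel on the
# inputs and B a hand-picked positive semi-definite D x D output matrix.
def rbf(X1, X2, length_scale=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

D = 2                                     # number of outputs
B = np.array([[1.0, 0.5], [0.5, 1.0]])    # assumed output similarity matrix

# Toy training data: N scalar inputs, vector-valued outputs in R^D.
rng = np.random.default_rng(0)
N = 20
X = rng.uniform(-3, 3, size=(N, 1))
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 0])]) + 0.05 * rng.standard_normal((N, D))

# ND x ND block matrix K(X, X): block (d, d') is the N x N matrix B[d, d'] * k(X, X).
K_XX = np.kron(B, rbf(X, X))

# Representer-theorem solution: c_bar = (K(X, X) + lambda I)^{-1} y_bar, where y_bar
# stacks the N values of output 1, then output 2, matching the block ordering above.
lam = 1e-2
y_bar = Y.T.reshape(-1)
C = np.linalg.solve(K_XX + lam * np.eye(N * D), y_bar).reshape(D, N)

def f_star(x_new):
    """f_*(x) = sum_i K(x, x_i) c_i, a vector in R^D (c_i is column i of C)."""
    kx = rbf(x_new[None, :], X)[0]        # k(x, x_i) for i = 1..N
    return B @ (C * kx).sum(axis=1)

print(f_star(np.array([0.5])))            # roughly [sin(0.5), cos(0.5)]
```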
The estimator of the vector-valued regularization framework can also be derived from a Bayesian viewpoint using Gaussian process methods in the case of a finite dimensional reproducing kernel Hilbert space. The derivation is similar to the scalar-valued case (see Bayesian interpretation of regularization). The vector-valued function $\mathbf{f}$, consisting of $D$ outputs $\left\{f_d\right\}_{d=1}^D$, is assumed to follow a Gaussian process:

$\mathbf{f} \sim \mathcal{GP}(\mathbf{m}, \mathbf{K})$

where $\mathbf{m}\colon \mathcal{X} \to \mathbb{R}^D$ is now a vector of the mean functions $\left\{m_d(\mathbf{x})\right\}_{d=1}^D$ of the outputs and $\mathbf{K}$ is a positive definite matrix-valued function whose entry $(\mathbf{K}(\mathbf{x},\mathbf{x}'))_{d,d'}$ gives the covariance between the outputs $f_d(\mathbf{x})$ and $f_{d'}(\mathbf{x}')$.

For a set of inputs $\mathbf{X}$, the prior distribution over the vector $\mathbf{f}(\mathbf{X})$ is $\mathcal{N}(\mathbf{m}(\mathbf{X}), \mathbf{K}(\mathbf{X},\mathbf{X}))$, where $\mathbf{m}(\mathbf{X})$ concatenates the mean vectors of the outputs and $\mathbf{K}(\mathbf{X},\mathbf{X})$ is the corresponding block-partitioned covariance matrix. The distribution of the observed outputs is taken to be Gaussian:

$p(\mathbf{y} \mid \mathbf{f}, \mathbf{x}, \Sigma) = \mathcal{N}(\mathbf{f}(\mathbf{x}), \Sigma)$

where $\Sigma \in \mathbb{R}^{D \times D}$ is a diagonal matrix with elements $\left\{\sigma_d^2\right\}_{d=1}^D$ specifying the noise variance of each output. Using this likelihood, the predictive distribution at a new input $\mathbf{x}_*$ is

$p(\mathbf{f}(\mathbf{x}_*) \mid \mathbf{S}, \mathbf{f}, \mathbf{x}_*, \phi) = \mathcal{N}(\mathbf{f}_*(\mathbf{x}_*), \mathbf{K}_*(\mathbf{x}_*,\mathbf{x}_*))$

where $\mathbf{S}$ is the training data and $\phi$ is the set of hyperparameters of $\mathbf{K}(\mathbf{x},\mathbf{x}')$ and $\Sigma$.
Equations for $\mathbf{f}_*$ and $\mathbf{K}_*$ can then be obtained:

$\mathbf{f}_*(\mathbf{x}_*) = \mathbf{K}_{\mathbf{x}_*}(\mathbf{K}(\mathbf{X},\mathbf{X}) + \boldsymbol\Sigma)^{-1}\bar{\mathbf{y}}$

$\mathbf{K}_*(\mathbf{x}_*,\mathbf{x}_*) = \mathbf{K}(\mathbf{x}_*,\mathbf{x}_*) - \mathbf{K}_{\mathbf{x}_*}(\mathbf{K}(\mathbf{X},\mathbf{X}) + \boldsymbol\Sigma)^{-1}\mathbf{K}_{\mathbf{x}_*}^{\top}$

where $\boldsymbol\Sigma = \Sigma \otimes \mathbf{I}_N$ and $\mathbf{K}_{\mathbf{x}_*} \in \mathbb{R}^{D \times ND}$ has entries $(\mathbf{K}(\mathbf{x}_*,\mathbf{x}_j))_{d,d'}$ for $j = 1, \ldots, N$ and $d, d' = 1, \ldots, D$. Note that the predictor $\mathbf{f}_*$ coincides with the estimator obtained in the regularization framework.
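As a purely illustrative sketch of these predictive equations, the snippet below builds $\mathbf{K}(\mathbf{X},\mathbf{X})$ for an assumed separable covariance, adds the noise term $\boldsymbol\Sigma = \Sigma \otimes \mathbf{I}_N$, and evaluates $\mathbf{f}_*(\mathbf{x}_*)$ and $\mathbf{K}_*(\mathbf{x}_*,\mathbf{x}_*)$; the RBF covariance, the output matrix and the noise levels are all assumptions made for the example.

```python
import numpy as np

# Purely illustrative multi-output GP prediction with an assumed separable
# covariance K(x, x') = k(x, x') * B and a zero mean function.
def rbf(X1, X2, length_scale=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

D, N = 2, 15
B = np.array([[1.0, 0.8], [0.8, 1.0]])    # assumed output covariance matrix
sigma2 = np.array([0.01, 0.02])           # assumed noise variance for each output

rng = np.random.default_rng(1)
X = rng.uniform(-3, 3, size=(N, 1))
Y = np.column_stack([np.sin(X[:, 0]), np.sin(X[:, 0] + 0.3)]) \
    + np.sqrt(sigma2) * rng.standard_normal((N, D))

# K(X, X) + Sigma, with Sigma = diag(sigma_d^2) ⊗ I_N (outputs stacked block-wise,
# so block (d, d') of K(X, X) is the N x N matrix B[d, d'] * k(X, X)).
A = np.kron(B, rbf(X, X)) + np.kron(np.diag(sigma2), np.eye(N))
y_bar = Y.T.reshape(-1)

def predict(x_star):
    """Predictive mean f_*(x_*) in R^D and covariance K_*(x_*, x_*) in R^{D x D}."""
    k_star = rbf(x_star[None, :], X)[0]               # k(x_*, x_j) for j = 1..N
    K_xstar = np.kron(B, k_star[None, :])             # D x ND cross-covariance block
    K_ss = B * rbf(x_star[None, :], x_star[None, :])[0, 0]
    mean = K_xstar @ np.linalg.solve(A, y_bar)
    cov = K_ss - K_xstar @ np.linalg.solve(A, K_xstar.T)
    return mean, cov

mean, cov = predict(np.array([0.5]))
print(mean)   # roughly [sin(0.5), sin(0.8)]
print(cov)    # predictive covariance; small near the training inputs
```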
A simple, but broadly applicable, class of multi-output kernels can be separated into the product of a kernel on the input-space and a kernel representing the correlations among the outputs:
$(\mathbf{K}(\mathbf{x},\mathbf{x}'))_{d,d'} = k(\mathbf{x},\mathbf{x}')\, k_T(d,d')$

where $k$ is a scalar kernel on $\mathcal{X} \times \mathcal{X}$ and $k_T$ is a scalar kernel on $\{1,\ldots,D\} \times \{1,\ldots,D\}$.

In matrix form:

$\mathbf{K}(\mathbf{x},\mathbf{x}') = k(\mathbf{x},\mathbf{x}')\,\mathbf{B}$

where $\mathbf{B}$ is a symmetric and positive semi-definite $D \times D$ matrix. Note that setting $\mathbf{B}$ to the identity matrix treats the outputs as unrelated and is equivalent to solving the scalar problems separately.
For a slightly more general form, adding several of these kernels yields sum of separable kernels (SoS kernels).
One way of obtaining $k_T$ is to specify a regularizer that limits the complexity of $f$ in a desirable way and then derive the corresponding kernel. For certain regularizers, this kernel turns out to be separable; three examples follow.
Mixed-effect regularizer

$R(\mathbf{f}) = A_\omega \left( C_\omega \sum_{l=1}^{D} \Vert f_l \Vert_k^2 + \omega D \sum_{l=1}^{D} \Big\Vert f_l - \bar{f} \Big\Vert_k^2 \right)$

where

$A_\omega = \frac{1}{2(1-\omega)(1-\omega+\omega D)}, \qquad C_\omega = (2 - 2\omega + \omega D), \qquad \bar{f} = \frac{1}{D}\sum_{q=1}^{D} f_q$

The corresponding kernel is

$\mathbf{K}_\omega(\mathbf{x},\mathbf{x}') = k(\mathbf{x},\mathbf{x}')\left(\omega\mathbf{1} + (1-\omega)\mathbf{I}_D\right)$

where $\mathbf{1}$ is a $D \times D$ matrix of ones and $\mathbf{I}_D$ is the $D \times D$ identity matrix.

This regularizer combines limiting the complexity of each component of the estimator ($f_l$) with forcing each component to be close to the mean of all the components. Setting $\omega = 0$ treats all of the components as independent and is the same as solving the scalar problems separately. Setting $\omega = 1$ assumes all of the components are explained by the same function.
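The output part of this kernel is simply a $D \times D$ matrix that interpolates between the identity and the all-ones matrix. A minimal sketch, with hypothetical values of $\omega$ and $D$:

```python
import numpy as np

def mixed_effect_B(D, omega):
    """Output matrix of the mixed-effect kernel: omega * 1 + (1 - omega) * I_D."""
    return omega * np.ones((D, D)) + (1.0 - omega) * np.eye(D)

D = 4
print(mixed_effect_B(D, 0.0))   # identity: outputs treated as independent tasks
print(mixed_effect_B(D, 1.0))   # all-ones: all outputs share a single common function
print(mixed_effect_B(D, 0.5))   # intermediate coupling between the two extremes
# The full matrix-valued kernel is K_omega(x, x') = k(x, x') * mixed_effect_B(D, omega)
# for any scalar kernel k of one's choosing.
```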
Cluster-based regularizer

$R(\mathbf{f}) = \varepsilon_1 \sum_{c=1}^{r} \sum_{l \in I(c)} \Vert f_l - \bar{f}_c \Vert_k^2 + \varepsilon_2 \sum_{c=1}^{r} m_c \Vert \bar{f}_c \Vert_k^2$

where $I(c)$ is the index set of the components that belong to cluster $c$, $m_c$ is the cardinality of cluster $c$, $\bar{f}_c = \frac{1}{m_c}\sum_{q \in I(c)} f_q$, and $M_{l,q} = \frac{1}{m_c}$ if $l$ and $q$ both belong to cluster $c$ ($M_{l,q} = 0$ otherwise). The corresponding kernel is

$\mathbf{K}(\mathbf{x},\mathbf{x}') = k(\mathbf{x},\mathbf{x}')\,\mathbf{G}^\dagger$

where $\mathbf{G}_{l,q} = \varepsilon_1 \delta_{lq} + (\varepsilon_2 - \varepsilon_1) M_{l,q}$.

This regularizer divides the components into $r$ clusters and forces the components in each cluster to be similar.
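For a fixed assignment of components to clusters, $M$ and $\mathbf{G}$ can be written down directly. The sketch below uses a hypothetical clustering and hypothetical values of $\varepsilon_1$ and $\varepsilon_2$:

```python
import numpy as np

# Hypothetical clustering of D = 4 components into r = 2 clusters.
clusters = [[0, 1], [2, 3]]          # I(1) = {0, 1}, I(2) = {2, 3} (0-based indices)
D = 4
eps1, eps2 = 1.0, 0.1                # illustrative regularization weights

# M[l, q] = 1 / m_c if l and q belong to the same cluster c, else 0.
M = np.zeros((D, D))
for members in clusters:
    m_c = len(members)
    for l in members:
        for q in members:
            M[l, q] = 1.0 / m_c

# G[l, q] = eps1 * delta_{lq} + (eps2 - eps1) * M[l, q]; the output matrix of the
# separable kernel is its pseudoinverse: K(x, x') = k(x, x') * pinv(G).
G = eps1 * np.eye(D) + (eps2 - eps1) * M
B = np.linalg.pinv(G)
print(B)   # nonzero coupling only between components in the same cluster
```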
Graph regularizer

$R(\mathbf{f}) = \frac{1}{2} \sum_{l,q=1}^{D} \Vert f_l - f_q \Vert_k^2\, M_{lq} + \sum_{l=1}^{D} \Vert f_l \Vert_k^2\, M_{l,l}$

where $M$ is a $D \times D$ matrix of weights encoding the similarities between the components. The corresponding kernel is

$\mathbf{K}(\mathbf{x},\mathbf{x}') = k(\mathbf{x},\mathbf{x}')\,\mathbf{L}^\dagger$

where $\mathbf{L} = \mathbf{D} - \mathbf{M}$, with $\mathbf{D}_{l,q} = \delta_{l,q}\left(\sum_{h=1}^{D} M_{l,h} + M_{l,q}\right)$.

Note, $\mathbf{L}$ is the graph Laplacian of the weight matrix $M$.
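Given a weight matrix $M$, the corresponding output matrix is the pseudoinverse of the Laplacian built from it. A small sketch with a hypothetical weight matrix:

```python
import numpy as np

# Hypothetical D = 3 weight matrix encoding pairwise similarity between components
# (symmetric, zero diagonal), chosen only for illustration.
M = np.array([[0.0, 1.0, 0.2],
              [1.0, 0.0, 0.0],
              [0.2, 0.0, 0.0]])

# Degree-style matrix from the regularizer: D_{l,l} = sum_h M_{l,h} + M_{l,l}.
Deg = np.diag(M.sum(axis=1) + np.diag(M))
L = Deg - M                      # graph Laplacian of the weight matrix

# Output matrix of the separable kernel: K(x, x') = k(x, x') * pinv(L).
B = np.linalg.pinv(L)
print(B)                         # pseudoinverse of the Laplacian, used as the output matrix
```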
Several approaches to learning the output matrix $\mathbf{B}$ from data have been proposed, ranging from estimating $\mathbf{B}$ in a preliminary inference step to learning $\mathbf{B}$ and $\mathbf{f}$ jointly.
In LMC, outputs are expressed as linear combinations of independent random functions such that the resulting covariance function (over all inputs and outputs) is a valid positive semidefinite function. Assuming $D$ outputs $\left\{f_d(\mathbf{x})\right\}_{d=1}^D$ with $\mathbf{x} \in \mathbb{R}^p$, in LMC each output is expressed as

$f_d(\mathbf{x}) = \sum_{q=1}^{Q} a_{d,q}\, u_q(\mathbf{x})$

where the scalar coefficients $a_{d,q}$ are independent of the inputs $\mathbf{x}$ and the latent functions $u_q(\mathbf{x})$ have zero mean and covariance $\operatorname{cov}[u_q(\mathbf{x}), u_{q'}(\mathbf{x}')] = k_q(\mathbf{x},\mathbf{x}')$ if $q = q'$, and zero otherwise. More generally, the latent functions can be grouped so that $R_q$ independent copies $u^{i}_{q}(\mathbf{x})$, $i = 1, \ldots, R_q$, share the same covariance: the functions $u^{i}_{q}(\mathbf{x})$, with $q = 1, \ldots, Q$ and $i = 1, \ldots, R_q$, have zero mean and covariance $\operatorname{cov}[u^{i}_{q}(\mathbf{x}), u^{i'}_{q'}(\mathbf{x}')] = k_q(\mathbf{x},\mathbf{x}')$ if $i = i'$ and $q = q'$, and zero otherwise. The cross covariance between any two outputs $f_d(\mathbf{x})$ and $f_{d'}(\mathbf{x}')$ can then be written as

$\operatorname{cov}\left[f_d(\mathbf{x}), f_{d'}(\mathbf{x}')\right] = \sum_{q=1}^{Q}\sum_{i=1}^{R_q} a^{i}_{d,q}\, a^{i}_{d',q}\, k_q(\mathbf{x},\mathbf{x}') = \sum_{q=1}^{Q} b^{q}_{d,d'}\, k_q(\mathbf{x},\mathbf{x}')$

with $b^{q}_{d,d'} = \sum_{i=1}^{R_q} a^{i}_{d,q}\, a^{i}_{d',q}$. Since $\operatorname{cov}[f_d(\mathbf{x}), f_{d'}(\mathbf{x}')]$ is exactly the entry $(\mathbf{K}(\mathbf{x},\mathbf{x}'))_{d,d'}$ of the matrix-valued kernel, $\mathbf{K}(\mathbf{x},\mathbf{x}')$ can be expressed as

$\mathbf{K}(\mathbf{x},\mathbf{x}') = \sum_{q=1}^{Q} \mathbf{B}_q\, k_q(\mathbf{x},\mathbf{x}')$

where each $\mathbf{B}_q \in \mathbb{R}^{D \times D}$, with entries $(\mathbf{B}_q)_{d,d'} = b^{q}_{d,d'}$, is known as a coregionalization matrix. The kernel derived from LMC is therefore a sum of products of two covariance functions: one that models the dependence between the outputs, independently of the input vector $\mathbf{x}$ (the coregionalization matrix $\mathbf{B}_q$), and one that models the input dependence, independently of $\left\{f_d(\mathbf{x})\right\}_{d=1}^D$ (the covariance function $k_q(\mathbf{x},\mathbf{x}')$).
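A sketch of assembling an LMC covariance on a finite set of inputs is given below; the number of latent groups, the coregionalization matrices (built as $\mathbf{B}_q = \mathbf{A}_q\mathbf{A}_q^\top$ so they are positive semi-definite) and the base kernels $k_q$ are all hypothetical choices made for the illustration.

```python
import numpy as np

def rbf(X1, X2, length_scale):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

# Hypothetical LMC with D = 2 outputs and Q = 2 latent groups. Each B_q = A_q A_q^T
# is positive semi-definite by construction (the columns of A_q play the role of a_{d,q}^i).
D, Q = 2, 2
A = [np.array([[1.0], [0.6]]),             # A_1: R_1 = 1 column
     np.array([[0.3, 0.1], [-0.4, 0.8]])]  # A_2: R_2 = 2 columns
B = [A_q @ A_q.T for A_q in A]
length_scales = [0.5, 2.0]                 # one base kernel k_q per group

rng = np.random.default_rng(2)
X = rng.uniform(-3, 3, size=(10, 1))

# Multi-output Gram matrix: sum over q of B_q ⊗ k_q(X, X) (outputs stacked block-wise).
K = sum(np.kron(B[q], rbf(X, X, length_scales[q])) for q in range(Q))

# K is a valid covariance: symmetric and positive semi-definite.
print(np.allclose(K, K.T), np.min(np.linalg.eigvalsh(K)) > -1e-10)
```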
The ICM is a simplified version of the LMC, with $Q = 1$. In this case the elements $b^{q}_{d,d'}$ of the coregionalization matrix $\mathbf{B}_q$ are assumed to factorize as

$b^{q}_{d,d'} = v_{d,d'}\, b_q$

for some suitable coefficients $v_{d,d'}$. With this form for $b^{q}_{d,d'}$:

$\operatorname{cov}\left[f_d(\mathbf{x}), f_{d'}(\mathbf{x}')\right] = \sum_{q=1}^{Q} v_{d,d'}\, b_q\, k_q(\mathbf{x},\mathbf{x}') = v_{d,d'} \sum_{q=1}^{Q} b_q\, k_q(\mathbf{x},\mathbf{x}') = v_{d,d'}\, k(\mathbf{x},\mathbf{x}')$

where

$k(\mathbf{x},\mathbf{x}') = \sum_{q=1}^{Q} b_q\, k_q(\mathbf{x},\mathbf{x}').$

In this case the coefficients are

$v_{d,d'} = \sum_{i=1}^{R_1} a^{i}_{d,1}\, a^{i}_{d',1} = b^{1}_{d,d'}$

and the kernel matrix for multiple outputs becomes $\mathbf{K}(\mathbf{x},\mathbf{x}') = k(\mathbf{x},\mathbf{x}')\,\mathbf{B}$, i.e. a separable kernel in which a single base covariance $k_q(\mathbf{x},\mathbf{x}')$ is shared by all autocovariances and cross covariances of the outputs.
Another simplified version of the LMC is the semiparametric latent factor model (SLFM), which corresponds to setting $R_q = 1$ (instead of $Q = 1$ as in the ICM), so that each latent function $u_q$ has its own covariance.
While simple, the structure of separable kernels can be too limiting for some problems.
Notable examples of non-separable kernels have been developed in the regularization literature; matrix-valued kernels designed for divergence-free or curl-free vector fields are one such family.
In the Bayesian perspective, LMC produces a separable kernel because the output functions evaluated at a point $\mathbf{x}$ depend only on the values of the latent functions at $\mathbf{x}$. A non-trivial way to mix the latent functions is by convolving a base process with a smoothing kernel; if the base process is a Gaussian process, the convolved process is Gaussian as well. This construction, known as process convolution, yields non-separable covariances.
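A discrete one-dimensional caricature of this idea, with a white-noise base process and Gaussian smoothing kernels chosen purely for illustration, shows how two outputs obtained by smoothing the same base process become correlated:

```python
import numpy as np

# Discrete caricature of a process convolution: each output is the convolution of a
# shared white-noise base process with its own smoothing kernel. The grid, the base
# process and the kernel widths are all made up for this illustration.
rng = np.random.default_rng(3)
grid = np.linspace(-5, 5, 501)
dz = grid[1] - grid[0]
u = rng.standard_normal(grid.size)             # base (white-noise) process on the grid

def smoothing_kernel(width):
    g = np.exp(-grid ** 2 / (2 * width ** 2))
    return g / (g.sum() * dz)                  # normalized Gaussian smoother

# f_d(x) ≈ sum_z G_d(x - z) u(z) dz, computed with a discrete convolution.
f1 = np.convolve(u, smoothing_kernel(0.3), mode="same") * dz
f2 = np.convolve(u, smoothing_kernel(1.0), mode="same") * dz

# The two outputs are smoothed versions of the same base process, so they are
# correlated, but their joint covariance is not one scalar kernel times a fixed matrix.
print(np.corrcoef(f1, f2)[0, 1])
```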
When implementing an algorithm using any of the kernels above, practical considerations of tuning the parameters and ensuring reasonable computation time must be considered.
Approached from the regularization perspective, parameter tuning is similar to the scalar-valued case and can generally be accomplished with cross validation. Solving the required linear system is typically expensive in memory and time. If the kernel is separable, a coordinate transform can convert $\mathbf{K}(\mathbf{X},\mathbf{X})$ to a block-diagonal matrix, greatly reducing the computational burden by solving $D$ independent subproblems (plus the eigendecomposition of $\mathbf{B}$). In particular, for a least squares loss function (Tikhonov regularization), there exists a closed form solution for $\bar{c}$: in the transformed coordinates, each block of $\bar{c}$ is obtained by solving an $N \times N$ system with the scalar kernel.
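The decoupling for separable kernels can be sketched as follows, assuming an RBF input kernel and a hand-picked output matrix $\mathbf{B}$: eigendecomposing $\mathbf{B}$ turns the single $ND \times ND$ system into $D$ systems of size $N \times N$.

```python
import numpy as np

def rbf(X1, X2, length_scale=1.0):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

# Assumed separable kernel K(X, X) = B ⊗ k(X, X) and toy data (illustration only).
rng = np.random.default_rng(4)
D, N, lam = 2, 30, 1e-2
B = np.array([[1.0, 0.4], [0.4, 1.0]])
X = rng.uniform(-3, 3, size=(N, 1))
Y = np.column_stack([np.sin(X[:, 0]), np.cos(X[:, 0])])
Kx = rbf(X, X)
y_bar = Y.T.reshape(-1)                          # outputs stacked block-wise

# Direct solve of the ND x ND system (K(X, X) + lam I) c = y.
c_direct = np.linalg.solve(np.kron(B, Kx) + lam * np.eye(N * D), y_bar)

# Block-diagonalization: with B = U diag(s) U^T, the transformed system splits
# into D independent N x N systems (s_d * Kx + lam I) c_d = y_d.
s, U = np.linalg.eigh(B)
Y_tilde = Y @ U                                  # rotate outputs into the eigenbasis of B
C_tilde = np.column_stack([
    np.linalg.solve(s[d] * Kx + lam * np.eye(N), Y_tilde[:, d]) for d in range(D)
])
c_blockwise = (C_tilde @ U.T).T.reshape(-1)      # rotate back and restack

print(np.allclose(c_direct, c_blockwise))        # the two solutions agree
```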
There are many works related to parameter estimation for Gaussian processes. Some methods, such as maximization of the marginal likelihood (also known as evidence approximation, type II maximum likelihood, or empirical Bayes) and least squares, give point estimates of the parameter vector $\phi$. There are also works that perform full Bayesian inference by assigning priors to $\phi$ and computing the posterior distribution through a sampling procedure.
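For a Gaussian likelihood the log marginal likelihood is available in closed form, so a point estimate of $\phi$ can be obtained by maximizing it numerically. The sketch below assumes a separable covariance whose only hyperparameters are a length-scale and a shared noise variance, and uses a coarse grid search in place of a proper optimizer:

```python
import numpy as np

def rbf(X1, X2, length_scale):
    d2 = np.sum((X1[:, None, :] - X2[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2 * length_scale ** 2))

def neg_log_marginal_likelihood(phi, X, Y, B):
    """-log p(y | X, phi) for a multi-output GP with an assumed separable covariance.

    phi = (length_scale, noise_variance) are the only hyperparameters here; a real
    model would typically also parametrize B and use per-output noise levels.
    """
    length_scale, noise = phi
    N, D = Y.shape
    y_bar = Y.T.reshape(-1)
    K = np.kron(B, rbf(X, X, length_scale)) + noise * np.eye(N * D)
    L = np.linalg.cholesky(K)                       # K = L L^T, avoids explicit inversion
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_bar))
    log_det = 2.0 * np.sum(np.log(np.diag(L)))
    return 0.5 * (y_bar @ alpha + log_det + N * D * np.log(2 * np.pi))

# Toy data and a coarse grid search standing in for a gradient-based optimizer.
rng = np.random.default_rng(5)
X = rng.uniform(-3, 3, size=(25, 1))
Y = np.column_stack([np.sin(X[:, 0]), np.sin(X[:, 0] + 0.3)]) + 0.1 * rng.standard_normal((25, 2))
B = np.array([[1.0, 0.9], [0.9, 1.0]])

candidates = [(ls, nv) for ls in (0.3, 1.0, 3.0) for nv in (0.01, 0.1)]
best = min(candidates, key=lambda phi: neg_log_marginal_likelihood(phi, X, Y, B))
print(best)   # point estimate of (length_scale, noise_variance) on this coarse grid
```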
The main computational problem in the Bayesian viewpoint is the same as the one appearing in regularization theory: inverting the matrix

$\overline{\mathbf{K}(\mathbf{X},\mathbf{X})} = \mathbf{K}(\mathbf{X},\mathbf{X}) + \boldsymbol{\Sigma}.$
This step is necessary for computing the marginal likelihood and the predictive distribution. For most of the proposed approximation methods, the computational savings are independent of the particular model (e.g. LMC, process convolution) used to compute the multi-output covariance matrix. A summary of different methods for reducing computational complexity in multi-output Gaussian processes is presented in [8].