Spectral regularization is any of a class of regularization techniques used in machine learning to control the impact of noise and prevent overfitting. Spectral regularization can be used in a broad range of applications, from deblurring images to classifying emails into a spam folder and a non-spam folder. For instance, in the email classification example, spectral regularization can be used to reduce the impact of noise and prevent overfitting when a machine learning system is being trained on a labeled set of emails to learn how to tell a spam and a non-spam email apart.
Spectral regularization algorithms rely on methods that were originally defined and studied in the theory of ill-posed inverse problems (for instance, see [1]), focusing on the inversion of a linear operator (or a matrix) that possibly has a bad condition number or an unbounded inverse. In this context, regularization amounts to substituting the original operator with a bounded operator called the "regularization operator" that has a condition number controlled by a regularization parameter,[2] a classical example being Tikhonov regularization. To ensure stability, this regularization parameter is tuned based on the level of noise. The main idea behind spectral regularization is that each regularization operator can be described using spectral calculus as an appropriate filter on the eigenvalues of the operator that defines the problem, where the role of the filter is to "suppress the oscillatory behavior corresponding to small eigenvalues". Therefore, each algorithm in the class of spectral regularization algorithms is defined by a suitable filter function (which needs to be derived for that particular algorithm). Three of the most commonly used regularization algorithms for which spectral filtering is well studied are Tikhonov regularization, Landweber iteration, and truncated singular value decomposition (TSVD). As for choosing the regularization parameter, examples of candidate methods to compute this parameter include the discrepancy principle, generalized cross validation, and the L-curve criterion.[3]
The notion of spectral filtering studied in the context of machine learning is closely connected to the literature on function approximation in signal processing.
The training set is defined as $S = \{(x_1, y_1), \dots, (x_n, y_n)\}$, where $X$ is the $n \times d$ input matrix and $Y = (y_1, \dots, y_n)$ is the output vector. Where applicable, the kernel function is denoted by $k$, and the $n \times n$ kernel matrix is denoted by $K$, which has entries $K_{ij} = k(x_i, x_j)$; $\mathcal{H}$ denotes the reproducing kernel Hilbert space (RKHS) with kernel $k$. The regularization parameter is denoted by $\lambda$.
(Note: For $g \in G$ and $f \in F$, with $G$ and $F$ Hilbert spaces, given a linear, continuous operator $L$, assume that $g = Lf$ holds. In this setting, the direct problem would be to solve for $g$ given $f$, and the inverse problem would be to solve for $f$ given $g$. If the solution exists, is unique, and is stable, the inverse problem (i.e., the problem of solving for $f$) is well-posed; otherwise, it is ill-posed.)
The connection between the regularized least squares (RLS) estimation problem (the Tikhonov regularization setting) and the theory of ill-posed inverse problems illustrates how spectral regularization algorithms are related to this theory.
The RLS estimator solves
$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal{H}}^2,$$
and the RKHS allows for expressing this RLS estimator as
$$f_S^{\lambda}(X) = \sum_{i=1}^{n} c_i k(x, x_i), \quad \text{where } (K + n\lambda I)c = Y \text{ and } c = (c_1, \dots, c_n).$$
The penalization term is used for controlling smoothness and preventing overfitting. Since the solution of empirical risk minimization
$$\min_{f \in \mathcal{H}} \frac{1}{n} \sum_{i=1}^{n} (y_i - f(x_i))^2$$
can be written as $f_S^{\lambda}(X) = \sum_{i=1}^{n} c_i k(x, x_i)$ such that $Kc = Y$, adding the penalty term amounts to the following change in the system that needs to be solved:
$$Kc = Y \;\longrightarrow\; (K + n\lambda I)c = Y.$$
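As a concrete illustration, the regularized linear system above can be solved directly. The following Python sketch (the Gaussian kernel, the toy data, and all names are illustrative assumptions, not part of the original presentation) computes the coefficient vector $c$ from $(K + n\lambda I)c = Y$:

    import numpy as np

    def gaussian_kernel(X, gamma=1.0):
        # Pairwise Gaussian (RBF) kernel matrix: K_ij = exp(-gamma * ||x_i - x_j||^2)
        sq_norms = np.sum(X ** 2, axis=1)
        sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2 * X @ X.T
        return np.exp(-gamma * sq_dists)

    def rls_coefficients(K, Y, lam):
        # Solve (K + n * lambda * I) c = Y for the RLS coefficient vector c
        n = K.shape[0]
        return np.linalg.solve(K + n * lam * np.eye(n), Y)

    # Toy data (illustrative only): n = 50 points in d = 3 dimensions
    rng = np.random.default_rng(0)
    X = rng.normal(size=(50, 3))
    Y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=50)
    K = gaussian_kernel(X)
    c = rls_coefficients(K, Y, lam=1e-2)   # c defines f(x) = sum_i c_i k(x, x_i)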
In this learning setting, the kernel matrix can be decomposed as
$$K = Q \Sigma Q^{T},$$
with $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_n)$ and $q_1, \dots, q_n$ the corresponding eigenvectors, so that the solution of $Kc = Y$ can be written as
$$c = Q \Sigma^{-1} Q^{T} Y = \sum_{i=1}^{n} \frac{q_i^{T} Y}{\sigma_i} \, q_i.$$
Thus, for small eigenvalues, even small perturbations in the data can lead to considerable changes in the solution. Hence, the problem is ill-conditioned, and solving this RLS problem amounts to stabilizing a possibly ill-conditioned matrix inversion problem, which is studied in the theory of ill-posed inverse problems; in both problems, a main concern is to deal with the issue of numerical stability.
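This sensitivity can be observed numerically. The short sketch below (an illustrative construction with a synthetic ill-conditioned matrix, not taken from the original text) perturbs $Y$ slightly and compares the unregularized solution of $Kc = Y$ with the regularized one:

    import numpy as np

    rng = np.random.default_rng(1)
    Q, _ = np.linalg.qr(rng.normal(size=(20, 20)))
    sigma = np.logspace(0, -8, 20)              # eigenvalues spanning 8 orders of magnitude
    K = Q @ np.diag(sigma) @ Q.T                # synthetic, badly conditioned kernel-like matrix
    Y = rng.normal(size=20)
    Y_noisy = Y + 1e-6 * rng.normal(size=20)    # tiny perturbation of the data

    c = np.linalg.solve(K, Y)
    c_noisy = np.linalg.solve(K, Y_noisy)
    print(np.linalg.norm(c_noisy - c))          # large: small eigenvalues amplify the perturbation

    n, lam = 20, 1e-3
    c_reg = np.linalg.solve(K + n * lam * np.eye(n), Y)
    c_reg_noisy = np.linalg.solve(K + n * lam * np.eye(n), Y_noisy)
    print(np.linalg.norm(c_reg_noisy - c_reg))  # small: regularization stabilizes the inversion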
Each algorithm in the class of spectral regularization algorithms is defined by a suitable filter function, denoted here by $G_\lambda(\cdot)$. If the kernel matrix is denoted by $K$, then $\lambda$ should control the magnitude of the smaller eigenvalues of $G_\lambda(K)$. In a filtering setup, the goal is to find estimators
$$f_S^{\lambda}(X) := \sum_{i=1}^{n} c_i k(x, x_i), \quad \text{where } c = G_\lambda(K) Y.$$
To do so, a scalar filter function $G_\lambda(\sigma)$ is defined using the eigen-decomposition of the kernel matrix,
$$G_\lambda(K) = Q G_\lambda(\Sigma) Q^{T},$$
which yields
$$G_\lambda(K) Y = \sum_{i=1}^{n} G_\lambda(\sigma_i) \, (q_i^{T} Y) \, q_i.$$
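In matrix form, the filtered coefficients can be computed directly from the eigen-decomposition of $K$. The following Python sketch (the function name and interface are illustrative assumptions) applies an arbitrary scalar filter $G_\lambda$ to the eigenvalues:

    import numpy as np

    def spectral_filter_coefficients(K, Y, filter_fn):
        # c = G_lambda(K) Y, with G_lambda(K) = Q G_lambda(Sigma) Q^T.
        # filter_fn maps an array of eigenvalues sigma_i to G_lambda(sigma_i).
        sigma, Q = np.linalg.eigh(K)             # K = Q diag(sigma) Q^T for symmetric K
        return Q @ (filter_fn(sigma) * (Q.T @ Y))

Different choices of filter_fn then yield the different spectral regularization algorithms discussed below.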
Typically, an appropriate filter function should have the following properties:
1. As $\lambda$ goes to zero, $G_\lambda(\sigma) \to 1/\sigma$.
2. The magnitude of the (smaller) eigenvalues of $G_\lambda$ is controlled by $\lambda$.
While the above items give a rough characterization of the general properties of filter functions for all spectral regularization algorithms, the derivation of the filter function (and hence its exact form) varies depending on the specific regularization method that spectral filtering is applied to.
In the Tikhonov regularization setting, the filter function for RLS is described below. In this setting, $c = (K + n\lambda I)^{-1} Y$, so the filter function for Tikhonov regularization is defined as
$$G_\lambda(\sigma) = \frac{1}{\sigma + n\lambda}.$$
The undesired components are filtered out using regularization: if $\sigma_i \gg n\lambda$, then $\frac{1}{\sigma_i + n\lambda} \sim \frac{1}{\sigma_i}$, so the corresponding component is left essentially unchanged, whereas if $\sigma_i \ll n\lambda$, then $\frac{1}{\sigma_i + n\lambda} \sim \frac{1}{n\lambda}$, so the corresponding component is damped rather than amplified.
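As a sketch, the Tikhonov filter can be plugged into the generic filtering routine above (spectral_filter_coefficients is the hypothetical helper introduced in the earlier sketch, not part of the original text); the result coincides with solving $(K + n\lambda I)c = Y$ directly:

    import numpy as np

    def tikhonov_filter(lam, n):
        # G_lambda(sigma) = 1 / (sigma + n * lambda)
        return lambda sigma: 1.0 / (sigma + n * lam)

    # Using the toy K, Y from the earlier sketch:
    # c_filtered = spectral_filter_coefficients(K, Y, tikhonov_filter(lam=1e-2, n=K.shape[0]))
    # c_direct = np.linalg.solve(K + K.shape[0] * 1e-2 * np.eye(K.shape[0]), Y)
    # np.allclose(c_filtered, c_direct)  # True up to numerical precision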
The idea behind the Landweber iteration is gradient descent:

    c_0 := 0
    for i = 1, ..., t - 1
        c_i := c_{i-1} + η (Y - K c_{i-1})
    end

In this setting, if $n$ is larger than $K$'s largest eigenvalue, the above iteration converges when the step size is chosen as $\eta = 2/n$. The iteration is equivalent to minimizing the empirical risk $\frac{1}{n} \|Y - Kc\|_2^2$ via gradient descent; using induction, it can be shown that at the $t$-th iteration the solution is given by
$$c = \eta \sum_{i=0}^{t-1} (I - \eta K)^i Y.$$
Thus, the appropriate filter function is defined by
$$G_\lambda(\sigma) = \eta \sum_{i=0}^{t-1} (1 - \eta\sigma)^i.$$
It can be shown that this filter function corresponds to a truncated power expansion of $K^{-1}$. To see this, note that the relation $\sum_{i \geq 0} x^i = 1/(1 - x)$ still holds if $x$ is replaced by a matrix; thus, if $K$ (the kernel matrix), or rather $I - \eta K$, is considered, the following holds:
$$K^{-1} = \eta \sum_{i=0}^{\infty} (I - \eta K)^i \approx \eta \sum_{i=0}^{t-1} (I - \eta K)^i.$$
In this setting, the number of iterations gives the regularization parameter; roughly speaking, $t \sim 1/\lambda$. If $t$ is large, overfitting may be a concern; if $t$ is small, oversmoothing (underfitting) may be a concern. Thus, choosing an appropriate number of iterations for the Landweber iteration amounts to choosing an appropriate stopping time, which is what regularizes the problem.
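A brief sketch of the iteration and of its equivalent filter form is given below (it reuses the toy K, Y and the hypothetical spectral_filter_coefficients helper from the earlier sketches):

    import numpy as np

    def landweber_coefficients(K, Y, eta, t):
        # Gradient descent on the empirical risk (1/n) * ||Y - K c||^2, stopped after t steps
        c = np.zeros_like(Y)
        for _ in range(t):
            c = c + eta * (Y - K @ c)
        return c

    def landweber_filter(eta, t):
        # G_lambda(sigma) = eta * sum_{i=0}^{t-1} (1 - eta * sigma)^i
        return lambda sigma: eta * sum((1.0 - eta * sigma) ** i for i in range(t))

    # Both routes give the same coefficients; the number of iterations t acts as 1/lambda:
    # eta = 2.0 / K.shape[0]
    # c_iter = landweber_coefficients(K, Y, eta, t=100)
    # c_filt = spectral_filter_coefficients(K, Y, landweber_filter(eta, t=100))
    # np.allclose(c_iter, c_filt)  # True up to numerical precision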
In the TSVD setting, given the eigen-decomposition $K = Q \Sigma Q^{T}$ and using a prescribed threshold $\lambda n$, a regularized solution can be formed by keeping only the eigenvalues that are at least as large as the threshold and discarding the rest. In particular, the filter function is
$$G_\lambda(\sigma) = \begin{cases} 1/\sigma, & \text{if } \sigma \geq \lambda n, \\ 0, & \text{otherwise.} \end{cases}$$
It can be shown that TSVD is equivalent to the (unsupervised) projection of the data using (kernel) Principal Component Analysis (PCA), and that it is also equivalent to minimizing the empirical risk on the projected data (without regularization). Note that the number of components kept for the projection is the only free parameter here.
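In filter form, a sketch of TSVD (again assuming the hypothetical spectral_filter_coefficients helper from the earlier sketch) is:

    import numpy as np

    def tsvd_filter(lam, n):
        # G_lambda(sigma) = 1/sigma if sigma >= lambda * n, and 0 otherwise
        def g(sigma):
            keep = sigma >= lam * n
            out = np.zeros_like(sigma)
            out[keep] = 1.0 / sigma[keep]
            return out
        return g

    # c_tsvd = spectral_filter_coefficients(K, Y, tsvd_filter(lam=1e-3, n=K.shape[0]))

The number of eigenvalues that survive the threshold corresponds to the number of (kernel) principal components kept in the equivalent kernel-PCA projection.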