$f:\mathbb{R}^p \to \mathbb{R}$ as the weighted average of neighboring observed data.
Let $K_{h_\lambda}(X_0, X)$ be a kernel defined by

$$K_{h_\lambda}(X_0, X) = D\left(\frac{\left\|X - X_0\right\|}{h_\lambda(X_0)}\right)$$

where:

$X, X_0 \in \mathbb{R}^p$
$\left\|\cdot\right\|$ is the Euclidean norm
$h_\lambda(X_0)$ is a parameter (kernel radius)
$D(t)$ is typically a positive real-valued function whose value does not increase as the distance between $X$ and $X_0$ increases.
Popular kernels used for smoothing include parabolic (Epanechnikov), Tricube, and Gaussian kernels.
Let $Y(X): \mathbb{R}^p \to \mathbb{R}$ be a continuous function of $X$. For each $X_0 \in \mathbb{R}^p$, the Nadaraya-Watson kernel-weighted average (smooth $Y(X)$ estimation) is defined by

$$\hat{Y}(X_0) = \frac{\sum_{i=1}^N K_{h_\lambda}(X_0, X_i)\, Y(X_i)}{\sum_{i=1}^N K_{h_\lambda}(X_0, X_i)}$$

where:

$N$ is the number of observed points
$Y(X_i)$ are the observations at the points $X_i$
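For illustration, a minimal NumPy sketch of this kernel-weighted average might look as follows; the function name `nadaraya_watson` and the generic `kernel` argument are illustrative choices, not part of the original text.

```python
import numpy as np

def nadaraya_watson(x0, X, Y, kernel):
    """Kernel-weighted average of the observations Y at the query point x0.

    X      : observed inputs, shape (N,) or (N, p)
    Y      : observed responses, shape (N,)
    kernel : callable K(x0, xi) returning a non-negative weight
    """
    weights = np.array([kernel(x0, xi) for xi in X])
    return np.sum(weights * Y) / np.sum(weights)
```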
In the following sections, we describe some particular cases of kernel smoothers.
The Gaussian kernel is one of the most widely used kernels, and is expressed with the equation below.
$$K(x^*, x_i) = \exp\left(-\frac{(x^* - x_i)^2}{2b^2}\right)$$
Here, b is the length scale for the input space.
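A possible NumPy version of this kernel, together with a small usage example; the data, query point, and length scale value below are made up for illustration.

```python
import numpy as np

def gaussian_kernel(x_star, x_i, b=1.0):
    """Gaussian kernel with length scale b (one-dimensional inputs)."""
    return np.exp(-(x_star - x_i) ** 2 / (2 * b ** 2))

# Example: smooth noisy samples of sin(x) at a single query point.
rng = np.random.default_rng(0)
X = np.linspace(0, 2 * np.pi, 100)
Y = np.sin(X) + 0.1 * rng.standard_normal(X.size)
x0 = np.pi / 2
w = gaussian_kernel(x0, X, b=0.3)   # weights for all observations at once
y_hat = np.sum(w * Y) / np.sum(w)   # kernel-weighted average, close to sin(pi/2) = 1
```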
The k-nearest neighbor algorithm can be used for defining a k-nearest neighbor smoother as follows. For each point X0, take m nearest neighbors and estimate the value of Y(X0) by averaging the values of these neighbors.
Formally, $h_m(X_0) = \left\|X_0 - X_{[m]}\right\|$, where $X_{[m]}$ is the $m$th closest point to $X_0$, and

$$D(t) = \begin{cases} 1/m & \text{if } |t| \le 1 \\ 0 & \text{otherwise} \end{cases}$$
Example:
In this example, X is one-dimensional. For each $X_0$, $\hat{Y}(X_0)$ is the average of the $m$ points closest to $X_0$. The result is not smooth enough.
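A minimal sketch of such a nearest-neighbor smoother for one-dimensional X (the function name `knn_smoother` is illustrative):

```python
import numpy as np

def knn_smoother(x0, X, Y, m):
    """Average the responses of the m nearest neighbors of x0 (one-dimensional X)."""
    nearest = np.argsort(np.abs(X - x0))[:m]   # indices of the m closest points
    return Y[nearest].mean()
```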
The idea of the kernel average smoother is the following. For each data point $X_0$, choose a constant distance size $\lambda$ (kernel radius, or window width for $p = 1$ dimension), and compute a weighted average of all data points that are closer than $\lambda$ to $X_0$ (the closer a point is to $X_0$, the higher its weight).

Formally, $h_\lambda(X_0) = \lambda = \text{constant}$, and $D(t)$ is one of the popular kernels.
Example:
For each $X_0$ the window width is constant, and the weight of each point in the window is schematically denoted by the yellow figure in the graph. It can be seen that the estimation is smooth, but the boundary points are biased. This happens because, when $X_0$ is close enough to the boundary, the window contains unequal numbers of points to its left and right.
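A sketch of a kernel average smoother using the parabolic (Epanechnikov) kernel and a constant radius `lam`; the function names are illustrative:

```python
import numpy as np

def epanechnikov(t):
    """Parabolic (Epanechnikov) kernel: 3/4 (1 - t^2) for |t| <= 1, else 0."""
    return np.where(np.abs(t) <= 1, 0.75 * (1 - t ** 2), 0.0)

def kernel_average_smoother(x0, X, Y, lam):
    """Weighted average over the points within the constant radius lam of x0."""
    w = epanechnikov(np.abs(X - x0) / lam)
    return np.sum(w * Y) / np.sum(w)
```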
See main article: Local regression.
In the two previous sections we assumed that the underlying Y(X) function is locally constant, and therefore we were able to use the weighted average for the estimation. The idea of local linear regression is to fit locally a straight line (or a hyperplane for higher dimensions), rather than a constant (horizontal line). After fitting the line, the estimation $\hat{Y}(X_0)$ is given by the value of this line at the point $X_0$. By repeating this procedure for each $X_0$, one obtains the estimation function $\hat{Y}(X)$. As in the previous section, the window width is constant: $h_\lambda(X_0) = \lambda = \text{constant}$. Formally, local linear regression is computed by solving a weighted least squares problem.
For one dimension (p = 1):
$$\min_{\alpha(X_0),\,\beta(X_0)} \sum_{i=1}^N K_{h_\lambda}(X_0, X_i)\left(Y(X_i) - \alpha(X_0) - \beta(X_0) X_i\right)^2$$

$$\Downarrow$$

$$\hat{Y}(X_0) = \alpha(X_0) + \beta(X_0) X_0$$
The closed form solution is given by:
$$\hat{Y}(X_0) = \left(1, X_0\right)\left(B^T W(X_0) B\right)^{-1} B^T W(X_0) y$$

where:

$y = \left(Y(X_1), \dots, Y(X_N)\right)^T$
$W(X_0) = \operatorname{diag}\left(K_{h_\lambda}(X_0, X_i)\right)_{N \times N}$
$B^T = \begin{pmatrix} 1 & 1 & \dots & 1 \\ X_1 & X_2 & \dots & X_N \end{pmatrix}$
The resulting function is smooth, and the problem with the biased boundary points is reduced.
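A direct transcription of the closed-form solution above into NumPy might look as follows, assuming one-dimensional X and a generic `kernel` weight function:

```python
import numpy as np

def local_linear(x0, X, Y, kernel):
    """Closed-form local linear estimate at x0 for p = 1.

    Evaluates Y_hat(x0) = (1, x0) (B^T W B)^{-1} B^T W y with
    B = [1, X] as the design matrix and W the diagonal kernel weights.
    """
    B = np.column_stack([np.ones_like(X), X])          # N x 2 design matrix
    W = np.diag([kernel(x0, xi) for xi in X])          # diagonal weight matrix
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ Y)   # (alpha(x0), beta(x0))
    return np.array([1.0, x0]) @ coef
```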
Local linear regression can be applied to any-dimensional space, though the question of what is a local neighborhood becomes more complicated. It is common to use k nearest training points to a test point to fit the local linear regression. This can lead to high variance of the fitted function. To bound the variance, the set of training points should contain the test point in their convex hull (see Gupta et al. reference).
Instead of fitting locally linear functions, one can fit polynomial functions. For $p = 1$, one should minimize:

$$\min_{\alpha(X_0),\,\beta_j(X_0),\,j=1,\dots,d} \sum_{i=1}^N K_{h_\lambda}(X_0, X_i)\left(Y(X_i) - \alpha(X_0) - \sum_{j=1}^d \beta_j(X_0) X_i^j\right)^2$$

with

$$\hat{Y}(X_0) = \alpha(X_0) + \sum_{j=1}^d \beta_j(X_0) X_0^j$$
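The same weighted least squares idea extends to the polynomial case; a sketch for $p = 1$ and degree $d$, again assuming a generic `kernel` weight function, could be:

```python
import numpy as np

def local_polynomial(x0, X, Y, kernel, d):
    """Local polynomial estimate of degree d at x0 for p = 1."""
    B = np.vander(X, N=d + 1, increasing=True)          # columns: 1, X, X^2, ..., X^d
    W = np.diag([kernel(x0, xi) for xi in X])           # diagonal kernel weights
    coef = np.linalg.solve(B.T @ W @ B, B.T @ W @ Y)    # (alpha, beta_1, ..., beta_d)
    powers = np.array([x0 ** j for j in range(d + 1)])  # 1, x0, ..., x0^d
    return powers @ coef
```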
In the general case ($p > 1$), one should minimize:
\begin{align} &\hat{\beta}(X0)=\underset{\beta(X0)}{\argmin