In statistics, the backfitting algorithm is a simple iterative procedure used to fit a generalized additive model. It was introduced in 1985 by Leo Breiman and Jerome Friedman along with generalized additive models. In most cases, the backfitting algorithm is equivalent to the Gauss–Seidel method, an algorithm used for solving a certain linear system of equations.
Additive models are a class of non-parametric regression models of the form:
Y_i = \alpha + \sum_{j=1}^{p} f_j(X_{ij}) + \epsilon_i
where each X_1, X_2, \ldots, X_p is a variable in our p-dimensional predictor X, and Y is our outcome variable. \epsilon represents our inherent error, which is assumed to have mean zero. The f_j represent unspecified smooth functions of a single X_j. Given the flexibility in the f_j, we typically do not have a unique solution: \alpha is left unidentifiable, as one can add any constant to any of the f_j and subtract this value from \alpha. It is common to rectify this by constraining
\sum_{i=1}^{N} f_j(X_{ij}) = 0 \text{ for all } j
leaving

\alpha = \frac{1}{N}\sum_{i=1}^{N} y_i

necessarily.
The backfitting algorithm is then:

Initialize \hat{\alpha} = \frac{1}{N}\sum_{i=1}^{N} y_i, \quad \hat{f_j} \equiv 0 \quad \forall j
Do until the \hat{f_j} converge:
  For each predictor j:
    (a) \hat{f_j} \leftarrow \text{Smooth}\left[\left\{ y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f_k}(x_{ik}) \right\}_{1}^{N}\right] (backfitting step)
    (b) \hat{f_j} \leftarrow \hat{f_j} - \frac{1}{N}\sum_{i=1}^{N} \hat{f_j}(x_{ij}) (mean centering of the estimated function)

where \text{Smooth} is a smoothing operator, typically chosen to be a cubic spline smoother, although any other appropriate fitting operation can be used.
In theory, step (b) in the algorithm is not needed as the function estimates are constrained to sum to zero. However, due to numerical issues this might become a problem in practice.[1]
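The following is a minimal Python sketch of the iteration just described. The crude running-mean smoother and the names backfit and running_mean_smoother are illustrative choices only; in practice a cubic spline or other smoother would be substituted.

```python
import numpy as np

def backfit(X, y, smoother, n_iter=50, tol=1e-6):
    """Fit the additive model y_i = alpha + sum_j f_j(x_ij) + eps_i by backfitting."""
    n, p = X.shape
    alpha = y.mean()                      # \hat{alpha} = (1/N) * sum_i y_i
    f = np.zeros((n, p))                  # \hat{f_j} evaluated at the data, initialised to 0
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # (a) smooth the partial residuals y_i - alpha - sum_{k != j} f_k(x_ik) against X[:, j]
            r = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = smoother(X[:, j], r)
            # (b) centre the estimate so it averages to zero over the data
            f[:, j] -= f[:, j].mean()
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f

def running_mean_smoother(x, r, k=15):
    """A deliberately crude nearest-neighbour running-mean smoother (stand-in for a spline)."""
    order = np.argsort(x)
    fitted = np.empty_like(r, dtype=float)
    for pos, idx in enumerate(order):
        lo, hi = max(0, pos - k), min(len(x), pos + k + 1)
        fitted[idx] = r[order[lo:hi]].mean()
    return fitted

# Toy data: y = 2 + sin(x1) + x2^2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = 2 + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, size=500)
alpha_hat, f_hat = backfit(X, y, running_mean_smoother)
```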
If we consider the problem of minimizing the expected squared error:
\min E\left[Y - \left(\alpha + \sum_{j=1}^{p} f_j(X_j)\right)\right]^2
There exists a unique solution by the theory of projections given by:
f_i(X_i) = E\left[\left. Y - \left(\alpha + \sum_{j \neq i} f_j(X_j)\right) \right| X_i\right]
for i = 1, 2, ..., p.
This gives the matrix interpretation:
\begin{pmatrix} I & P_1 & \cdots & P_1 \\ P_2 & I & \cdots & P_2 \\ \vdots & & \ddots & \vdots \\ P_p & \cdots & P_p & I \end{pmatrix} \begin{pmatrix} f_1(X_1) \\ f_2(X_2) \\ \vdots \\ f_p(X_p) \end{pmatrix} = \begin{pmatrix} P_1 Y \\ P_2 Y \\ \vdots \\ P_p Y \end{pmatrix}
where P_i(\,\cdot\,) = E(\,\cdot \mid X_i). In this context we can imagine a smoother matrix, S_i, which approximates our P_i and gives an estimate, S_i Y, of E(Y \mid X_i):
\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_p & \cdots & S_p & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 Y \\ S_2 Y \\ \vdots \\ S_p Y \end{pmatrix}
or in abbreviated form
\hat{S}f=QY
An exact solution of this is infeasible to calculate for large np, so the iterative technique of backfitting is used. We take initial guesses
f_j^{(0)} and update each f_j^{(\ell)} in turn to be the smoothed fit of the residuals of all the others:

\hat{f_j}^{(\ell)} \leftarrow \text{Smooth}\left[\left\{ y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f_k}(x_{ik}) \right\}_{1}^{N}\right]
Looking at the abbreviated form it is easy to see the backfitting algorithm as equivalent to the Gauss–Seidel method for linear smoothing operators S.
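To make this equivalence concrete, here is a small numerical sketch, with arbitrary contractive matrices standing in for the smoothers S_i, that assembles the block system \hat{S}f = QY and compares its direct solution with backfitting sweeps; the function names are illustrative, not from any library.

```python
import numpy as np

def block_system(S_list, Y):
    """Assemble the block system S-hat f = Q Y for linear smoother matrices S_i."""
    p, n = len(S_list), len(Y)
    S_hat = np.eye(p * n)
    for i, Si in enumerate(S_list):
        for j in range(p):
            if j != i:
                S_hat[i*n:(i+1)*n, j*n:(j+1)*n] = Si   # off-diagonal blocks of row i are S_i
    QY = np.concatenate([Si @ Y for Si in S_list])
    return S_hat, QY

def backfit_linear(S_list, Y, sweeps=300):
    """Each backfitting sweep sets f_i <- S_i (Y - sum_{j != i} f_j),
    i.e. one Gauss-Seidel sweep on the block system above."""
    p, n = len(S_list), len(Y)
    f = [np.zeros(n) for _ in range(p)]
    for _ in range(sweeps):
        for i, Si in enumerate(S_list):
            f[i] = Si @ (Y - sum(f[j] for j in range(p) if j != i))
    return np.concatenate(f)

# Arbitrary contractive "smoothers" for a quick numerical check
rng = np.random.default_rng(1)
n = 20
S_list = [0.4 * np.eye(n) + 0.2 * rng.random((n, n)) / n for _ in range(3)]
Y = rng.normal(size=n)
S_hat, QY = block_system(S_list, Y)
f_direct = np.linalg.solve(S_hat, QY)        # exact solution of the linear system
f_backfit = backfit_linear(S_list, Y)        # backfitting / Gauss-Seidel iterate
print(np.max(np.abs(f_direct - f_backfit)))  # close to zero after enough sweeps
```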
Following [2], we can formulate the backfitting algorithm explicitly for the two-dimensional case. We have:
f_1 = S_1(Y - f_2), \quad f_2 = S_2(Y - f_1)
If we denote \hat{f}_1^{(i)} as the estimate of f_1 in the i-th updating step, the backfitting steps are

\hat{f}_1^{(i)} = S_1\left[Y - \hat{f}_2^{(i-1)}\right], \qquad \hat{f}_2^{(i)} = S_2\left[Y - \hat{f}_1^{(i)}\right]
By induction we get
\hat{f}_1^{(i)} = Y - \sum_{\alpha=0}^{i-1}(S_1 S_2)^{\alpha}(I - S_1)\,Y - (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}
and
\hat{f}_2^{(i)} = S_2 \sum_{\alpha=0}^{i-1}(S_1 S_2)^{\alpha}(I - S_1)\,Y + S_2 (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}
If we set \hat{f}_2^{(0)} = 0 we get
\hat{f}_1^{(i)} = Y - S_2^{-1}\hat{f}_2^{(i)} = \left[I - \sum_{\alpha=0}^{i-1}(S_1 S_2)^{\alpha}(I - S_1)\right] Y
\hat{f}_2^{(i)} = \left[S_2 \sum_{\alpha=0}^{i-1}(S_1 S_2)^{\alpha}(I - S_1)\right] Y
where we have solved for \hat{f}_1^{(i)} by substituting directly from f_2 = S_2(Y - f_1).
We have convergence if \|S_1 S_2\| < 1. In this case, letting \hat{f}_1^{(i)}, \hat{f}_2^{(i)} \xrightarrow{} \hat{f}_1^{(\infty)}, \hat{f}_2^{(\infty)}, we obtain

\hat{f}_1^{(\infty)} = Y - S_2^{-1}\hat{f}_2^{(\infty)} = Y - (I - S_1 S_2)^{-1}(I - S_1)\,Y

\hat{f}_2^{(\infty)} = S_2 (I - S_1 S_2)^{-1}(I - S_1)\,Y
We can check this is a solution to the problem, i.e. that
\hat{f}_1^{(i)} and \hat{f}_2^{(i)} converge to f_1 and f_2 respectively, by plugging these expressions into the original equations.
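As a numerical sanity check, the following sketch, again with arbitrary contractive matrices standing in for S_1 and S_2, iterates the two updates and compares the result with the closed-form limits above.

```python
import numpy as np

# Arbitrary contractive matrices standing in for the smoothers S1, S2
rng = np.random.default_rng(2)
n = 30
S1 = 0.4 * np.eye(n) + 0.2 * rng.random((n, n)) / n
S2 = 0.4 * np.eye(n) + 0.2 * rng.random((n, n)) / n
Y = rng.normal(size=n)
assert np.linalg.norm(S1 @ S2, 2) < 1          # convergence condition ||S1 S2|| < 1

# Iterate f1 <- S1(Y - f2), f2 <- S2(Y - f1), starting from f2 = 0
f1, f2 = np.zeros(n), np.zeros(n)
for _ in range(500):
    f1 = S1 @ (Y - f2)
    f2 = S2 @ (Y - f1)

# Closed-form limits derived above
I = np.eye(n)
f1_inf = Y - np.linalg.solve(I - S1 @ S2, (I - S1) @ Y)
f2_inf = S2 @ np.linalg.solve(I - S1 @ S2, (I - S1) @ Y)
print(np.max(np.abs(f1 - f1_inf)), np.max(np.abs(f2 - f2_inf)))  # both near zero
```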
The choice of when to stop the algorithm is arbitrary, and it is hard to know a priori how long reaching a specific convergence threshold will take. Also, the final model depends on the order in which the predictor variables X_i are fit.
As well, the solution found by the backfitting procedure is non-unique. If b is a vector such that \hat{S}b = 0 from above, then if \hat{f} is a solution, so is \hat{f} + \alpha b for any \alpha \in \mathbb{R}.
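A tiny illustration of this non-uniqueness, assuming two smoothers that both reproduce constant vectors (as running-mean or spline smoothers do): shifting f_1 up by a constant and f_2 down by the same constant yields another exact solution.

```python
import numpy as np

n = 10
S1 = np.full((n, n), 1.0 / n)          # both smoothers reproduce constant vectors
S2 = np.full((n, n), 1.0 / n)
c = np.ones(n)
b = np.concatenate([c, -c])            # candidate null vector b = (c, -c)
S_hat = np.block([[np.eye(n), S1], [S2, np.eye(n)]])
print(np.allclose(S_hat @ b, 0))       # True: b is in the null space, so f + alpha*b also solves the system
```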
We can modify the backfitting algorithm to make it easier to provide a unique solution. Let
\mathcal{V}_1(S_i) denote the space spanned by all the eigenvectors of S_i that correspond to eigenvalue 1. Then any b satisfying \hat{S}b = 0 has b_i \in \mathcal{V}_1(S_i)\ \forall i = 1, \ldots, p and \sum_{i=1}^{p} b_i = 0. Now if we take A to be a matrix that projects orthogonally onto \mathcal{V}_1(S_1) + \ldots + \mathcal{V}_1(S_p), we get the following modified backfitting algorithm:
Initialize \hat{\alpha} = \frac{1}{N}\sum_{1}^{N} y_i, \quad \hat{f_j} \equiv 0\ \forall i, j, \quad \hat{f_+} = \hat{\alpha} + \hat{f_1} + \ldots + \hat{f_p}
Do until the \hat{f_j} converge:
  Regress y - \hat{f_+} onto the space \mathcal{V}_1(S_1) + \ldots + \mathcal{V}_1(S_p), setting a = A(Y - \hat{f_+})
  For each predictor j:
    Apply the backfitting update to (Y - a) using the smoothing operator (I - A_i)S_i, which yields new estimates for \hat{f_j}
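A short sketch of how the projections used here might be constructed in practice, assuming symmetric smoother matrices; eig1_basis and orth_projector are hypothetical helper names, and only the construction of A (not the full modified loop) is shown.

```python
import numpy as np

def eig1_basis(S, tol=1e-8):
    """Columns spanning V_1(S): the eigenspace of the smoother S for eigenvalue 1."""
    w, V = np.linalg.eigh(S)                 # assumes a symmetric smoother matrix
    return V[:, np.abs(w - 1.0) < tol]

def orth_projector(*bases):
    """Orthogonal projector A onto the sum of the spaces spanned by the given bases."""
    B = np.hstack(bases)
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    Q = U[:, s > 1e-8]                       # orthonormal basis for the column space
    return Q @ Q.T

# Example: a mean smoother reproduces constants exactly, so its eigenvalue-1
# eigenspace is the constant direction and A projects onto constant vectors.
n = 10
S1 = np.full((n, n), 1.0 / n)
S2 = np.full((n, n), 1.0 / n)
A = orth_projector(eig1_basis(S1), eig1_basis(S2))
print(np.allclose(A, S1))                    # True in this toy case
```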