In statistics, the backfitting algorithm is a simple iterative procedure used to fit a generalized additive model. It was introduced in 1985 by Leo Breiman and Jerome Friedman along with generalized additive models. In most cases, the backfitting algorithm is equivalent to the Gauss–Seidel method, an algorithm used for solving a certain linear system of equations.
Additive models are a class of non-parametric regression models of the form:
Y_i = \alpha + \sum_{j=1}^{p} f_j(X_{ij}) + \epsilon_i
where each X_1, X_2, \ldots, X_p is a variable in our p-dimensional predictor X, and Y is our outcome variable. \epsilon represents our inherent error, which is assumed to have mean zero. The f_j represent unspecified smooth functions of a single X_j. Given the flexibility in the f_j, we typically do not have a unique solution: \alpha is left unidentifiable, as one can add any constant to any of the f_j and subtract this value from \alpha. It is common to rectify this by constraining
\sum_{i=1}^{N} f_j(X_{ij}) = 0 \text{ for all } j
leaving

\alpha = \frac{1}{N}\sum_{i=1}^{N} y_i

necessarily.
The backfitting algorithm is then:

Initialize \hat{\alpha} = \frac{1}{N}\sum_{i=1}^{N} y_i, \quad \hat{f_j} \equiv 0 \quad \forall j
Do until the \hat{f_j} converge:
  For each predictor j:
    (a) \hat{f_j} \leftarrow \text{Smooth}\left[\left\{ y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f_k}(x_{ik}) \right\}_{1}^{N}\right] (backfitting step)
    (b) \hat{f_j} \leftarrow \hat{f_j} - \frac{1}{N}\sum_{i=1}^{N} \hat{f_j}(x_{ij}) (mean centering of the estimated function)

where \text{Smooth} is a smoothing operator, typically chosen to be a cubic spline smoother, although any other appropriate fitting operation can be used.
In theory, step (b) in the algorithm is not needed as the function estimates are constrained to sum to zero. However, due to numerical issues this might become a problem in practice.[1]
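The following is a minimal Python sketch of the iteration just described. The crude running-mean smoother and the names backfit and running_mean_smoother are illustrative choices only; in practice a cubic spline or other smoother would be substituted.

```python
import numpy as np

def backfit(X, y, smoother, n_iter=50, tol=1e-6):
    """Fit the additive model y_i = alpha + sum_j f_j(x_ij) + eps_i by backfitting."""
    n, p = X.shape
    alpha = y.mean()                      # \hat{alpha} = (1/N) * sum_i y_i
    f = np.zeros((n, p))                  # \hat{f_j} evaluated at the data, initialised to 0
    for _ in range(n_iter):
        f_old = f.copy()
        for j in range(p):
            # (a) smooth the partial residuals y_i - alpha - sum_{k != j} f_k(x_ik) against X[:, j]
            r = y - alpha - f.sum(axis=1) + f[:, j]
            f[:, j] = smoother(X[:, j], r)
            # (b) centre the estimate so it averages to zero over the data
            f[:, j] -= f[:, j].mean()
        if np.max(np.abs(f - f_old)) < tol:
            break
    return alpha, f

def running_mean_smoother(x, r, k=15):
    """A deliberately crude nearest-neighbour running-mean smoother (stand-in for a spline)."""
    order = np.argsort(x)
    fitted = np.empty_like(r, dtype=float)
    for pos, idx in enumerate(order):
        lo, hi = max(0, pos - k), min(len(x), pos + k + 1)
        fitted[idx] = r[order[lo:hi]].mean()
    return fitted

# Toy data: y = 2 + sin(x1) + x2^2 + noise
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 2))
y = 2 + np.sin(X[:, 0]) + X[:, 1] ** 2 + rng.normal(0, 0.2, size=500)
alpha_hat, f_hat = backfit(X, y, running_mean_smoother)
```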
If we consider the problem of minimizing the expected squared error:
\min E\left[Y - \left(\alpha + \sum_{j=1}^{p} f_j(X_j)\right)\right]^2
There exists a unique solution by the theory of projections given by:
f_i(X_i) = E\left[\left. Y - \left(\alpha + \sum_{j \neq i} f_j(X_j)\right) \right| X_i\right]
for i = 1, 2, ..., p.
This gives the matrix interpretation:
\begin{pmatrix} I & P_1 & \cdots & P_1 \\ P_2 & I & \cdots & P_2 \\ \vdots & & \ddots & \vdots \\ P_p & \cdots & P_p & I \end{pmatrix} \begin{pmatrix} f_1(X_1) \\ f_2(X_2) \\ \vdots \\ f_p(X_p) \end{pmatrix} = \begin{pmatrix} P_1 Y \\ P_2 Y \\ \vdots \\ P_p Y \end{pmatrix}
where P_i(\,\cdot\,) = E(\,\cdot \mid X_i). In this context we can imagine a smoother matrix, S_i, which approximates our P_i and gives an estimate, S_i Y, of E(Y \mid X_i):
\begin{pmatrix} I & S_1 & \cdots & S_1 \\ S_2 & I & \cdots & S_2 \\ \vdots & & \ddots & \vdots \\ S_p & \cdots & S_p & I \end{pmatrix} \begin{pmatrix} f_1 \\ f_2 \\ \vdots \\ f_p \end{pmatrix} = \begin{pmatrix} S_1 Y \\ S_2 Y \\ \vdots \\ S_p Y \end{pmatrix}
or in abbreviated form
\hat{S}f=QY
An exact solution of this is infeasible to calculate for large np, so the iterative technique of backfitting is used. We take initial guesses
f_j^{(0)} and update each f_j^{(\ell)} in turn to be the smoothed fit of the residuals of all the others:

\hat{f_j}^{(\ell)} \leftarrow \text{Smooth}\left[\left\{ y_i - \hat{\alpha} - \sum_{k \neq j} \hat{f_k}(x_{ik}) \right\}_{1}^{N}\right]
Looking at the abbreviated form it is easy to see the backfitting algorithm as equivalent to the Gauss–Seidel method for linear smoothing operators S.
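To make this equivalence concrete, here is a small numerical sketch, with arbitrary contractive matrices standing in for the smoothers S_i, that assembles the block system \hat{S}f = QY and compares its direct solution with backfitting sweeps; the function names are illustrative, not from any library.

```python
import numpy as np

def block_system(S_list, Y):
    """Assemble the block system S-hat f = Q Y for linear smoother matrices S_i."""
    p, n = len(S_list), len(Y)
    S_hat = np.eye(p * n)
    for i, Si in enumerate(S_list):
        for j in range(p):
            if j != i:
                S_hat[i*n:(i+1)*n, j*n:(j+1)*n] = Si   # off-diagonal blocks of row i are S_i
    QY = np.concatenate([Si @ Y for Si in S_list])
    return S_hat, QY

def backfit_linear(S_list, Y, sweeps=300):
    """Each backfitting sweep sets f_i <- S_i (Y - sum_{j != i} f_j),
    i.e. one Gauss-Seidel sweep on the block system above."""
    p, n = len(S_list), len(Y)
    f = [np.zeros(n) for _ in range(p)]
    for _ in range(sweeps):
        for i, Si in enumerate(S_list):
            f[i] = Si @ (Y - sum(f[j] for j in range(p) if j != i))
    return np.concatenate(f)

# Arbitrary contractive "smoothers" for a quick numerical check
rng = np.random.default_rng(1)
n = 20
S_list = [0.4 * np.eye(n) + 0.2 * rng.random((n, n)) / n for _ in range(3)]
Y = rng.normal(size=n)
S_hat, QY = block_system(S_list, Y)
f_direct = np.linalg.solve(S_hat, QY)        # exact solution of the linear system
f_backfit = backfit_linear(S_list, Y)        # backfitting / Gauss-Seidel iterate
print(np.max(np.abs(f_direct - f_backfit)))  # close to zero after enough sweeps
```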
Following [2], we can formulate the backfitting algorithm explicitly for the two-dimensional case. We have:
f_1 = S_1(Y - f_2), \quad f_2 = S_2(Y - f_1)
If we denote \hat{f}_1^{(i)} as the estimate of f_1 in the i-th updating step, the backfitting steps are

\hat{f}_1^{(i)} = S_1\left[Y - \hat{f}_2^{(i-1)}\right], \qquad \hat{f}_2^{(i)} = S_2\left[Y - \hat{f}_1^{(i)}\right]
By induction we get
\hat{f}_1^{(i)} = Y - \sum_{\alpha=0}^{i-1}(S_1 S_2)^{\alpha}(I - S_1)\,Y - (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}
and
\hat{f}_2^{(i)} = S_2 \sum_{\alpha=0}^{i-1}(S_1 S_2)^{\alpha}(I - S_1)\,Y + S_2 (S_1 S_2)^{i-1} S_1 \hat{f}_2^{(0)}
If we set \hat{f}_2^{(0)} = 0 we get
\hat{f}_1^{(i)} = Y - S_2^{-1}\hat{f}_2^{(i)} = \left[I - \sum_{\alpha=0}^{i-1}(S_1 S_2)^{\alpha}(I - S_1)\right] Y
\hat{f}_2^{(i)} = \left[S_2 \sum_{\alpha=0}^{i-1}(S_1 S_2)^{\alpha}(I - S_1)\right] Y
where we have solved for \hat{f}_1^{(i)} by substituting directly from f_2 = S_2(Y - f_1).
We have convergence if \|S_1 S_2\| < 1. In this case, letting \hat{f}_1^{(i)}, \hat{f}_2^{(i)} \xrightarrow{} \hat{f}_1^{(\infty)}, \hat{f}_2^{(\infty)}, we obtain

\hat{f}_1^{(\infty)} = Y - S_2^{-1}\hat{f}_2^{(\infty)} = Y - (I - S_1 S_2)^{-1}(I - S_1)\,Y

\hat{f}_2^{(\infty)} = S_2 (I - S_1 S_2)^{-1}(I - S_1)\,Y
We can check this is a solution to the problem, i.e. that
\hat{f}_1^{(i)} and \hat{f}_2^{(i)} converge to f_1 and f_2 respectively, by plugging these expressions into the original equations.
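As a numerical sanity check, the following sketch, again with arbitrary contractive matrices standing in for S_1 and S_2, iterates the two updates and compares the result with the closed-form limits above.

```python
import numpy as np

# Arbitrary contractive matrices standing in for the smoothers S1, S2
rng = np.random.default_rng(2)
n = 30
S1 = 0.4 * np.eye(n) + 0.2 * rng.random((n, n)) / n
S2 = 0.4 * np.eye(n) + 0.2 * rng.random((n, n)) / n
Y = rng.normal(size=n)
assert np.linalg.norm(S1 @ S2, 2) < 1          # convergence condition ||S1 S2|| < 1

# Iterate f1 <- S1(Y - f2), f2 <- S2(Y - f1), starting from f2 = 0
f1, f2 = np.zeros(n), np.zeros(n)
for _ in range(500):
    f1 = S1 @ (Y - f2)
    f2 = S2 @ (Y - f1)

# Closed-form limits derived above
I = np.eye(n)
f1_inf = Y - np.linalg.solve(I - S1 @ S2, (I - S1) @ Y)
f2_inf = S2 @ np.linalg.solve(I - S1 @ S2, (I - S1) @ Y)
print(np.max(np.abs(f1 - f1_inf)), np.max(np.abs(f2 - f2_inf)))  # both near zero
```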
The choice of when to stop the algorithm is arbitrary, and it is hard to know a priori how long reaching a specific convergence threshold will take. Also, the final model depends on the order in which the predictor variables X_i are fit.
As well, the solution found by the backfitting procedure is non-unique. If b is a vector such that \hat{S}b = 0 from above, then if \hat{f} is a solution, so is \hat{f} + \alpha b for any \alpha \in \mathbb{R}.
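A tiny illustration of this non-uniqueness, assuming two smoothers that both reproduce constant vectors (as running-mean or spline smoothers do): shifting f_1 up by a constant and f_2 down by the same constant yields another exact solution.

```python
import numpy as np

n = 10
S1 = np.full((n, n), 1.0 / n)          # both smoothers reproduce constant vectors
S2 = np.full((n, n), 1.0 / n)
c = np.ones(n)
b = np.concatenate([c, -c])            # candidate null vector b = (c, -c)
S_hat = np.block([[np.eye(n), S1], [S2, np.eye(n)]])
print(np.allclose(S_hat @ b, 0))       # True: b is in the null space, so f + alpha*b also solves the system
```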
We can modify the backfitting algorithm to make it easier to provide a unique solution. Let
\mathcal{V}_1(S_i) denote the space spanned by all the eigenvectors of S_i that correspond to eigenvalue 1. Then any b satisfying \hat{S}b = 0 has b_i \in \mathcal{V}_1(S_i)\ \forall i = 1, \ldots, p and \sum_{i=1}^{p} b_i = 0. Now if we take A to be a matrix that projects orthogonally onto \mathcal{V}_1(S_1) + \ldots + \mathcal{V}_1(S_p), we get the following modified backfitting algorithm:
Initialize \hat{\alpha} = \frac{1}{N}\sum_{1}^{N} y_i, \quad \hat{f_j} \equiv 0\ \forall i, j, \quad \hat{f_+} = \hat{\alpha} + \hat{f_1} + \ldots + \hat{f_p}
Do until the \hat{f_j} converge:
  Regress y - \hat{f_+} onto the space \mathcal{V}_1(S_1) + \ldots + \mathcal{V}_1(S_p), setting a = A(Y - \hat{f_+})
  For each predictor j:
    Apply the backfitting update to (Y - a) using the smoothing operator (I - A_i)S_i, which yields new estimates for \hat{f_j}
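A short sketch of how the projections used here might be constructed in practice, assuming symmetric smoother matrices; eig1_basis and orth_projector are hypothetical helper names, and only the construction of A (not the full modified loop) is shown.

```python
import numpy as np

def eig1_basis(S, tol=1e-8):
    """Columns spanning V_1(S): the eigenspace of the smoother S for eigenvalue 1."""
    w, V = np.linalg.eigh(S)                 # assumes a symmetric smoother matrix
    return V[:, np.abs(w - 1.0) < tol]

def orth_projector(*bases):
    """Orthogonal projector A onto the sum of the spaces spanned by the given bases."""
    B = np.hstack(bases)
    U, s, _ = np.linalg.svd(B, full_matrices=False)
    Q = U[:, s > 1e-8]                       # orthonormal basis for the column space
    return Q @ Q.T

# Example: a mean smoother reproduces constants exactly, so its eigenvalue-1
# eigenspace is the constant direction and A projects onto constant vectors.
n = 10
S1 = np.full((n, n), 1.0 / n)
S2 = np.full((n, n), 1.0 / n)
A = orth_projector(eig1_basis(S1), eig1_basis(S2))
print(np.allclose(A, S1))                    # True in this toy case
```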