In statistics, a generalized additive model (GAM) is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions. GAMs were originally developed by Trevor Hastie and Robert Tibshirani[1] to blend properties of generalized linear models with additive models. They can be interpreted as the discriminative generalization of the naive Bayes generative model.[2]
The model relates a univariate response variable, Y, to some predictor variables, x_i. An exponential family distribution is specified for Y (for example normal, binomial or Poisson distributions) along with a link function g (for example the identity or log functions) relating the expected value of Y to the predictor variables via a structure such as
g(\operatorname{E}(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m).
The functions f_i may be functions with a specified parametric form (for example a polynomial, or an un-penalized regression spline of a variable) or may be specified non-parametrically, or semi-parametrically, simply as 'smooth functions', to be estimated by non-parametric means. So a typical GAM might use a scatterplot smoothing function, such as a locally weighted mean, for f_1(x_1), and then use a factor model for f_2(x_2). This flexibility to allow non-parametric fits with relaxed assumptions on the actual relationship between response and predictor provides the potential for better fits to data than purely parametric models, but arguably with some loss of interpretability.
It had been known since the 1950s (via the Kolmogorov–Arnold representation theorem) that any multivariate continuous function could be represented as sums and compositions of univariate functions,
f(\vec{x}) = \sum_{q=0}^{2n} \Phi_q\left(\sum_{p=1}^{n} \phi_{q,p}(x_p)\right).
Unfortunately, though the Kolmogorov–Arnold representation theorem asserts the existence of a function of this form, it gives no mechanism whereby one could be constructed. Certain constructive proofs exist, but they tend to require highly complicated (i.e. fractal) functions, and thus are not suitable for modeling approaches. Therefore, the generalized additive model drops the outer sum, and demands instead that the function belong to a simpler class,
f(\vec{x}) = \Phi\left(\sum_{p=1}^{n} \phi_p(x_p)\right),
where \Phi is a smooth monotonic function. Writing g for the inverse of \Phi, this is traditionally written as
g(f(\vec{x})) = \sum_i f_i(x_i).
When this function is approximating the expectation of some observed quantity, it could be written as
g(\operatorname{E}(Y)) = \beta_0 + f_1(x_1) + f_2(x_2) + \cdots + f_m(x_m).
This is the standard formulation of a generalized additive model. It was then shown that the backfitting algorithm will always converge for these functions.
The GAM model class is quite broad, given that smooth function is a rather broad category. For example, a covariate x_j may itself be a vector and the corresponding f_j a smooth function of several variables, or f_j might be the function mapping the level of a factor to the value of a random effect associated with it. Another example is a varying coefficient (geographic regression) term such as z_j f_j(x_j), where z_j and x_j are both covariates. Or, if x_j(t) is itself an observation of a function, we might include a term such as \int f_j(t) x_j(t) \, dt (sometimes known as a signal regression term). f_j could also be a simple parametric function as might be used in any generalized linear model.
The original GAM fitting method estimated the smooth components of the model using non-parametric smoothers (for example smoothing splines or local linear regression smoothers) via the backfitting algorithm. Backfitting works by iterative smoothing of partial residuals and provides a very general modular estimation method capable of using a wide variety of smoothing methods to estimate the f_j(x_j) terms.
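To make the idea of iterative smoothing of partial residuals concrete, the following is a minimal sketch of backfitting for a two-term Gaussian additive model; the simulated data, the use of loess as the scatterplot smoother and the fixed number of iterations are assumptions made purely for illustration.

# Minimal backfitting sketch for y = alpha + f1(x1) + f2(x2) + noise (Gaussian case).
set.seed(1)
n  <- 300
x1 <- runif(n); x2 <- runif(n)
y  <- sin(2 * pi * x1) + (x2 - 0.5)^2 + rnorm(n, sd = 0.3)   # simulated data
alpha <- mean(y)                   # intercept estimate
f1 <- f2 <- rep(0, n)              # initial estimates of the smooth terms
for (it in 1:20) {
  r1 <- y - alpha - f2             # partial residuals for term 1
  f1 <- fitted(loess(r1 ~ x1)); f1 <- f1 - mean(f1)   # smooth, then centre
  r2 <- y - alpha - f1             # partial residuals for term 2
  f2 <- fitted(loess(r2 ~ x2)); f2 <- f2 - mean(f2)
}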
If the f_j(x_j) are represented using full spline smoothers then the computational cost of estimation is typically O(n^3), where n is the number of observations, which becomes prohibitive for large datasets.
An alternative approach with particular advantages in high dimensional settings is to use boosting, although this typically requires bootstrapping for uncertainty quantification.[14] [15] GAMs fit using bagging and boosting have been found to generally outperform GAMs fit using spline methods.[16]
Many modern implementations of GAMs and their extensions are built around the reduced rank smoothing approach, because it allows well founded estimation of the smoothness of the component smooths at comparatively modest computational cost, and also facilitates implementation of a number of model extensions in a way that is more difficult with other methods. At its simplest the idea is to replace the unknown smooth functions in the model with basis expansions
f_j(x_j) = \sum_{k=1}^{K_j} \beta_{jk} b_{jk}(x_j),
where the b_{jk}(x_j) are known basis functions, usually chosen for their good approximation theoretic properties, and the \beta_{jk} are coefficients to be estimated as part of model fitting. The basis dimension K_j is chosen somewhat larger than is believed to be strictly necessary, to avoid bias from over-simplification, but small enough to retain computational efficiency. If p = \sum_j K_j then the computational cost of model estimation this way is O(np^2).
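As an illustration of what a basis expansion looks like in practice, the following sketch uses the smoothCon function from the R package mgcv (discussed further below) to construct the basis functions b_{jk}(x_j) and the matching penalty matrix for a single smooth; the simulated covariate and the choice of a rank-10 cubic regression spline basis are arbitrary.

# Construct a rank-10 basis and its wiggliness penalty for a single smooth of x.
library(mgcv)
set.seed(1)
x  <- runif(200)
sm <- smoothCon(s(x, k = 10, bs = "cr"), data = data.frame(x = x))[[1]]
X  <- sm$X        # 200 x 10 matrix: column k holds the basis function b_k evaluated at x
S  <- sm$S[[1]]   # 10 x 10 penalty matrix: coef' S coef gives the wiggliness penalty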
Notice that the f_j are only identifiable to within an additive constant: we could add any constant to f_1 while subtracting it from f_2 without changing the model predictions, so identifiability constraints have to be imposed on the model terms before fitting. The sharpest inference is usually obtained by using the sum-to-zero constraints \sum_i f_j(x_{ji}) = 0, i.e. by insisting that the sum of each f_j evaluated at its observed covariate values should be zero. Such constraints are most easily imposed by reparametrization at the basis setup stage, so below it is assumed that this has been done.
Having replaced all the f_j in the model with such basis expansions, the GAM becomes a generalized linear model (GLM), with a model matrix that simply contains the basis functions evaluated at the observed x_j values. However, because the K_j are chosen somewhat larger than is believed to be necessary, the model is over-parameterized and will overfit the data if estimated as a regular GLM. The solution is to penalize departure from smoothness during fitting. Writing all the coefficients in one vector, \beta, and letting D(\beta) denote the model deviance, instead of minimizing D(\beta) we minimize the penalized objective
D(\beta) + \sum_j \lambda_j \int f_j''(x)^2 \, dx,
where the integrated squared second derivative penalties measure the wiggliness of the f_j, and the smoothing parameters \lambda_j control the trade-off between goodness of fit and smoothness. In the limit \lambda_j \to \infty the estimate of f_j(x_j) is forced to be a straight line in x_j.
Given the basis expansion for each f_j, the wiggliness penalties can be expressed as quadratic forms in the model coefficients, that is
\int f_j''(x)^2 \, dx = \beta_j^{T} \bar{S}_j \beta_j = \beta^{T} S_j \beta,
where \bar{S}_j is a matrix of known coefficients computable from the penalty and the basis, \beta_j is the vector of coefficients for f_j, and S_j is \bar{S}_j padded with zeros so that the second equality holds and the penalty can be written in terms of the full coefficient vector \beta. Many other smoothing penalties can be written in the same way. Given the smoothing parameters, the model fitting problem becomes
\hat{\beta} = \operatorname*{argmin}_{\beta} \left\{ D(\beta) + \sum_j \lambda_j \beta^{T} S_j \beta \right\},
which can be solved using a penalized version of the iteratively re-weighted least squares (IRLS) algorithm used for GLMs.
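A minimal sketch of this penalized fitting problem, for a single Gaussian smooth with identity link (so the deviance is just a residual sum of squares and penalized IRLS reduces to a single penalized least squares solve), might look as follows; the smoothing parameter is fixed rather than estimated, the data are simulated, and the basis and penalty are obtained with mgcv's smoothCon as above.

# Penalized least squares for one Gaussian smooth: minimize ||y - X b||^2 + lambda * b' S b.
library(mgcv)
set.seed(1)
n  <- 200
x  <- runif(n)
y  <- sin(2 * pi * x) + rnorm(n, sd = 0.3)               # simulated data
sm <- smoothCon(s(x, k = 20), data = data.frame(x = x))[[1]]
X  <- sm$X; S <- sm$S[[1]]                               # basis and penalty matrices
lambda   <- 1                                            # fixed smoothing parameter
beta_hat <- solve(t(X) %*% X + lambda * S, t(X) %*% y)   # penalized coefficient estimate
f_hat    <- X %*% beta_hat                               # the estimated smooth evaluated at x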
Penalization has several effects on inference, relative to a regular GLM. For one thing the estimates are subject to some smoothing bias, which is the price that must be paid for limiting estimator variance by penalization. However, if smoothing parameters are selected appropriately the (squared) smoothing bias introduced by penalization should be less than the reduction in variance that it produces, so that the net effect is a reduction in mean square estimation error, relative to not penalizing. A related effect of penalization is that the notion of degrees of freedom of a model has to be modified to account for the penalties' action in reducing the coefficients' freedom to vary. For example, if
W is the diagonal matrix of IRLS weights at convergence and X is the model matrix of the GAM, then the effective degrees of freedom of the model is given by \operatorname{trace}(F), where
F = \left(X^{T}WX + \sum_j \lambda_j S_j\right)^{-1} X^{T}WX,
the effective degrees of freedom matrix. In fact, summing only the diagonal elements of F corresponding to the coefficients of f_j gives the effective degrees of freedom for the estimate of f_j.
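The following small sketch, continuing in the same simulated setting (single Gaussian smooth, so W = I), shows how the effective degrees of freedom \operatorname{trace}(F) shrinks as the smoothing parameter grows; everything here is illustrative rather than prescriptive.

# Effective degrees of freedom trace(F), with F = (X'X + lambda S)^{-1} X'X (W = I here).
library(mgcv)
set.seed(1)
x  <- runif(200)
sm <- smoothCon(s(x, k = 20), data = data.frame(x = x))[[1]]
X  <- sm$X; S <- sm$S[[1]]
XtX <- crossprod(X)                                    # X'X
for (lambda in c(0.01, 1, 100)) {
  F_mat <- solve(XtX + lambda * S, XtX)                # effective degrees of freedom matrix
  cat("lambda =", lambda, " edf =", round(sum(diag(F_mat)), 2), "\n")
}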
Smoothing bias complicates interval estimation for these models, and the simplest approach turns out to be a Bayesian one.[17] [18] [19] [20] Understanding this Bayesian view of smoothing also helps to understand the REML and full Bayes approaches to smoothing parameter estimation. At some level smoothing penalties are imposed because we believe smooth functions to be more probable than wiggly ones, and if that is true then we might as well formalize this notion by placing a prior on model wiggliness. A very simple prior might be
\pi(\beta) \propto \exp\left\{-\beta^{T}\left(\sum_j \lambda_j S_j\right)\beta/(2\phi)\right\},
where \phi is the GLM scale parameter. This is recognizable as a multivariate normal prior with mean 0 and precision matrix S_\lambda = \sum_j \lambda_j S_j/\phi. Since some functions are left entirely un-penalized (for example straight lines, under the second derivative penalty above), S_\lambda is rank deficient and the prior is actually improper, with a covariance matrix given by the Moore–Penrose pseudoinverse of S_\lambda.
Now if this prior is combined with the GLM likelihood, we find that the posterior mode for \beta is exactly the penalized likelihood estimate \hat{\beta}. Furthermore, we have the large sample result that
\beta|y \sim N\left(\hat{\beta}, (X^{T}WX + S_\lambda)^{-1}\phi\right),
which can be used to construct credible/confidence intervals for the smooth components f_j.
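In practice such intervals are usually obtained from fitted model objects rather than computed by hand; the sketch below, using simulated data and the mgcv package, extracts the Bayesian posterior covariance matrix and approximate pointwise intervals for a fitted smooth.

# Bayesian posterior covariance and approximate 95% intervals for a fitted smooth.
library(mgcv)
set.seed(1)
dat   <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)
b  <- gam(y ~ s(x), data = dat)
Vb <- vcov(b)                          # posterior covariance matrix of the coefficients
pr <- predict(b, se.fit = TRUE)        # fitted values with posterior standard errors
upper <- pr$fit + 2 * pr$se.fit        # approximate 95% credible interval limits
lower <- pr$fit - 2 * pr$se.fit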
So far we have treated estimation and inference given the smoothing parameters, \lambda, but these must also be estimated. One approach is fully Bayesian: place priors on the (log) smoothing parameters and use stochastic simulation or higher order approximation to obtain the joint posterior of \beta and the smoothing parameters. An alternative is to choose the smoothing parameters to optimize a prediction error criterion such as generalized cross validation (GCV), or to maximize the marginal likelihood (REML) obtained by integrating the coefficients \beta out of the joint density of \beta, y,
\hat{\lambda} = \operatorname*{argmax}_{\lambda} \int f(y|\beta,\lambda)\,\pi(\beta|\lambda)\, d\beta.
Since f(y|\beta,\lambda) is just the likelihood of \beta, this can be viewed as choosing \lambda to maximize the average likelihood of random draws from the prior; the integral is usually analytically intractable but can be approximated quite accurately using Laplace's method.
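As an illustration of the practical difference, the sketch below fits the same simulated model under a prediction error criterion (mgcv's default, which is GCV based) and under REML, and compares the estimated smoothing parameters; the data and model are hypothetical.

# Smoothing parameter selection by a GCV-type criterion (default) versus REML.
library(mgcv)
set.seed(1)
dat   <- data.frame(x = runif(200))
dat$y <- sin(2 * pi * dat$x) + rnorm(200, sd = 0.3)
b_gcv  <- gam(y ~ s(x), data = dat)                   # default: GCV-based criterion
b_reml <- gam(y ~ s(x), data = dat, method = "REML")  # Laplace approximate marginal likelihood
b_gcv$sp; b_reml$sp                                   # the estimated smoothing parameters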
Smoothing parameter inference is the most computationally taxing part of model estimation/inference. For example, to optimize a GCV or marginal likelihood typically requires numerical optimization via a Newton or Quasi-Newton method, with each trial value for the (log) smoothing parameter vector requiring a penalized IRLS iteration to evaluate the corresponding
\hat{\beta} and the remaining ingredients of the criterion. Obtaining the derivatives of the criterion with respect to the log smoothing parameters, as required for optimization, in turn requires the derivatives of \hat{\beta}, usually via implicit differentiation, and care is needed to preserve numerical stability and efficiency.
Backfit GAMs were originally provided by the gam function in S,[23] now ported to the R language as the gam package. The SAS proc GAM also provides backfit GAMs. The recommended package in R for GAMs is mgcv, which stands for mixed GAM computational vehicle,[11] and is based on the reduced rank approach with automatic smoothing parameter selection. The SAS proc GAMPL is an alternative implementation. In Python, there is the PyGAM package, with similar features to R's mgcv. Alternatively, there is the InterpretML package, which implements a bagging and boosting approach.[24] There are many alternative packages. Examples include the R packages mboost,[14] which implements a boosting approach; gss, which provides the full spline smoothing methods;[25] VGAM, which provides vector GAMs;[4] and gamlss, which provides generalized additive models for location, scale and shape. BayesX and its R interface provide GAMs and extensions via MCMC and penalized likelihood methods.[26] The INLA software implements a fully Bayesian approach based on Markov random field representations exploiting sparse matrix methods.[13]
As an example of how models can be estimated in practice with software, consider the R package mgcv. Suppose that our R workspace contains vectors y, x and z and we want to estimate the model
y_i = \beta_0 + f_1(x_i) + f_2(z_i) + \epsilon_i, \quad \text{where } \epsilon_i \sim N(0,\sigma^2).
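One way this model might then be fitted, assuming the mgcv package is installed, is:

library(mgcv)              # load the package
b <- gam(y ~ s(x) + s(z))  # fit the model; s() marks a term to be represented by a smooth function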
gam expects a model formula to be supplied, specifying the model structure to fit. The response variable is given to the left of the ~ while the specification of the linear predictor is given to the right. gam sets up bases and penalties for the smooth terms, estimates the model including its smoothing parameters and, in standard R fashion, returns a fitted model object, which can then be interrogated using various helper functions, such as summary, plot, predict, and AIC.
This simple example has used several default settings which it is important to be aware of. For example, a Gaussian distribution and identity link have been assumed, and the smoothing parameter selection criterion was GCV. Also the smooth terms were represented using 'penalized thin plate regression splines', and the basis dimension for each was set to 10 (implying a maximum of 9 degrees of freedom after identifiability constraints have been imposed). A second example illustrates how we can control these things. Suppose that we want to estimate the model
y_i \sim \operatorname{Poi}(\mu_i), \quad \text{where } \log\mu_i = \beta_0 + \beta_1 x_i + f_1(t_i) + f_2(v_i,w_i).
Here f_1 and f_2 are smooth functions, with f_2 depending jointly on v and w. If v and w are not measured on comparable scales, so that an isotropic smooth of v and w is inappropriate, then a scale invariant tensor product smooth is a natural choice, and we might also want a larger than default basis dimension for the smooth of t.
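One possible mgcv call for this model is sketched below; the cubic regression spline basis (bs = "cr"), the basis dimension k = 100 for the smooth of t, and the tensor product smooth te(v, w) are illustrative choices rather than the only reasonable ones.

library(mgcv)
b1 <- gam(y ~ x + s(t, bs = "cr", k = 100) + te(v, w), family = poisson)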
These examples are only intended to give a very basic flavour of the way that GAM software is used; for more detail, refer to the software documentation for the various packages and the references below.[11] [25] [4] [23] [14] [26]
As with any statistical model it is important to check the model assumptions of a GAM. Residual plots should be examined in the same way as for any GLM. That is, deviance residuals (or other standardized residuals) should be examined for patterns that might suggest a substantial violation of the independence or mean-variance assumptions of the model. This will usually involve plotting the standardized residuals against fitted values and covariates to look for mean-variance problems or missing pattern, and may also involve examining correlograms (ACFs) and/or variograms of the residuals to check for violation of independence. If the model mean-variance relationship is correct then scaled residuals should have roughly constant variance. Note that since GLMs and GAMs can be estimated using quasi-likelihood, it follows that details of the distribution of the residuals beyond the mean-variance relationship are of relatively minor importance.
One issue that is more common with GAMs than with other GLMs is a danger of falsely concluding that data are zero inflated. The difficulty arises when data contain many zeroes that can be modelled by a Poisson or binomial with a very low expected value: the flexibility of the GAM structure will often allow representation of a very low mean over some region of covariate space, but the distribution of standardized residuals will fail to look anything like the approximate normality that introductory GLM classes teach us to expect, even if the model is perfectly correct.[27]
The one extra check that GAMs introduce is the need to check that the degrees of freedom chosen are appropriate. This is particularly acute when using methods that do not automatically estimate the smoothness of model components. When using methods with automatic smoothing parameter selection then it is still necessary to check that the choice of basis dimension was not restrictively small, although if the effective degrees of freedom of a term estimate is comfortably below its basis dimension then this is unlikely. In any case, checking
each estimated f_j(x_j) can be based on examining patterns in the model residuals with respect to the corresponding covariate x_j, for example by increasing the basis dimension for a term, refitting, and seeing whether the estimate \hat{f}_j(x_j) changes appreciably.
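With mgcv, for example, such checks might be carried out along the following lines for a fitted model object such as b from the earlier example; gam.check produces residual plots together with a heuristic check of whether each smooth's basis dimension appears adequate.

library(mgcv)
gam.check(b)                            # residual diagnostics and basis dimension checks
acf(residuals(b, type = "deviance"))    # correlogram of deviance residuals (for ordered data)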
When smoothing parameters are estimated as part of model fitting then much of what would traditionally count as model selection has been absorbed into the fitting process: smoothing parameter estimation has already selected between a rich family of models of different functional complexity. However, smoothing parameter estimation does not typically remove a smooth term from the model altogether, because most penalties leave some functions un-penalized (e.g. straight lines are unpenalized by the spline derivative penalty given above). So the question of whether a term should be in the model at all remains. One simple approach to this issue is to add an extra penalty to each smooth term in the GAM, which penalizes the components of the smooth that would otherwise be unpenalized (and only those). Each extra penalty has its own smoothing parameter and estimation then proceeds as before, but now with the possibility that terms will be completely penalized to zero.[28] In high dimensional settings it may make more sense to attempt this task using the lasso or elastic net regularization. Boosting also performs term selection automatically as part of fitting.[14]
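In mgcv, for example, such an extra penalty on the otherwise un-penalized components can be requested with the select argument, allowing whole terms to be shrunk to zero during fitting (variable names as in the earlier example):

library(mgcv)
b2 <- gam(y ~ s(x) + s(z), select = TRUE)  # extra penalties allow whole terms to be penalized to zero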
An alternative is to use traditional stepwise regression methods for model selection. This is also the default method when smoothing parameters are not estimated as part of fitting, in which case each smooth term is usually allowed to take one of a small set of pre-defined smoothness levels within the model, and these are selected between in a stepwise fashion. Stepwise methods operate by iteratively comparing models with or without particular model terms (or possibly with different levels of term complexity), and require measures of model fit or term significance in order to decide which model to select at each stage. For example, we might use p-values for testing each term for equality to zero to decide on candidate terms for removal from a model, and we might compare Akaike information criterion (AIC) values for alternative models.
P-value computation for smooths is not straightforward, because of the effects of penalization, but approximations are available.[1] [11] AIC can be computed in two ways for GAMs. The marginal AIC is based on the marginal likelihood (see above) with the model coefficients integrated out. In this case the AIC penalty is based on the number of smoothing parameters (and any variance parameters) in the model. However, because of the well-known fact that REML is not comparable between models with different fixed effects structures, we can not usually use such an AIC to compare models with different smooth terms (since their un-penalized components act like fixed effects). Basing AIC on the marginal likelihood in which only the penalized effects are integrated out is possible (the number of un-penalized coefficients now gets added to the parameter count for the AIC penalty), but this version of the marginal likelihood suffers from the tendency to oversmooth that provided the original motivation for developing REML. Given these problems GAMs are often compared using the conditional AIC, in which the model likelihood (not marginal likelihood) is used in the AIC, and the parameter count is taken as the effective degrees of freedom of the model.[1] [22]
Naive versions of the conditional AIC have been shown to be much too likely to select larger models in some circumstances, a difficulty attributable to neglect of smoothing parameter uncertainty when computing the effective degrees of freedom;[29] however, correcting the effective degrees of freedom for this problem restores reasonable performance.[3]
Overfitting can be a problem with GAMs,[22] especially if there is un-modelled residual auto-correlation or un-modelled overdispersion. Cross-validation can be used to detect and/or reduce overfitting problems with GAMs (or other statistical methods),[30] and software often allows the level of penalization to be increased to force smoother fits. Estimating very large numbers of smoothing parameters is also likely to be statistically challenging, and there are known tendencies for prediction error criteria (GCV, AIC etc.) to occasionally undersmooth substantially, particularly at moderate sample sizes, with REML being somewhat less problematic in this regard.[31]
Where appropriate, simpler models such as GLMs may be preferable to GAMs unless GAMs improve predictive ability substantially (in validation sets) for the application in question.