The generalized functional linear model (GFLM) is an extension of the generalized linear model (GLM) that allows one to regress univariate responses of various types (continuous or discrete) on functional predictors, which are mostly random trajectories generated by square-integrable stochastic processes. As in the GLM, a link function relates the expected value of the response variable to a linear predictor, which in the GFLM is obtained by forming the scalar product of the random predictor function $X$ with a smooth parameter function $\beta$.
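A key aspect of GFLM is estimation and inference for the smooth parameter function $\beta$, which is usually obtained by dimension reduction of the infinite-dimensional functional predictor. A common method is to expand the predictor function $X$ in an orthonormal basis, with a simultaneous expansion of $\beta$ in the same basis, and then to truncate the expansion so that the contribution of $\beta$ to the linear predictor reduces to a finite number of regression coefficients. Estimates of these coefficients in turn provide an estimate of the parameter function $\beta$.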
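The predictor functions $X(t),\ t\in T$, are typically square-integrable stochastic processes on a real interval $T$, and the unknown smooth parameter function $\beta(t),\ t\in T$, is assumed to be square integrable on $T$. Given a real measure $dw$ on $T$, the linear predictor is

$$\eta=\alpha+\int X^c(t)\,\beta(t)\,dw(t),$$

where $X^c(t)=X(t)-\operatorname{E}(X(t))$ is the centered predictor process and $\alpha$ is a scalar that serves as an intercept.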
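The outcome $Y$ is typically a real-valued random variable that may be continuous or discrete. The conditional distribution of $Y$ given the predictor process is often specified within the exponential family. It is also sufficient to work in a quasi-likelihood setup, where instead of the full distribution one specifies the conditional variance function $\operatorname{Var}(Y\mid X)=\sigma^2(\mu)$ as a function of the conditional mean $\operatorname{E}(Y\mid X)=\mu$.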
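The link function $g$ is a smooth invertible function that relates the conditional mean of the response, $\operatorname{E}(Y\mid X)=\mu$, to the linear predictor $\eta=\alpha+\int X^c(t)\,\beta(t)\,dw(t)$. The relationship is given by

$$\mu=g(\eta).$$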
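In order to implement the necessary dimension reduction, the centered predictor process $X^c(t)$ and the parameter function $\beta(t)$ are expanded as

$$X^c(t)=\sum_{j=1}^{\infty}\xi_j\rho_j(t)\quad\text{and}\quad\beta(t)=\sum_{j=1}^{\infty}\beta_j\rho_j(t),$$

where $\rho_j,\ j=1,2,\ldots$, is an orthonormal basis of the function space $L^2(dw)$, so that $\int_T\rho_j(t)\rho_k(t)\,dw(t)=\delta_{jk}$, with $\delta_{jk}=1$ if $j=k$ and $0$ otherwise.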
The random variables $\xi_j$ are given by $\xi_j=\int X^c(t)\rho_j(t)\,dw(t)$ and the coefficients $\beta_j$ by $\beta_j=\int\beta(t)\rho_j(t)\,dw(t)$, for $j=1,2,\ldots$. They satisfy $\operatorname{E}(\xi_j)=0$ and $\sum_{j=1}^{\infty}\beta_j^2<\infty$, and, writing $\sigma_j^2=\operatorname{Var}(\xi_j)=\operatorname{E}(\xi_j^2)$,

$$\sum_{j=1}^{\infty}\sigma_j^2=\int\operatorname{E}\big(X^c(t)^2\big)\,dw(t)<\infty.$$
From the orthonormality of the basis functions $\rho_j$, it follows immediately that

$$\int X^c(t)\,\beta(t)\,dw(t)=\sum_{j=1}^{\infty}\beta_j\xi_j.$$
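The identity between the integral and the coefficient series can be checked numerically. The following is a minimal sketch, assuming $T=[0,1]$, a measure $dw$ equal to Lebesgue measure approximated by trapezoidal weights, and a cosine basis. The grid size, the choice of basis, the simulated coefficients, and the helper names `trapezoid_weights` and `cosine_basis` are illustrative assumptions, not part of the model.

```python
import numpy as np

def trapezoid_weights(t):
    """Quadrature weights approximating dw(t) = dt on the grid t."""
    w = np.zeros_like(t)
    dt = np.diff(t)
    w[:-1] += dt / 2.0
    w[1:] += dt / 2.0
    return w

def cosine_basis(t, K):
    """First K functions of the cosine basis, orthonormal in L2([0, 1])."""
    rho = np.ones((K, t.size))
    for j in range(1, K):
        rho[j] = np.sqrt(2.0) * np.cos(j * np.pi * t)
    return rho

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 2001)
w = trapezoid_weights(t)
K = 6
rho = cosine_basis(t, K)

# Hypothetical coefficients for one centered curve X^c and for beta.
xi_true = rng.normal(size=K)
beta_coef = rng.normal(size=K)
Xc = xi_true @ rho          # X^c(t) evaluated on the grid
beta_fn = beta_coef @ rho   # beta(t) evaluated on the grid

# Scores recovered by integration: xi_j = int X^c(t) rho_j(t) dw(t).
xi = rho @ (w * Xc)

integral = np.sum(w * Xc * beta_fn)  # int X^c(t) beta(t) dw(t)
series = np.dot(beta_coef, xi)       # sum_j beta_j xi_j
gram = (rho * w) @ rho.T             # numerical version of delta_jk
```

Here `integral` and `series` should agree up to quadrature and rounding error, and `gram` should be close to the identity matrix. With a basis that is not orthonormal under `w`, the two quantities would generally differ.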
The key step is then approximating

$$\eta=\alpha+\int X^c(t)\,\beta(t)\,dw(t)=\alpha+\sum_{j=1}^{\infty}\beta_j\xi_j\quad\text{by}\quad\eta\approx\alpha+\sum_{j=1}^{p}\beta_j\xi_j$$

for a suitably chosen truncation point $p$.
FPCA gives the most parsimonious approximation of the linear predictor for a given number of basis functions, because the eigenfunction basis explains more of the variation than any other set of the same number of basis functions.
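For a differentiable link function with bounded first derivative, the approximation error of the $p$-truncated model, that is, the model whose linear predictor is truncated to the sum of the first $p$ components, is a constant multiple of

$$\operatorname{Var}\Big(\sum_{j=p+1}^{\infty}\beta_j\xi_j\Big)=\operatorname{E}\Big(\Big(\sum_{j=p+1}^{\infty}\beta_j\xi_j\Big)^{2}\Big)=\sum_{j=p+1}^{\infty}\beta_j^2\sigma_j^2,$$

where the last equality holds when the $\xi_j$ are uncorrelated, as they are for the eigenfunction basis.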
A heuristic motivation for the truncation strategy derives from the fact that

$$\operatorname{E}\Big(\Big(\sum_{j=p+1}^{\infty}\beta_j\xi_j\Big)^{2}\Big)\le\sum_{j=p+1}^{\infty}\beta_j^2\,\sum_{j=p+1}^{\infty}\sigma_j^2,$$

which is a consequence of the Cauchy–Schwarz inequality. The right-hand side converges to 0 as $p\to\infty$, since both $\sum_{j=1}^{\infty}\beta_j^2$ and $\sum_{j=1}^{\infty}\sigma_j^2$ are finite.
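The Cauchy–Schwarz step behind this bound can be spelled out as follows. This is a short derivation that uses only $\sigma_j^2=\operatorname{E}(\xi_j^2)$ and the square summability of the $\beta_j$ and $\sigma_j$. The inequality is first applied to the two sequences for each realization of the process, and expectations are taken afterwards:

$$\begin{aligned}
\Big(\sum_{j=p+1}^{\infty}\beta_j\xi_j\Big)^{2}
&\le\Big(\sum_{j=p+1}^{\infty}\beta_j^{2}\Big)\Big(\sum_{j=p+1}^{\infty}\xi_j^{2}\Big),\\
\operatorname{E}\Big(\Big(\sum_{j=p+1}^{\infty}\beta_j\xi_j\Big)^{2}\Big)
&\le\Big(\sum_{j=p+1}^{\infty}\beta_j^{2}\Big)\sum_{j=p+1}^{\infty}\operatorname{E}(\xi_j^{2})
=\Big(\sum_{j=p+1}^{\infty}\beta_j^{2}\Big)\Big(\sum_{j=p+1}^{\infty}\sigma_j^{2}\Big).
\end{aligned}$$

Both factors on the right are tails of convergent series, so each tends to 0 as $p\to\infty$. No assumption that the $\xi_j$ are uncorrelated is needed for this bound.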
For the special case of the eigenfunction basis, the sequence $\sigma_j^2,\ j=1,2,\ldots$, corresponds to the sequence of eigenvalues of the covariance kernel $G(s,t)=\operatorname{Cov}(X(s),X(t)),\ s,t\in T$.
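The correspondence between score variances and eigenvalues of the covariance kernel can be illustrated on a grid. The following is a minimal sketch, assuming curves observed densely on a common grid over $T=[0,1]$, Lebesgue measure approximated by trapezoidal weights, and simulated data. The simulation settings and all variable names are hypothetical, and the discretized eigenproblem is one simple way to approximate FPCA, not a reference implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, p = 300, 101, 3
t = np.linspace(0.0, 1.0, m)
w = np.full(m, t[1] - t[0])   # trapezoidal weights for dw(t) = dt
w[0] /= 2.0
w[-1] /= 2.0

# Simulated curves (illustrative): mean function + 3 random components + noise.
true_sd = np.array([2.0, 1.0, 0.5])
basis = np.vstack([np.sqrt(2.0) * np.sin((k + 1) * np.pi * t) for k in range(3)])
X = 1.0 + t + (rng.normal(size=(n, 3)) * true_sd) @ basis
X = X + 0.1 * rng.normal(size=(n, m))

Xc = X - X.mean(axis=0)       # centered curves X_i^c on the grid
G = Xc.T @ Xc / n             # sample covariance kernel G(s, t)

# Discretized eigenproblem: int G(s, t) rho(s) dw(s) = lambda * rho(t),
# symmetrized with the square roots of the quadrature weights.
sw = np.sqrt(w)
evals, evecs = np.linalg.eigh(sw[:, None] * G * sw[None, :])
order = np.argsort(evals)[::-1][:p]
lam = evals[order]                        # leading p eigenvalues
rho = (evecs[:, order] / sw[:, None]).T   # p x m eigenfunctions on the grid

scores = (Xc * w) @ rho.T        # xi_j^i = int X_i^c(t) rho_j(t) dw(t)
score_var = scores.var(axis=0)   # empirical sigma_j^2 (divisor n)
fve = lam.sum() / np.sum(w * np.diag(G))  # fraction of variance explained
```

By construction, `score_var` coincides with `lam` up to rounding error, which is the sample analogue of $\sigma_j^2$ being the eigenvalues of $G$. The ratio `fve` is one common way to judge how much of the total variation the first $p$ components retain.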
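For data with $n$ observations, setting $\xi_0^i=1$, $\beta_0=\alpha$ and $\xi_j^i=\int X_i^c(t)\rho_j(t)\,dw(t)$, the approximated linear predictors can be written as

$$\eta_i=\sum_{j=0}^{p}\beta_j\xi_j^i,\qquad i=1,2,\ldots,n,$$

which are related to the means through $\mu_i=g(\eta_i)$.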
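The main aim is to estimate the parameter function $\beta$. Once $p$ has been fixed, standard GLM and quasi-likelihood methods can be applied to the $p$-truncated model to estimate the coefficient vector $\boldsymbol\beta^T=(\beta_0,\beta_1,\ldots,\beta_p)$ by solving the estimating equation (score equation)

$$U(\boldsymbol\beta)=0.$$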
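The vector-valued score function turns out to be

$$U(\boldsymbol\beta)=\sum_{i=1}^{n}\frac{(Y_i-\mu_i)\,g'(\eta_i)\,\boldsymbol\xi^{i}}{\sigma^2(\mu_i)},$$

where $\boldsymbol\xi^{i}=(\xi_0^i,\xi_1^i,\ldots,\xi_p^i)^T$. It depends on $\boldsymbol\beta$ through $\mu$ and $\eta$.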
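Just as in GLM, the equation $U(\boldsymbol\beta)=0$ is solved iteratively, for example by Newton–Raphson, Fisher scoring, or iteratively reweighted least squares (IWLS), to obtain the estimate $\hat{\boldsymbol\beta}$ of the regression coefficients. Since $\beta_0=\alpha$ is the intercept, this gives $\hat\alpha=\hat\beta_0$ and the estimate of the parameter function

$$\hat\beta(t)=\sum_{j=1}^{p}\hat\beta_j\rho_j(t).$$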
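The iterative solution of the score equation can be sketched in a few lines. The following is a minimal illustration of Fisher scoring for the $p$-truncated model, assuming the scores $\xi_j^i$ have already been computed and stacked into a matrix whose first column is all ones. It follows the convention $\mu=g(\eta)$ used here. The function names, the simulated data, and the logistic choice of $g$ are assumptions made for the example.

```python
import numpy as np

def score_fn(Xi, y, beta, g, g_prime, var_fn):
    """Score U(beta) = sum_i (Y_i - mu_i) g'(eta_i) xi^i / sigma^2(mu_i)."""
    eta = Xi @ beta
    mu = g(eta)
    return Xi.T @ ((y - mu) * g_prime(eta) / var_fn(mu))

def fisher_scoring(Xi, y, g, g_prime, var_fn, n_iter=50, tol=1e-10):
    """Solve U(beta) = 0 by Fisher scoring, starting from beta = 0."""
    beta = np.zeros(Xi.shape[1])
    for _ in range(n_iter):
        eta = Xi @ beta
        mu = g(eta)
        weights = g_prime(eta) ** 2 / var_fn(mu)
        info = Xi.T @ (Xi * weights[:, None])   # expected information matrix
        step = np.linalg.solve(info, score_fn(Xi, y, beta, g, g_prime, var_fn))
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

# Logistic example: mu = g(eta) = 1 / (1 + exp(-eta)), sigma^2(mu) = mu (1 - mu).
def expit(eta):
    return 1.0 / (1.0 + np.exp(-eta))

def expit_prime(eta):
    mu = expit(eta)
    return mu * (1.0 - mu)

def bernoulli_var(mu):
    return mu * (1.0 - mu)

# Hypothetical scores for p = 2 components, plus the constant column xi_0 = 1.
rng = np.random.default_rng(2)
n = 500
scores = rng.normal(size=(n, 2)) * np.array([1.5, 0.8])
Xi = np.column_stack([np.ones(n), scores])
beta_true = np.array([-0.3, 1.0, -0.7])
y = rng.binomial(1, expit(Xi @ beta_true)).astype(float)

beta_hat = fisher_scoring(Xi, y, expit, expit_prime, bernoulli_var)
U_hat = score_fn(Xi, y, beta_hat, expit, expit_prime, bernoulli_var)
```

At convergence `U_hat` is numerically zero. Given basis functions `rho` evaluated on a grid, the estimated parameter function would then be assembled from `beta_hat[1:]`, with `beta_hat[0]` playing the role of the intercept. Other response types could be handled by passing a different `g`, `g_prime` and `var_fn`, although starting values and step control may need more care than in this sketch.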
Asymptotic results for $p$-truncated models as $p\to\infty$ are available in the literature.
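If the response variable $Y_i$, given $X_i\in L^2(T)$, follows a one-parameter exponential family, then its probability density or probability mass function is

$$f(y_i\mid X_i)=\exp\left(\frac{y_i\theta_i-b(\theta_i)}{\phi}+c(y_i,\phi)\right)$$

for some functions $b$ and $c$, where $\theta_i$ is the canonical parameter and $\phi$ is a dispersion parameter.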
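In the canonical setup,

$$\eta_i=\alpha+\int X_i^c(t)\,\beta(t)\,dw(t)=\theta_i,$$

and from the properties of the exponential family, $\mu_i=b'(\theta_i)$, and so $\mu_i=b'(\eta_i)$. Hence $g=b'$ serves as the link function and is called the canonical link function. The corresponding variance function is

$$\operatorname{Var}(y_i)=\phi\,b''(\theta_i)=\phi\,b''(\eta_i)=\phi\,g'(\eta_i)=\phi\,g'\big(g^{-1}(\mu_i)\big),$$

with $\phi$ the dispersion parameter.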
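As a worked illustration of the canonical setup, the two standard discrete families can be written in exponential-family form. These are textbook identities, stated here with $\phi=1$ and with $\mu=g(\eta)$ as in the convention above.

$$\begin{aligned}
\text{Bernoulli:}\quad & f(y\mid\mu)=\mu^{y}(1-\mu)^{1-y}=\exp\big(y\theta-\log(1+e^{\theta})\big),\qquad \theta=\log\frac{\mu}{1-\mu},\\
& b(\theta)=\log(1+e^{\theta}),\qquad b'(\theta)=\frac{e^{\theta}}{1+e^{\theta}}=\mu,\qquad b''(\theta)=\mu(1-\mu);\\[4pt]
\text{Poisson:}\quad & f(y\mid\mu)=\frac{e^{-\mu}\mu^{y}}{y!}=\exp\big(y\theta-e^{\theta}-\log y!\big),\qquad \theta=\log\mu,\\
& b(\theta)=e^{\theta},\qquad b'(\theta)=e^{\theta}=\mu,\qquad b''(\theta)=\mu.
\end{aligned}$$

In both cases $g=b'$ maps the linear predictor to the mean, and $\phi\,b''$ gives the variance as a function of $\mu$.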
Functional linear regression, one of the most useful tools of functional data analysis, is an example of GFLM where the response variable is continuous and is often assumed to have a Normal distribution. The variance function is a constant function and the link function is identity. Under these assumptions the GFLM reduces to the FLR,
$$\mu=\operatorname{E}(Y\mid X)=\eta=\alpha+\int X^c(t)\,\beta(t)\,dw(t).$$
Without the normality assumption, the constant variance function motivates the use of quasi-normal techniques.
When the response variable has binary outcomes, i.e., 0 or 1, the distribution is usually chosen as Bernoulli, and then $\mu_i=P(Y_i=1\mid X_i)$. The variance function for binary data is $\operatorname{Var}(Y_i)=\phi\,\mu_i(1-\mu_i)$, where $\phi$ is the dispersion parameter.
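Another special case of GFLM occurs when the outcomes are counts, so that the distribution of the responses is assumed to be Poisson. The mean $\mu_i$ is typically related to the linear predictor $\eta_i$ through the log link, $\mu_i=\exp(\eta_i)$, which is the canonical choice for the Poisson family. The variance function is $\operatorname{Var}(Y_i)=\phi\,\mu_i$, where $\phi$ is the dispersion parameter.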
Extensions of GFLM have been proposed for cases with multiple predictor functions.[2] Another generalization is Semi Parametric Quasi-likelihood Regression (SPQR),[1] which considers the situation where the link and the variance functions are unknown and are estimated non-parametrically from the data. This situation can also be handled by single or multiple index models, using for example Sliced Inverse Regression (SIR).
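Another extension in this domain is the Functional Generalized Additive Model (FGAM),[3] a generalization of the generalized additive model (GAM) in which

$$g^{-1}\big(\operatorname{E}(Y\mid X)\big)=\alpha+\sum_{j=1}^{p}f_j(\xi_j),$$

where the $\xi_j$ are the expansion coefficients of the random predictor function $X$ and each $f_j$ is an unknown smooth function that has to be estimated, subject to $\operatorname{E}(f_j(\xi_j))=0$.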
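In general, estimation in FGAM requires combining IWLS with backfitting. However, if the expansion coefficients are obtained as functional principal components, then in some cases (e.g. a Gaussian predictor function $X$) they are independent, in which case backfitting is not needed and the unknown functions $f_j$ can be estimated with standard smoothing methods.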
A popular data set that has been used for a number of analyses in the domain of functional data analysis consists of the number of eggs laid daily until death by 1000 Mediterranean fruit flies (medflies for short); see http://anson.ucdavis.edu/~mueller/data/medfly1000.html and http://anson.ucdavis.edu/~mueller/data/medfly1000.txt. The plot shows the egg-laying trajectories in the first 25 days of life of about 600 female medflies (those that have at least 20 remaining eggs in their lifetime). The red curves belong to flies that will lay fewer than the median number of remaining eggs after age 25, while the blue curves belong to flies that will lay more than the median number. A related problem, classifying medflies as long-lived or short-lived with the initial egg-laying trajectories as predictors and the subsequent longevity of the flies as response, has been studied with the GFLM.[1]