In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.
In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors. One typically uses only a subset of all the principal components for regression, making PCR a kind of regularized procedure and also a type of shrinkage estimator.
Often the principal components with higher variances (the ones based on eigenvectors corresponding to the higher eigenvalues of the sample variance-covariance matrix of the explanatory variables) are selected as regressors. However, for the purpose of predicting the outcome, the principal components with low variances may also be important, in some cases even more important.[1]
One major use of PCR lies in overcoming the multicollinearity problem which arises when two or more of the explanatory variables are close to being collinear.[2] PCR can aptly deal with such situations by excluding some of the low-variance principal components in the regression step. In addition, by usually regressing on only a subset of all the principal components, PCR can result in dimension reduction through substantially lowering the effective number of parameters characterizing the underlying model. This can be particularly useful in settings with high-dimensional covariates. Also, through appropriate selection of the principal components to be used for regression, PCR can lead to efficient prediction of the outcome based on the assumed model.
The PCR method may be broadly divided into three major steps:
1. Perform PCA on the observed data matrix of the explanatory variables to obtain the principal components, and then (usually) select, based on some appropriate criteria, a subset of the principal components so obtained for use in the regression step.
2. Regress the observed vector of outcomes on the selected principal components as covariates, using ordinary least squares, to obtain a vector of estimated regression coefficients whose dimension equals the number of selected components.
3. Transform this vector back to the scale of the actual covariates, using the selected PCA loadings (the eigenvectors corresponding to the selected principal components), to obtain the final PCR estimator, whose dimension equals the total number of covariates.
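As a concrete illustration of these three steps, the following is a minimal NumPy sketch. The function name pcr_fit is hypothetical, and the sketch assumes that the data matrix X and the response y are already centered NumPy arrays, as described in the pre-processing step below.

```python
import numpy as np

def pcr_fit(X, y, k):
    """Principal component regression using the first k principal components.

    Illustrative sketch: assumes X (n x p) and y (n,) are already centered.
    """
    # Step 1: PCA of X via the singular value decomposition X = U diag(d) V^T.
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T                 # first k principal component directions (p x k)
    W_k = X @ V_k                  # first k principal components (n x k)

    # Step 2: ordinary least squares regression of y on the k principal components.
    gamma_k, *_ = np.linalg.lstsq(W_k, y, rcond=None)

    # Step 3: transform back to the scale of the original covariates.
    beta_k = V_k @ gamma_k         # PCR estimator (p,)
    return beta_k
```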
Data representation: Let Y_{n \times 1} = \left(y_1, \ldots, y_n\right)^T denote the vector of observed outcomes and X_{n \times p} = \left(\mathbf{x}_1, \ldots, \mathbf{x}_n\right)^T denote the corresponding data matrix of observed covariates, where n and p denote the size of the observed sample and the number of covariates respectively, with n \geq p. Each of the n rows of X denotes one set of observations for the p-dimensional covariate, and the respective entry of Y denotes the corresponding observed outcome.
Data pre-processing: Assume that Y and each of the p columns of X have already been centered so that all of them have zero empirical means. This centering step is crucial (at least for the columns of X), since PCR involves the use of PCA on X and PCA is sensitive to the centering of the data.
Underlying model: Following centering, the standard Gauss–Markov linear regression model for Y on X can be represented as:

Y = X\boldsymbol{\beta} + \boldsymbol{\varepsilon},

where \boldsymbol{\beta} \in \mathbb{R}^p denotes the unknown parameter vector of regression coefficients and \boldsymbol{\varepsilon} denotes the vector of random errors with \operatorname{E}\left(\boldsymbol{\varepsilon}\right) = 0 and \operatorname{Var}\left(\boldsymbol{\varepsilon}\right) = \sigma^2 I_{n \times n} for some unknown variance parameter \sigma^2 > 0.
Objective: The primary goal is to obtain an efficient estimator \widehat{\boldsymbol\beta} of the parameter \boldsymbol\beta based on the data. One frequently used approach is ordinary least squares regression which, assuming X has full column rank, gives the unbiased estimator

\widehat{\boldsymbol\beta}_{\mathrm{ols}} = (X^TX)^{-1}X^TY

of \boldsymbol{\beta}. PCR is another technique that may be used for the same purpose of estimating \boldsymbol{\beta}.
PCA step: PCR starts by performing a PCA on the centered data matrix X. For this, let X = U\Delta V^T denote the singular value decomposition of X, where \Delta_{p \times p} = \operatorname{diag}\left[\delta_1, \ldots, \delta_p\right] with \delta_1 \geq \cdots \geq \delta_p \geq 0 denoting the non-negative singular values of X, while the columns of U_{n \times p} = [\mathbf{u}_1, \ldots, \mathbf{u}_p] and V_{p \times p} = [\mathbf{v}_1, \ldots, \mathbf{v}_p] are orthonormal sets of vectors denoting the left and right singular vectors of X respectively.
The principal components: V\Lambda V^T gives the spectral decomposition of X^TX, where \Lambda_{p \times p} = \operatorname{diag}\left[\lambda_1, \ldots, \lambda_p\right] = \operatorname{diag}\left[\delta_1^2, \ldots, \delta_p^2\right] = \Delta^2 with \lambda_1 \geq \cdots \geq \lambda_p \geq 0 denoting the non-negative eigenvalues (also known as the principal values) of X^TX, while the columns of V denote the corresponding orthonormal set of eigenvectors. Then X\mathbf{v}_j and \mathbf{v}_j respectively denote the j-th principal component and the j-th principal component direction (or PCA loading) corresponding to the j-th largest principal value \lambda_j, for each j \in \{1, \ldots, p\}.
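The relationship between the singular value decomposition of X and the spectral decomposition of X^TX can be checked numerically. The following sketch uses an arbitrary simulated X (an assumption for illustration) to verify that \lambda_j = \delta_j^2 and that the j-th principal component X\mathbf{v}_j has squared norm \lambda_j.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)                           # center the columns

# SVD of X: X = U diag(d) V^T, singular values in decreasing order.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

# Spectral decomposition of X^T X: eigenvalues are the squared singular values.
eigvals = np.linalg.eigvalsh(X.T @ X)[::-1]   # sorted in decreasing order
assert np.allclose(eigvals, d**2)

# The j-th principal component is X v_j, with squared norm lambda_j = delta_j^2.
pcs = X @ Vt.T
assert np.allclose(np.sum(pcs**2, axis=0), d**2)
```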
Derived covariates: For any k \in \{1, \ldots, p\}, let V_k denote the p \times k matrix with orthonormal columns consisting of the first k columns of V. Let W_k = XV_k = [X\mathbf{v}_1, \ldots, X\mathbf{v}_k] denote the n \times k matrix having the first k principal components as its columns. W_k may be viewed as the data matrix obtained by using the transformed covariates \mathbf{x}_i^{k} = V_k^T\mathbf{x}_i \in \mathbb{R}^k instead of the original covariates \mathbf{x}_i \in \mathbb{R}^p, for all 1 \leq i \leq n.
The PCR estimator: Let \widehat{\gamma}_k = \left(W_k^TW_k\right)^{-1}W_k^TY \in \mathbb{R}^k denote the vector of estimated regression coefficients obtained by ordinary least squares regression of the response vector Y on the data matrix W_k. Then, for any k \in \{1, \ldots, p\}, the final PCR estimator of \boldsymbol{\beta} based on using the first k principal components is given by

\widehat{\boldsymbol{\beta}}_k = V_k\widehat{\gamma}_k \in \mathbb{R}^p.
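Since W_k = XV_k = U_k\operatorname{diag}(\delta_1, \ldots, \delta_k), the regression step has the closed form \widehat{\gamma}_{k,j} = \mathbf{u}_j^TY/\delta_j, so the PCR estimator can be computed directly from the SVD. A minimal sketch of this computation (the function name is illustrative; it assumes centered data and that the first k singular values are strictly positive):

```python
import numpy as np

def pcr_estimator(X, y, k):
    """PCR estimator beta_k = V_k gamma_k computed from the SVD of X.

    Illustrative sketch: assumes X and y are centered and that the first k
    singular values of X are strictly positive.
    """
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    # Regression on the first k principal components: gamma_j = u_j^T y / delta_j.
    gamma_k = (U[:, :k].T @ y) / d[:k]
    # Transform back to the original covariate scale.
    return Vt[:k].T @ gamma_k
```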
The fitting process for obtaining the PCR estimator involves regressing the response vector on the derived data matrix W_k, which has orthogonal columns for any k \in \{1, \ldots, p\}, since the principal components are mutually orthogonal. Thus, in the regression step, performing a multiple linear regression jointly on the k selected principal components as covariates is equivalent to carrying out k independent simple linear regressions (or univariate regressions) separately on each of the k selected principal components.
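Because W_k^TW_k is diagonal, the joint least squares fit on the k components coincides with k separate univariate fits. A small numerical check of this equivalence (simulated data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, k = 200, 6, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n); y -= y.mean()

U, d, Vt = np.linalg.svd(X, full_matrices=False)
W_k = X @ Vt[:k].T                              # first k principal components

# Joint multiple regression on the k components ...
gamma_joint, *_ = np.linalg.lstsq(W_k, y, rcond=None)
# ... equals k separate univariate regressions, since the columns are orthogonal.
gamma_univ = np.array([(W_k[:, j] @ y) / (W_k[:, j] @ W_k[:, j]) for j in range(k)])
assert np.allclose(gamma_joint, gamma_univ)
```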
When all the principal components are selected for regression, so that k = p, the PCR estimator is equivalent to the ordinary least squares estimator: \widehat{\boldsymbol{\beta}}_p = \widehat{\boldsymbol{\beta}}_{\mathrm{ols}}. This is easily seen from the fact that W_p = XV_p = XV and that V is an orthogonal matrix.
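A quick numerical confirmation of this equivalence, assuming a simulated full-column-rank X (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 150, 4
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n); y -= y.mean()

beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)   # ordinary least squares

U, d, Vt = np.linalg.svd(X, full_matrices=False)
beta_p = Vt.T @ ((U.T @ y) / d)                    # PCR using all p components
assert np.allclose(beta_ols, beta_p)
```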
For any k \in \{1, \ldots, p\}, the variance of \widehat{\boldsymbol{\beta}}_k is given by:

\operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) = \sigma^2\, V_k\left(W_k^TW_k\right)^{-1}V_k^T = \sigma^2\, V_k\operatorname{diag}\left(\lambda_1^{-1}, \ldots, \lambda_k^{-1}\right)V_k^T = \sigma^2 \sum_{j=1}^{k} \frac{\mathbf{v}_j\mathbf{v}_j^T}{\lambda_j}.
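The chain of equalities uses W_k^TW_k = \operatorname{diag}(\lambda_1, \ldots, \lambda_k). A numerical spot-check of the first and last expressions (simulated data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, k = 120, 5, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V_k, lam = Vt[:k].T, d[:k] ** 2
W_k = X @ V_k

sigma2 = 1.0   # error variance, taken as 1 for the check
var_direct = sigma2 * V_k @ np.linalg.inv(W_k.T @ W_k) @ V_k.T
var_sum = sigma2 * sum(np.outer(V_k[:, j], V_k[:, j]) / lam[j] for j in range(k))
assert np.allclose(var_direct, var_sum)
```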
In particular:
\operatorname{Var}(\widehat{\boldsymbol{\beta}}_p) = \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) = \sigma^2 \sum_{j=1}^{p} \frac{\mathbf{v}_j\mathbf{v}_j^T}{\lambda_j}.
Hence, for all k \in \{1, \ldots, p-1\}:

\operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) = \sigma^2 \sum_{j=k+1}^{p} \frac{\mathbf{v}_j\mathbf{v}_j^T}{\lambda_j}.
Thus, for all k \in \{1, \ldots, p\}:

\operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) \succeq 0,

where A \succeq 0 indicates that a square symmetric matrix A is non-negative definite. Consequently, any given linear form of the PCR estimator has a lower variance than the same linear form of the ordinary least squares estimator.
Under multicollinearity, two or more of the covariates are highly correlated, so that one can be linearly predicted from the others with a non-trivial degree of accuracy. Consequently, the columns of the data matrix X that correspond to the observations for these covariates tend to become linearly dependent, and therefore X tends to become rank deficient, losing its full column rank structure. More quantitatively, one or more of the smaller eigenvalues of X^TX get very close to, or become exactly equal to, 0 in such situations. The variance expressions above show that these small eigenvalues have the greatest inflation effect on the variance of the least squares estimator, destabilizing it significantly when they are close to 0. This issue can be effectively addressed by using a PCR estimator obtained by excluding the principal components corresponding to these small eigenvalues.
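The following sketch (simulated, near-collinear covariates; illustrative only) shows how a tiny eigenvalue of X^TX inflates the total variance of the OLS estimator, and how dropping the corresponding component removes that inflation:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200
z = rng.standard_normal(n)
# Two nearly collinear covariates plus one independent covariate.
X = np.column_stack([z, z + 1e-3 * rng.standard_normal(n), rng.standard_normal(n)])
X -= X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
lam = d ** 2
print("eigenvalues of X^T X:", lam)                 # the smallest one is tiny

sigma2 = 1.0
var_ols = sigma2 * np.sum(1 / lam)                  # trace of the OLS covariance matrix
var_pcr = sigma2 * np.sum(1 / lam[:2])              # PCR with k = 2: tiny eigenvalue excluded
print("total variance, OLS vs PCR(k=2):", var_ols, var_pcr)
```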
PCR may also be used for performing dimension reduction. To see this, let L_k denote any p \times k matrix with orthonormal columns, for any k \in \{1, \ldots, p\}. Suppose now that we want to approximate each of the covariate observations \mathbf{x}_i through the rank-k linear transformation L_k\mathbf{z}_i for some \mathbf{z}_i \in \mathbb{R}^k (1 \leq i \leq n).
Then, it can be shown that

\sum_{i=1}^{n} \left\|\mathbf{x}_i - L_k\mathbf{z}_i\right\|^2

is minimized at L_k = V_k, the matrix with the first k principal component directions as columns, and \mathbf{z}_i = \mathbf{x}_i^{k} = V_k^T\mathbf{x}_i, the corresponding k-dimensional derived covariates. Thus, the k-dimensional derived covariates used in PCR provide the best rank-k linear approximation of the original data matrix X.
The corresponding reconstruction error is given by:

\sum_{i=1}^{n} \left\|\mathbf{x}_i - V_k\mathbf{x}_i^{k}\right\|^2 = \begin{cases} \sum_{j=k+1}^{p} \lambda_j & 1 \leqslant k < p \\ 0 & k = p \end{cases}
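A numerical spot-check of this identity (simulated data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, k = 100, 6, 2
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T
X_approx = X @ V_k @ V_k.T                            # rank-k reconstruction V_k x_i^k for each row

recon_error = np.sum((X - X_approx) ** 2)
assert np.allclose(recon_error, np.sum(d[k:] ** 2))   # equals the sum of the trailing eigenvalues
```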
Thus, any desired dimension reduction may be achieved by choosing k, the number of principal components to be used, through appropriate thresholding on the cumulative sum of the eigenvalues of X^TX. Since the smaller eigenvalues contribute little to this sum, the corresponding principal components may continue to be dropped as long as the desired threshold is not exceeded.
Since the PCR estimator typically uses only a subset of all the principal components for regression, it can be viewed as some sort of a regularized procedure. More specifically, for any 1 \leqslant k < p, the PCR estimator \widehat{\boldsymbol{\beta}}_k denotes the regularized solution to the following constrained minimization problem:

\min_{\boldsymbol{\beta}^* \in \mathbb{R}^p} \left\|Y - X\boldsymbol{\beta}^*\right\|^2 \quad \text{subject to} \quad \boldsymbol{\beta}^* \in \operatorname{span}\left(\mathbf{v}_1, \ldots, \mathbf{v}_k\right).
The constraint may be equivalently written as:

V_{(p-k)}^T\boldsymbol{\beta}^* = 0,

where:

V_{(p-k)} = \left[\mathbf{v}_{k+1}, \ldots, \mathbf{v}_p\right]_{p \times (p-k)}.
Thus, when only a proper subset of all the principal components are selected for regression, the PCR estimator so obtained is based on a hard form of regularization that constrains the resulting solution to the column space of the selected principal component directions, and consequently restricts it to be orthogonal to the excluded directions.
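This hard regularization can be checked numerically: when X has full column rank, the PCR estimator lies exactly in the span of the selected directions (so it is orthogonal to the excluded ones) and coincides with the projection of the OLS estimator onto that span. A sketch under that assumption (simulated data, illustrative only):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, k = 150, 5, 3
X = rng.standard_normal((n, p)); X -= X.mean(axis=0)
y = rng.standard_normal(n); y -= y.mean()

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V_k, V_rest = Vt[:k].T, Vt[k:].T

beta_k = V_k @ ((U[:, :k].T @ y) / d[:k])      # PCR estimator with k components
beta_ols = Vt.T @ ((U.T @ y) / d)              # OLS estimator (full column rank)

# beta_k is orthogonal to the excluded directions and equals the projection
# of the OLS estimator onto span(V_k).
assert np.allclose(V_rest.T @ beta_k, 0)
assert np.allclose(beta_k, V_k @ V_k.T @ beta_ols)
```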
Given the constrained minimization problem as defined above, consider the following generalized version of it:

\min_{\boldsymbol{\beta}^* \in \mathbb{R}^p} \left\|Y - X\boldsymbol{\beta}^*\right\|^2 \quad \text{subject to} \quad L_{(p-k)}^T\boldsymbol{\beta}^* = 0,

where L_{(p-k)} denotes any arbitrary p \times (p-k) matrix of full column rank, for any 1 \leqslant k < p.
Let \widehat{\boldsymbol{\beta}}_L denote the corresponding solution:

\widehat{\boldsymbol{\beta}}_L = \operatorname*{arg\,min}_{\boldsymbol{\beta}^* \in \mathbb{R}^p} \left\|Y - X\boldsymbol{\beta}^*\right\|^2 \quad \text{subject to} \quad L_{(p-k)}^T\boldsymbol{\beta}^* = 0.

Then the optimal choice of the restriction matrix L_{(p-k)}, for which the corresponding estimator \widehat{\boldsymbol{\beta}}_L achieves the minimum prediction error, is given by:

L_{(p-k)}^* = V_{(p-k)}\Lambda_{(p-k)}^{1/2},
where

\Lambda_{(p-k)}^{1/2} = \operatorname{diag}\left(\lambda_{k+1}^{1/2}, \ldots, \lambda_p^{1/2}\right).
Quite clearly, the resulting optimal estimator \widehat{\boldsymbol{\beta}}_{L^*} is then simply given by the PCR estimator \widehat{\boldsymbol{\beta}}_k based on the first k principal components.
Since the ordinary least squares estimator is unbiased for \boldsymbol{\beta}, we have

\operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) = \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}),

where MSE denotes the mean squared error. Now, if for some k \in \{1, \ldots, p\} we have V_{(p-k)}^T\boldsymbol{\beta} = 0, then the corresponding \widehat{\boldsymbol{\beta}}_k is also unbiased for \boldsymbol{\beta} and therefore

\operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) = \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_k).
We have already seen that

\forall\, j \in \{1, \ldots, p\}: \quad \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_j) \succeq 0,

which then implies:

\operatorname{MSE}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_k) \succeq 0

for that particular k. Thus, in that case, the corresponding \widehat{\boldsymbol{\beta}}_k would be a more efficient estimator of \boldsymbol{\beta} than \widehat{\boldsymbol{\beta}}_{\mathrm{ols}}, based on using the mean squared error as the performance criterion. In addition, any given linear form of the corresponding \widehat{\boldsymbol{\beta}}_k would also have a lower mean squared error than the same linear form of \widehat{\boldsymbol{\beta}}_{\mathrm{ols}}.
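A small simulation illustrates the point: when the true coefficient vector lies in the span of the leading principal component directions and the design is nearly collinear, PCR can have a much lower mean squared error than OLS. This is a hedged sketch with simulated data and arbitrary parameter choices, not a result from the article:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, k, sigma, n_rep = 80, 5, 4, 1.0, 2000

# Fixed design with two nearly collinear columns (one tiny eigenvalue).
z = rng.standard_normal(n)
X = np.column_stack([z, z + 0.01 * rng.standard_normal(n),
                     rng.standard_normal((n, 3))])
X -= X.mean(axis=0)

U, d, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T
beta_true = V_k @ np.array([1.0, -0.5, 0.25, 0.1])   # lies in span(V_k): V_(p-k)^T beta = 0

mse_ols = mse_pcr = 0.0
for _ in range(n_rep):
    y = X @ beta_true + sigma * rng.standard_normal(n)
    beta_ols = Vt.T @ ((U.T @ y) / d)                 # OLS via the SVD
    beta_pcr = V_k @ ((U[:, :k].T @ y) / d[:k])       # PCR with k components
    mse_ols += np.sum((beta_ols - beta_true) ** 2) / n_rep
    mse_pcr += np.sum((beta_pcr - beta_true) ** 2) / n_rep

print("MSE of OLS:", mse_ols, " MSE of PCR:", mse_pcr)   # PCR is typically much smaller here
```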
Now suppose that, for a given k \in \{1, \ldots, p\}, V_{(p-k)}^T\boldsymbol{\beta} \neq 0. Then the corresponding \widehat{\boldsymbol{\beta}}_k is biased for \boldsymbol{\beta}. However, since

\forall\, k \in \{1, \ldots, p\}: \quad \operatorname{Var}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{Var}(\widehat{\boldsymbol{\beta}}_k) \succeq 0,

it is still possible that

\operatorname{MSE}(\widehat{\boldsymbol{\beta}}_{\mathrm{ols}}) - \operatorname{MSE}(\widehat{\boldsymbol{\beta}}_k) \succeq 0,

especially if k is such that the excluded principal component directions correspond to the smaller eigenvalues, so that the reduction in variance outweighs the bias introduced.
In order to ensure efficient estimation and prediction performance of PCR as an estimator of \boldsymbol{\beta}, one proposed guideline for selecting the principal components to be used for regression is to drop the j-th principal component if and only if

\lambda_j < p\sigma^2/\boldsymbol{\beta}^T\boldsymbol{\beta}.

Practical implementation of this guideline requires estimates of the unknown model parameters \sigma^2 and \boldsymbol{\beta}, which may be obtained, for example, from the ordinary least squares fit of the full model.
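A hedged sketch of this guideline using plug-in OLS estimates for \sigma^2 and \boldsymbol{\beta} (the plug-in choice and the function name are illustrative assumptions, not a prescription from the article):

```python
import numpy as np

def select_components(X, y):
    """Keep component j iff lambda_j >= p * sigma2_hat / (beta_hat' beta_hat).

    Illustrative sketch: assumes centered X (full column rank) and centered y.
    """
    n, p = X.shape
    U, d, Vt = np.linalg.svd(X, full_matrices=False)
    lam = d ** 2
    beta_ols = Vt.T @ ((U.T @ y) / d)                      # full OLS fit as plug-in estimate
    sigma2_hat = np.sum((y - X @ beta_ols) ** 2) / (n - p) # residual variance estimate
    threshold = p * sigma2_hat / (beta_ols @ beta_ols)
    return np.flatnonzero(lam >= threshold)                # indices of retained components
```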
Unlike the criteria based on the cumulative sum of the eigenvalues of X^TX, which are probably better suited for addressing multicollinearity and for performing dimension reduction, this criterion attempts to improve the prediction and estimation efficiency of the PCR estimator by involving both the outcome and the covariates in the selection of the principal components. Alternative approaches with a similar goal include selecting the principal components through cross-validation or Mallows's C_p criterion.
In general, PCR is essentially a shrinkage estimator that usually retains the high-variance principal components (corresponding to the higher eigenvalues of X^TX) as covariates in the model and discards the remaining low-variance components (corresponding to the lower eigenvalues of X^TX). Thus, it exerts a discrete shrinkage effect on the low-variance components, nullifying their contribution completely in the original model. In contrast, the ridge regression estimator exerts a smooth shrinkage effect through the regularization parameter (or tuning parameter) inherently involved in its construction.
In addition, the principal components are obtained solely from the observations for the explanatory variables, through the SVD of X (equivalently, the spectral decomposition of X^TX), without any reference to the outcome. Therefore, the resulting PCR estimator obtained from using these principal components as covariates need not necessarily have satisfactory predictive performance for the outcome. A somewhat similar estimator that tries to address this issue through its very construction is the partial least squares (PLS) estimator: like PCR, PLS uses derived covariates of lower dimension, but unlike PCR, the derived covariates for PLS are obtained using both the outcome and the covariates.
In 2006, a variant of the classical PCR known as supervised PCR was proposed.[5] In a spirit similar to that of PLS, it attempts to obtain derived covariates of lower dimension based on a criterion that involves both the outcome as well as the covariates. The method starts by performing a set of p simple linear regressions (or univariate regressions), wherein the outcome vector is regressed separately on each of the p covariates taken one at a time. Then, for some m \in \{1, \ldots, p\}, the m covariates found to be most strongly associated with the outcome are selected, and a conventional PCR, as described earlier, is performed on the n \times m data matrix corresponding to the observations for these selected covariates. The number of covariates used, m \in \{1, \ldots, p\}, and the subsequent number of principal components used, k \in \{1, \ldots, m\}, are usually selected by cross-validation.
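A minimal sketch of this supervised screening step followed by conventional PCR, in plain NumPy. The function name is hypothetical, and the association score used for screening (absolute univariate coefficient scaled by the column norm) is one reasonable choice for illustration rather than the specific criterion of the original proposal:

```python
import numpy as np

def supervised_pcr(X, y, m, k):
    """Screen the m covariates most associated with y, then run PCR with k components.

    Illustrative sketch: X (n x p) and y (n,) are assumed centered, with k <= m <= p.
    """
    # Univariate screening: correlation-like score for each covariate.
    scores = np.abs(X.T @ y) / np.linalg.norm(X, axis=0)
    selected = np.argsort(scores)[::-1][:m]                # indices of the m best covariates

    # Conventional PCR on the reduced n x m data matrix.
    X_m = X[:, selected]
    U, d, Vt = np.linalg.svd(X_m, full_matrices=False)
    beta_m = Vt[:k].T @ ((U[:, :k].T @ y) / d[:k])         # PCR estimator in the reduced space

    beta = np.zeros(X.shape[1])                            # embed back into the full covariate space
    beta[selected] = beta_m
    return beta
```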
The classical PCR method as described above is based on classical PCA and considers a linear regression model for predicting the outcome based on the covariates. However, it can be easily generalized to a kernel machine setting whereby the regression function need not necessarily be linear in the covariates, but instead it can belong to the Reproducing Kernel Hilbert Space associated with any arbitrary (possibly non-linear), symmetric positive-definite kernel. The linear regression model turns out to be a special case of this setting when the kernel function is chosen to be the linear kernel.
In general, under the kernel machine setting, the vector of covariates is first mapped into a high-dimensional (potentially infinite-dimensional) feature space characterized by the kernel function chosen. The mapping so obtained is known as the feature map and each of its coordinates, also known as the feature elements, corresponds to one feature (may be linear or non-linear) of the covariates. The regression function is then assumed to be a linear combination of these feature elements. Thus, the underlying regression model in the kernel machine setting is essentially a linear regression model with the understanding that instead of the original set of covariates, the predictors are now given by the vector (potentially infinite-dimensional) of feature elements obtained by transforming the actual covariates using the feature map.
However, the kernel trick actually enables us to operate in the feature space without ever explicitly computing the feature map. It turns out that it is sufficient to compute only the pairwise inner products among the feature maps for the observed covariate vectors, and these inner products are simply given by the values of the kernel function evaluated at the corresponding pairs of covariate vectors. The pairwise inner products so obtained may therefore be represented in the form of an n \times n symmetric non-negative definite matrix, also known as the kernel matrix.
PCR in the kernel machine setting can now be implemented by first appropriately centering this kernel matrix (K, say) with respect to the feature space and then performing a kernel PCA on the centered kernel matrix (K', say) whereby an eigendecomposition of K' is obtained. Kernel PCR then proceeds by (usually) selecting a subset of all the eigenvectors so obtained and then performing a standard linear regression of the outcome vector on these selected eigenvectors. The eigenvectors to be used for regression are usually selected using cross-validation. The estimated regression coefficients (having the same dimension as the number of selected eigenvectors) along with the corresponding selected eigenvectors are then used for predicting the outcome for a future observation. In machine learning, this technique is also known as spectral regression.
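A minimal sketch of this procedure for the in-sample fit, written in plain NumPy with a Gaussian (RBF) kernel. The kernel choice, its bandwidth gamma, and the function name are illustrative assumptions rather than specifics from the article:

```python
import numpy as np

def kernel_pcr_fitted_values(X, y, k, gamma=1.0):
    """Kernel PCR in-sample fit: center the kernel matrix with respect to the
    feature space, take the top k eigenvectors of the centered kernel matrix,
    and regress the (centered) outcome on them. Illustrative sketch only."""
    n = X.shape[0]

    # Gaussian (RBF) kernel matrix K.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-gamma * sq)

    # Double-centering in feature space: K' = J K J with J = I - 11^T / n.
    J = np.eye(n) - np.ones((n, n)) / n
    Kc = J @ K @ J

    # Kernel PCA: top k eigenvectors of K'.
    eigvals, eigvecs = np.linalg.eigh(Kc)
    E = eigvecs[:, np.argsort(eigvals)[::-1][:k]]          # n x k

    # Standard linear regression of the centered outcome on the selected eigenvectors.
    yc = y - y.mean()
    coef, *_ = np.linalg.lstsq(E, yc, rcond=None)
    return y.mean() + E @ coef                             # fitted values
```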
Clearly, kernel PCR has a discrete shrinkage effect on the eigenvectors of K', quite similar to the discrete shrinkage effect of classical PCR on the principal components, as discussed earlier. However, the feature map associated with the chosen kernel could potentially be infinite-dimensional, and hence the corresponding principal components and principal component directions could be infinite-dimensional as well. Therefore, these quantities are often practically intractable under the kernel machine setting. Kernel PCR essentially works around this problem by considering an equivalent dual formulation based on using the spectral decomposition of the associated kernel matrix. Under the linear regression model (which corresponds to choosing the kernel function as the linear kernel), this amounts to considering a spectral decomposition of the corresponding
n \times n kernel matrix XX^T, with the outcome vector then regressed on a selected subset of the eigenvectors of XX^T.