Sliced inverse regression explained
Sliced inverse regression (SIR) is a tool for dimensionality reduction in the field of multivariate statistics.[1]
In statistics, regression analysis is a method of studying the relationship between a response variable $y$ and its input variable $\underline{x}$, which is a $p$-dimensional vector. There are several approaches in the category of regression. For example, parametric methods include multiple linear regression, and non-parametric methods include local smoothing.
With high-dimensional data (as $p$ grows), the number of observations needed to use local smoothing methods grows exponentially, so reducing the number of dimensions can make the problem tractable. Dimensionality reduction aims to achieve this by retaining only the most important directions of the data. SIR uses the inverse regression curve, $E(\underline{x} \mid y)$, to perform a weighted principal component analysis.
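As a rough illustration of the exponential growth mentioned above, the following sketch counts how many grid points are needed to keep a fixed resolution along every axis. This is a grid-counting heuristic only, not a formal sample-size result, and the function name `points_needed` is hypothetical.

```python
def points_needed(points_per_axis, p):
    """Number of grid cells when each of p axes is cut into points_per_axis parts."""
    return points_per_axis ** p

# Ten points per axis is an arbitrary choice made for this illustration.
for p in (1, 2, 5, 10):
    print(p, points_needed(10, p))
```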
Model
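Given a response variable $Y$ and a (random) vector $X \in \mathbb{R}^p$ of explanatory variables, SIR is based on the model

$$Y = f(\beta_1^\top X, \ldots, \beta_K^\top X, \varepsilon) \qquad (1)$$

where $\beta_1, \ldots, \beta_K$ are unknown projection vectors, $K$ is an unknown number smaller than $p$, $f$ is an unknown function on $\mathbb{R}^{K+1}$, as it only depends on the $K$ projections and the error term, and $\varepsilon$ is a random variable representing error with $E[\varepsilon \mid X] = 0$ and a finite variance of $\sigma^2$. The model describes an ideal solution, where $Y$ depends on $X \in \mathbb{R}^p$ only through a $K$-dimensional subspace; i.e., one can reduce the dimension of the explanatory variables from $p$ to a smaller number $K$ without losing any information.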
An equivalent version of $(1)$ is: the conditional distribution of $Y$ given $X$ depends on $X$ only through the $K$-dimensional random vector $(\beta_1^\top X, \ldots, \beta_K^\top X)$. It is assumed that this reduced vector is as informative as the original $X$ in explaining $Y$.
The unknown $\beta_k$ are called the effective dimension reducing directions (EDR-directions). The space that is spanned by these vectors is called the effective dimension reducing space (EDR-space).
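To make the model concrete, here is a minimal simulation sketch. Everything specific in it is an assumption chosen for illustration and does not come from the source: the standard normal design, the two projection vectors, the link function `link`, and the helper name `simulate_model`.

```python
import numpy as np

def simulate_model(n, p, betas, f, sigma=0.5, seed=0):
    """Draw (X, Y) with Y = f(beta_1^T X, ..., beta_K^T X, eps)."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, p))       # assumed design: standard normal
    eps = sigma * rng.standard_normal(n)  # error with mean 0 and variance sigma^2
    projections = X @ betas               # shape (n, K): column k holds beta_k^T x_i
    return X, f(projections, eps)

def link(u, eps):
    """Hypothetical link function of K = 2 projections plus the error."""
    return u[:, 0] / (0.5 + (u[:, 1] + 1.5) ** 2) + eps

# p = 5 explanatory variables, K = 2 EDR-directions (the columns of betas).
betas = np.array([[1.0, 0.0],
                  [1.0, 0.0],
                  [0.0, 1.0],
                  [0.0, 0.0],
                  [0.0, 0.0]])
X, Y = simulate_model(n=500, p=5, betas=betas, f=link)
```

In this sketch the last two coordinates of `X` never enter `Y`, so the five-dimensional input can be reduced to the two projections without losing information about the response.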
Relevant linear algebra background
Given $\underline{a}_1, \ldots, \underline{a}_r \in \mathbb{R}^n$, the set $V := L(\underline{a}_1, \ldots, \underline{a}_r)$ of all linear combinations of these vectors is called a linear subspace and is therefore a vector space. The equation says that the vectors $\underline{a}_1, \ldots, \underline{a}_r$ span $V$, but the vectors that span the space $V$ are not unique.
The dimension of $V \subseteq \mathbb{R}^n$ is equal to the maximum number of linearly independent vectors in $V$. A set of $n$ linearly independent vectors of $\mathbb{R}^n$ makes up a basis of $\mathbb{R}^n$. The dimension of a vector space is unique, but the basis itself is not. Several bases can span the same space. Linearly dependent vectors can still span a space, but they do not form a basis of it; for example, vectors that all lie on one straight line through the origin span only that line.
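The points about spans, bases and dependent vectors can be checked numerically. The sketch below uses small example matrices invented for this illustration and compares ranks with `numpy.linalg.matrix_rank`.

```python
import numpy as np

# Columns of A and B are two different pairs of vectors in R^3.
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]]).T
B = np.array([[1.0, 1.0, 0.0],
              [1.0, -1.0, 0.0]]).T

dim_A = np.linalg.matrix_rank(A)
dim_B = np.linalg.matrix_rank(B)
# Both pairs span the same plane if stacking them does not raise the rank.
same_span = np.linalg.matrix_rank(np.hstack([A, B])) == dim_A == dim_B

# Linearly dependent vectors: the second column is twice the first,
# so together they span only a straight line.
C = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]]).T
dim_C = np.linalg.matrix_rank(C)
```

Here `A` and `B` are different bases of the same two-dimensional subspace, while the columns of `C` span a subspace of dimension one.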
Inverse regression
Computing the inverse regression curve (IR) means that instead of looking for $E[Y \mid X = x]$, which is a function of the $p$-dimensional argument $x$, one looks for $E[X \mid Y = y]$, which is a curve in $\mathbb{R}^p$ consisting of $p$ one-dimensional regressions.
The center of the inverse regression curve is located at $E[E[X \mid Y]] = E[X]$. Therefore, the centered inverse regression curve is

$$E[X \mid Y = y] - E[X],$$

which is a $p$-dimensional curve in $\mathbb{R}^p$.
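The inverse regression curve can be approximated by averaging $X$ within slices of the sorted response. The sketch below does this on simulated data and checks that the weighted average of the slice means equals the overall mean, the sample analogue of $E[E[X \mid Y]] = E[X]$. The data-generating choices and the number of slices are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, S = 2000, 3, 10

# Simulated data with a non-zero mean; Y depends on the first coordinate only.
X = rng.standard_normal((n, p)) + np.array([1.0, -2.0, 0.5])
Y = X[:, 0] + 0.1 * rng.standard_normal(n)

# Split the indices of the sorted responses into S slices of equal size.
slices = np.array_split(np.argsort(Y), S)

# Row s approximates E[X | Y in slice s]; each column is one of the
# p one-dimensional regressions of a coordinate of X on Y.
slice_means = np.array([X[idx].mean(axis=0) for idx in slices])
weights = np.array([len(idx) for idx in slices]) / n

center = weights @ slice_means                 # sample analogue of E[E[X|Y]]
centered_curve = slice_means - X.mean(axis=0)  # centered inverse regression curve
```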
Inverse regression versus dimension reduction
The centered inverse regression curve lies on a $K$-dimensional subspace spanned by the vectors $\Sigma_{xx}\beta_k$. This is a connection between the model and inverse regression.
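The result requires a linearity condition on the design, stated in Li (1991): for every $b \in \mathbb{R}^p$, the conditional expectation $E[b^\top X \mid \beta_1^\top X, \ldots, \beta_K^\top X]$ is linear in $\beta_1^\top X, \ldots, \beta_K^\top X$. It holds, for example, when $X$ is normally distributed. Given this condition and $(1)$, the centered inverse regression curve $E[X \mid Y = y] - E[X]$ is contained in the linear subspace spanned by $\Sigma_{xx}\beta_k$ $(k = 1, \ldots, K)$, where $\Sigma_{xx} = \operatorname{Cov}(X)$.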
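A numerical check can make this statement tangible. The sketch below simulates normally distributed $X$ with a known covariance matrix and a single direction ($K = 1$), estimates the centered inverse regression curve by slicing, and measures how much of it falls outside the line spanned by $\Sigma_{xx}\beta$ compared with the line spanned by $\beta$ itself. The covariance matrix, the cubic link and the helper `residual_ratio` are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, S = 20000, 4, 10

# Assumed covariance matrix of X (positive definite) and one EDR-direction.
Sigma = np.array([[1.0, 0.5, 0.3, 0.0],
                  [0.5, 1.0, 0.2, 0.0],
                  [0.3, 0.2, 1.0, 0.0],
                  [0.0, 0.0, 0.0, 1.0]])
beta = np.array([1.0, 0.0, 0.0, 0.0])

L = np.linalg.cholesky(Sigma)
X = rng.standard_normal((n, p)) @ L.T   # rows are draws from N(0, Sigma)
Y = (X @ beta) ** 3 + 0.5 * rng.standard_normal(n)

# Slice means of X minus the overall mean estimate the centered curve.
slices = np.array_split(np.argsort(Y), S)
curve = np.array([X[idx].mean(axis=0) for idx in slices]) - X.mean(axis=0)

def residual_ratio(points, v):
    """Share of the points' norm that lies outside the line spanned by v."""
    P = np.outer(v, v) / (v @ v)        # orthogonal projector onto span(v)
    return np.linalg.norm(points - points @ P) / np.linalg.norm(points)

ratio_sigma_beta = residual_ratio(curve, Sigma @ beta)
ratio_beta = residual_ratio(curve, beta)
```

In this setup `ratio_sigma_beta` should be close to zero up to sampling noise, while `ratio_beta` stays clearly larger, which is why the estimated directions have to be transformed back with the covariance matrix.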
Estimation of the EDR-directions
With the theoretical properties established, the aim now is to estimate the EDR-directions. For that purpose, a weighted principal component analysis is applied to the sample means within slices of $Y$, after $X$ has been standardized to $Z = \Sigma_{xx}^{-1/2}\{X - E(X)\}$. By the result above, the IR-curve $m_1(y) = E[Z \mid Y = y]$ lies in the space spanned by $(\eta_1, \ldots, \eta_K)$, where $\eta_k = \Sigma_{xx}^{1/2}\beta_k$. As a consequence, the covariance matrix $\operatorname{Cov}[E(Z \mid Y)]$ is degenerate in any direction orthogonal to the $\eta_k$. Therefore, the eigenvectors $\eta_k$ $(k = 1, \ldots, K)$ associated with the largest $K$ eigenvalues are the standardized EDR-directions.
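The degeneracy claim can be spelled out in one short step. The following derivation is a sketch that assumes only what the paragraph states, namely that $m_1(y) = E[Z \mid Y = y]$ lies in the span of the $\eta_k$ for every $y$; the vector $b$ is introduced here for the argument.

```latex
% Assumption: m_1(y) lies in span(\eta_1, ..., \eta_K) for every y.
\begin{aligned}
b^\top \eta_k = 0 \quad (k = 1, \ldots, K)
  &\;\Longrightarrow\; b^\top m_1(y) = 0 \quad \text{for all } y \\
  &\;\Longrightarrow\; b^\top \operatorname{Cov}[m_1(Y)]\, b
     = \operatorname{Var}\big(b^\top m_1(Y)\big) = 0 .
\end{aligned}
```

Since a covariance matrix is positive semi-definite, $b^\top \operatorname{Cov}[m_1(Y)]\, b = 0$ implies $\operatorname{Cov}[m_1(Y)]\, b = 0$. The matrix therefore has rank at most $K$, and its eigenvectors with non-zero eigenvalues lie in the span of the $\eta_k$.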
Algorithm
The algorithm to estimate the EDR-directions via SIR is as follows.
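1. Let $\Sigma_{xx}$ be the covariance matrix of $X$. Standardize $X$ to $Z = \Sigma_{xx}^{-1/2}\{X - E(X)\}$. ($(1)$ can also be rewritten as $Y = f(\eta_1^\top Z, \ldots, \eta_K^\top Z, \varepsilon)$, where $\eta_k = \Sigma_{xx}^{1/2}\beta_k$ $\forall k$.)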
2. Divide the range of $y_i$ into $S$ non-overlapping slices $H_s$ $(s = 1, \ldots, S)$. $n_s$ is the number of observations within each slice and $I_{H_s}$ is the indicator function for the slice: $n_s = \sum_{i=1}^{n} I_{H_s}(y_i)$.
3. Compute the mean of the $z_i$ within each slice, which is a crude estimate $\hat{m}_1$ of the inverse regression curve $m_1$: $\bar{z}_s = n_s^{-1}\sum_{i=1}^{n} z_i I_{H_s}(y_i)$.
4. Calculate the estimate for $\operatorname{Cov}\{m_1(y)\}$: $\hat{V} = n^{-1}\sum_{s=1}^{S} n_s \bar{z}_s \bar{z}_s^\top$.
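5. Identify the eigenvalues $\hat{\lambda}_i$ and the eigenvectors $\hat{\eta}_i$ of $\hat{V}$; the eigenvectors belonging to the $K$ largest eigenvalues are the estimated standardized EDR-directions.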
6. Transform the standardized EDR-directions back to the original scale. The estimates for the EDR-directions are given by $\hat{\beta}_i = \hat{\Sigma}_{xx}^{-1/2}\hat{\eta}_i$ (which are not necessarily orthogonal).
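The six steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not a reference implementation: the function name `sir` and its parameters are hypothetical, the sample covariance matrix is assumed to be positive definite, and the slices are formed by splitting the sorted responses into groups of nearly equal size, which is one of several possible slicing schemes.

```python
import numpy as np

def sir(X, y, n_slices=10, n_directions=1):
    """Sketch of sliced inverse regression; returns (directions, eigenvalues)."""
    n, p = X.shape
    # Step 1: standardize X with the inverse square root of the sample covariance.
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    cov_inv_sqrt = evecs @ np.diag(evals ** -0.5) @ evecs.T
    Z = (X - X.mean(axis=0)) @ cov_inv_sqrt
    # Step 2: slice the range of y via the order of the responses.
    slices = np.array_split(np.argsort(y), n_slices)
    # Steps 3 and 4: slice means of Z and their weighted covariance estimate.
    V = np.zeros((p, p))
    for idx in slices:
        z_bar = Z[idx].mean(axis=0)
        V += len(idx) * np.outer(z_bar, z_bar)
    V /= n
    # Step 5: eigenvectors belonging to the largest eigenvalues of V.
    lam, eta = np.linalg.eigh(V)
    top = np.argsort(lam)[::-1][:n_directions]
    # Step 6: transform back to the original scale.
    return cov_inv_sqrt @ eta[:, top], lam[top]

# Usage on simulated single-index data (all choices here are illustrative).
rng = np.random.default_rng(0)
n, p = 2000, 5
beta_true = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) / np.sqrt(2.0)
X = rng.standard_normal((n, p))
index = X @ beta_true
y = index + index ** 3 + 0.5 * rng.standard_normal(n)

beta_hat, lam = sir(X, y, n_slices=10, n_directions=1)
direction = beta_hat[:, 0] / np.linalg.norm(beta_hat[:, 0])
cosine = abs(direction @ beta_true)  # near 1 if the direction is recovered
```

The direction is identified only up to sign and scale, so the comparison uses the absolute cosine between the normalized estimate and the true vector.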
References
- Li, K-C. (1991) "Sliced Inverse Regression for Dimension Reduction", Journal of the American Statistical Association, 86, 316–327.
- Cook, R.D. and Weisberg, S. (1991) "Sliced Inverse Regression for Dimension Reduction: Comment", Journal of the American Statistical Association, 86, 328–332.
- Härdle, W. and Simar, L. (2003) Applied Multivariate Statistical Analysis, Springer Verlag.
- Brandt, A. (2005) Kurzfassung zur Vorlesung Mathematik II im Sommersemester 2005 [Summary of the lecture Mathematics II, summer semester 2005].
Notes and References
1. Li, Ker-Chau (1991). "Sliced Inverse Regression for Dimension Reduction". Journal of the American Statistical Association. 86 (414): 316–327. doi:10.2307/2290563. ISSN 0162-1459.