Leverage (statistics)
In statistics, and in particular in regression analysis, leverage is a measure of how far the independent variable values of an observation are from those of the other observations. High-leverage points, if any, are outliers with respect to the independent variables. That is, high-leverage points have no neighboring points in $\mathbb{R}^{p}$ space, where $p$ is the number of independent variables in a regression model. This makes the fitted model likely to pass close to a high-leverage observation.[1] Hence high-leverage points have the potential to cause large changes in the parameter estimates when they are deleted, i.e., to be influential points. Although an influential point will typically have high leverage, a high-leverage point is not necessarily an influential point. Leverage values are typically defined as the diagonal elements of the hat matrix.
Definition and interpretations
Consider the linear regression model $y_i = \boldsymbol{x}_i^\top \boldsymbol{\beta} + \varepsilon_i$, $i = 1, 2, \ldots, n$. That is, $\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$, where $X$ is the $n \times p$ design matrix whose rows correspond to the observations and whose columns correspond to the independent or explanatory variables. The leverage score for the $i$-th independent observation $\boldsymbol{x}_i$ is given as $h_{ii} = \left[H\right]_{ii} = \boldsymbol{x}_i^\top \left(X^\top X\right)^{-1} \boldsymbol{x}_i$, the $i$-th diagonal element of the ortho-projection matrix (a.k.a. hat matrix) $H = X\left(X^\top X\right)^{-1}X^\top$.
Thus the $i$-th leverage score can be viewed as the 'weighted' distance between $\boldsymbol{x}_i$ and the mean of the $\boldsymbol{x}_i$'s (see its relation with Mahalanobis distance). It can also be interpreted as the degree by which the $i$-th measured (dependent) value (i.e., $y_i$) influences the $i$-th fitted (predicted) value (i.e., $\widehat{y}_i$): mathematically, $h_{ii} = \dfrac{\partial \widehat{y}_i}{\partial y_i}$.
Hence, the leverage score is also known as the observation self-sensitivity or self-influence.[2] Using the fact that $\widehat{\boldsymbol{y}} = H\boldsymbol{y}$ (i.e., the prediction $\widehat{\boldsymbol{y}}$ is the ortho-projection of $\boldsymbol{y}$ onto the range space of $X$) in the above expression, we get $h_{ii} = \left[H\right]_{ii}$. Note that this leverage depends on the values of the explanatory variables ($X$) of all observations but not on any of the values of the dependent variables ($y_i$).
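The definition above can be sketched numerically. The following is a minimal illustration, assuming NumPy is available; the helper name `leverage_scores` and the five data points are made up for the example and are not part of the article.

```python
import numpy as np

def leverage_scores(X):
    """Return the diagonal of the hat matrix H = X (X^T X)^{-1} X^T."""
    X = np.asarray(X, dtype=float)
    # Solve (X^T X) B = X^T rather than forming an explicit inverse.
    B = np.linalg.solve(X.T @ X, X.T)          # shape (p, n)
    # h_ii = x_i^T (X^T X)^{-1} x_i, i.e. the i-th diagonal entry of X @ B.
    return np.einsum("ij,ji->i", X, B)

# Hypothetical data: an intercept column plus one predictor with an extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 10.0])
X = np.column_stack([np.ones_like(x), x])
h = leverage_scores(X)                         # approximately [0.38, 0.28, 0.22, 0.20, 0.92]

# Self-sensitivity: shifting y_5 by delta shifts the fitted value of the
# same observation by h_55 * delta, whatever the values of y are.
H = X @ np.linalg.solve(X.T @ X, X.T)
y = np.array([1.1, 1.9, 3.2, 3.9, 10.5])       # arbitrary illustrative responses
delta = 1.0
y_shift = y.copy()
y_shift[4] += delta
sensitivity = ((H @ y_shift)[4] - (H @ y)[4]) / delta
print(h, sensitivity)
```

The last observation sits far from the other predictor values and so has by far the largest leverage; `sensitivity` agrees with `h[4]` up to floating-point error, and `y` never enters the computation of `h`.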
Properties
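- The leverage $h_{ii}$ is a number between 0 and 1: $0 \leq h_{ii} \leq 1$.
Proof: Note that $H$ is an idempotent matrix ($H^2 = H$) and symmetric ($h_{ij} = h_{ji}$). Thus, by using the fact that $\left[H^2\right]_{ii} = \left[H\right]_{ii}$, we have $h_{ii} = h_{ii}^2 + \sum_{j \neq i} h_{ij}^2$. Since we know that $\sum_{j \neq i} h_{ij}^2 \geq 0$, we have $h_{ii} \geq h_{ii}^2 \implies 0 \leq h_{ii} \leq 1$.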
- The sum of the leverages is equal to the number of parameters $p$ in $\boldsymbol{\beta}$ (including the intercept): $\sum_{i=1}^{n} h_{ii} = p$.
Proof: $\sum_{i=1}^{n} h_{ii} = \operatorname{Tr}(H) = \operatorname{Tr}\left(X\left(X^\top X\right)^{-1}X^\top\right) = \operatorname{Tr}\left(X^\top X\left(X^\top X\right)^{-1}\right) = \operatorname{Tr}(I_p) = p$.
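Both properties can be checked numerically. The sketch below assumes NumPy and uses a randomly generated design matrix (the seed, sizes, and variable names are arbitrary choices for illustration).

```python
import numpy as np

rng = np.random.default_rng(0)                 # arbitrary seed, illustrative data only
n, p = 30, 4
# Hypothetical design: an intercept column plus three random predictors.
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

H = X @ np.linalg.solve(X.T @ X, X.T)          # hat matrix
h = np.diag(H)                                 # leverages

idempotent = np.allclose(H @ H, H)             # H^2 = H
symmetric = np.allclose(H, H.T)                # h_ij = h_ji
# Identity used in the proof: h_ii = h_ii^2 + sum_{j != i} h_ij^2,
# which is the i-th row sum of the squared entries of H.
row_identity = np.allclose(h, (H ** 2).sum(axis=1))
in_unit_interval = bool(np.all((h >= 0.0) & (h <= 1.0)))
trace_equals_p = bool(np.isclose(h.sum(), p))  # sum of leverages = p
print(idempotent, symmetric, row_identity, in_unit_interval, trace_equals_p)
```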
Determination of outliers in X using leverages
A large leverage $h_{ii}$ corresponds to an $\boldsymbol{x}_i$ that is extreme. A common rule is to identify $\boldsymbol{x}_i$ whose leverage value $h_{ii}$ is more than twice the mean leverage $\bar{h} = \dfrac{1}{n}\sum_{i=1}^{n} h_{ii} = \dfrac{p}{n}$ (see property 2 above). That is, if $h_{ii} > 2\dfrac{p}{n}$, $\boldsymbol{x}_i$ shall be considered an outlier. Some statisticians prefer the threshold of $3p/n$ instead of $2p/n$.
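As a rough sketch of this rule of thumb, assuming NumPy, the function below flags observations whose leverage exceeds a chosen multiple of $p/n$. The name `high_leverage_mask`, its `multiple` parameter, and the data are hypothetical.

```python
import numpy as np

def high_leverage_mask(X, multiple=2.0):
    """Flag rows of X whose leverage exceeds multiple * p / n."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    h = np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))
    return h > multiple * p / n, h

# Hypothetical data: nine evenly spaced predictor values and one extreme value.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0])
X = np.column_stack([np.ones_like(x), x])      # n = 10, p = 2

flag2, h = high_leverage_mask(X, multiple=2.0) # cutoff 2p/n = 0.4
flag3, _ = high_leverage_mask(X, multiple=3.0) # cutoff 3p/n = 0.6
print(np.flatnonzero(flag2), np.flatnonzero(flag3))
```

In this example only the last observation is flagged, under either cutoff. With very small $n$ the $3p/n$ cutoff can exceed 1, in which case it can never flag anything, so the choice of multiple matters.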
Relation to Mahalanobis distance
Leverage is closely related to the Mahalanobis distance (proof[3]). Specifically, for some $n \times p$ matrix $X$, the squared Mahalanobis distance of $\boldsymbol{x}_i$ (where $\boldsymbol{x}_i^\top$ is the $i$-th row of $X$) from the vector of means $\widehat{\boldsymbol{\mu}} = \dfrac{1}{n}\sum_{i=1}^{n} \boldsymbol{x}_i$ of length $p$ is $D^2(\boldsymbol{x}_i) = (\boldsymbol{x}_i - \widehat{\boldsymbol{\mu}})^\top S^{-1} (\boldsymbol{x}_i - \widehat{\boldsymbol{\mu}})$, where $S$ is the estimated covariance matrix of the $\boldsymbol{x}_i$'s (the sample covariance with an $n-1$ denominator). This is related to the leverage $h_{ii}$ of the hat matrix of $X$ after appending a column vector of 1's to it. The relationship between the two is:
$D^2(\boldsymbol{x}_i) = (n-1)\left(h_{ii} - \tfrac{1}{n}\right)$
This relationship enables us to decompose leverage into meaningful components so that some sources of high leverage can be investigated analytically.[4]
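The relationship can be verified numerically. This is a minimal sketch assuming NumPy, with random illustrative data; it takes $S$ to be the sample covariance with an $n-1$ denominator, which is what `np.cov` computes by default.

```python
import numpy as np

rng = np.random.default_rng(1)                 # arbitrary seed, illustrative data only
n, p = 25, 3
X = rng.normal(size=(n, p))                    # predictors only, no intercept column

# Squared Mahalanobis distance of each row from the column means.
mu = X.mean(axis=0)
S = np.cov(X, rowvar=False)                    # sample covariance, n - 1 denominator
diff = X - mu
d2 = np.einsum("ij,jk,ik->i", diff, np.linalg.inv(S), diff)

# Leverage from the hat matrix of X with a column of 1's appended.
X1 = np.column_stack([np.ones(n), X])
h = np.diag(X1 @ np.linalg.solve(X1.T @ X1, X1.T))

relation_holds = np.allclose(d2, (n - 1) * (h - 1.0 / n))
print(relation_holds)
```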
Relation to influence functions
In a regression context, we combine leverage and influence functions to compute the degree to which estimated coefficients would change if we removed a single data point. Denoting the regression residuals as $\widehat{e}_i = y_i - \boldsymbol{x}_i^\top \widehat{\boldsymbol{\beta}}$, one can compare the estimated coefficient $\widehat{\boldsymbol{\beta}}$ to the leave-one-out estimated coefficient $\widehat{\boldsymbol{\beta}}^{(-i)}$ using the formula[5][6]
$\widehat{\boldsymbol{\beta}} - \widehat{\boldsymbol{\beta}}^{(-i)} = \dfrac{(X^\top X)^{-1}\boldsymbol{x}_i \widehat{e}_i}{1 - h_{ii}}$.
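The leave-one-out formula can be compared against an explicit refit without the observation. The sketch below assumes NumPy; the simulated data, the coefficients used to generate it, and the choice of which observation to drop are all arbitrary.

```python
import numpy as np

rng = np.random.default_rng(2)                 # arbitrary seed, illustrative data only
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(scale=0.5, size=n)

XtX_inv = np.linalg.inv(X.T @ X)
beta = XtX_inv @ X.T @ y                       # full-sample OLS estimate
resid = y - X @ beta                           # residuals e_i
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)    # leverages h_ii

i = 0                                          # observation to drop (arbitrary)
# Closed-form change in the coefficients: (X^T X)^{-1} x_i e_i / (1 - h_ii).
shortcut = (XtX_inv @ X[i]) * resid[i] / (1.0 - h[i])

# Direct computation: refit without observation i and take the difference.
keep = np.arange(n) != i
beta_loo, *_ = np.linalg.lstsq(X[keep], y[keep], rcond=None)
direct = beta - beta_loo
print(shortcut, direct)
```

The two vectors agree up to floating-point error, so all $n$ leave-one-out estimates can be obtained from a single fit rather than $n$ refits.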
Young (2019) uses a version of this formula after residualizing controls.[7] To gain intuition for this formula, note that $\dfrac{\partial \widehat{\boldsymbol{\beta}}}{\partial y_i} = (X^\top X)^{-1}\boldsymbol{x}_i$ captures the potential for an observation to affect the regression parameters, and therefore $(X^\top X)^{-1}\boldsymbol{x}_i \widehat{e}_i$ captures the actual influence of that observation's deviation from its fitted value on the regression parameters. The formula then divides by $(1 - h_{ii})$ to account for the fact that we remove the observation rather than adjusting its value, reflecting the fact that removal changes the distribution of covariates more when applied to high-leverage observations (i.e., those with outlier covariate values). Similar formulas arise when applying general formulas for statistical influence functions in the regression context.[8][9]
Effect on residual variance
If we are in an ordinary least squares setting with fixed $X$ and homoscedastic regression errors $\varepsilon_i$, $\boldsymbol{y} = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$; $\operatorname{Var}(\boldsymbol{\varepsilon}) = \sigma^2 I$, then the $i$-th regression residual, $e_i = y_i - \widehat{y}_i$, has variance $\operatorname{Var}(e_i) = (1 - h_{ii})\sigma^2$.
In other words, an observation's leverage score determines the degree of noise in the model's misprediction of that observation, with higher leverage leading to less noise. This follows from the fact that $I - H$ is idempotent and symmetric and $\widehat{\boldsymbol{y}} = H\boldsymbol{y}$; hence, $\operatorname{Var}(\boldsymbol{e}) = \operatorname{Var}((I - H)\boldsymbol{y}) = (I - H)\operatorname{Var}(\boldsymbol{y})(I - H)^\top = \sigma^2 (I - H)^2 = \sigma^2 (I - H)$.
The corresponding studentized residual (the residual adjusted for its observation-specific estimated residual variance) is then $t_i = \dfrac{e_i}{\widehat{\sigma}\sqrt{1 - h_{ii}}}$, where $\widehat{\sigma}$ is an appropriate estimate of $\sigma$.
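The variance result and the studentized residual can be sketched as follows, assuming NumPy. The article leaves the choice of $\widehat{\sigma}$ open; this example assumes the usual residual-based estimate with $n - p$ degrees of freedom, and the simulated data are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)                 # arbitrary seed, illustrative data only
n, p = 15, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

H = X @ np.linalg.solve(X.T @ X, X.T)
h = np.diag(H)
e = y - H @ y                                  # residuals e = (I - H) y

# Var(e) = (I - H) Var(y) (I - H)^T with Var(y) = sigma^2 I;
# its diagonal should equal (1 - h_ii) sigma^2.
sigma2 = 1.0
M = np.eye(n) - H
cov_e = M @ (sigma2 * np.eye(n)) @ M.T
variance_matches = np.allclose(np.diag(cov_e), (1.0 - h) * sigma2)

# Studentized residuals, ASSUMING sigma_hat^2 = e^T e / (n - p).
sigma_hat = np.sqrt(e @ e / (n - p))
t = e / (sigma_hat * np.sqrt(1.0 - h))
print(variance_matches, t)
```

Dividing by $\sqrt{1 - h_{ii}}$ inflates the residuals of high-leverage observations, whose raw residuals are otherwise pulled toward zero by the fit.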
Partial leverage
Partial leverage (PL) is a measure of the contribution of the individual independent variables to the total leverage of each observation. That is, PL is a measure of how $h_{ii}$ changes as a variable is added to the regression model. It is computed as:
$\left(\mathrm{PL}_j\right)_i = \dfrac{\left(X_{j\bullet[j]}\right)_i^2}{\sum_{k=1}^{n}\left(X_{j\bullet[j]}\right)_k^2}$
where $j$ is the index of the independent variable, $i$ is the index of the observation, and $X_{j\bullet[j]}$ are the residuals from regressing $X_j$ against the remaining independent variables. Note that the partial leverage is the leverage of the $i$-th point in the partial regression plot for the $j$-th variable. Data points with large partial leverage for an independent variable can exert undue influence on the selection of that variable in automatic regression model building procedures.
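A minimal sketch of this computation, assuming NumPy: the helper names `hat_diag` and `partial_leverage` and the random data are hypothetical. It also checks that, under the definition above, the partial leverage equals the increase in $h_{ii}$ when variable $j$ is added to a model containing the other variables.

```python
import numpy as np

def hat_diag(X):
    """Leverages: diagonal of X (X^T X)^{-1} X^T."""
    return np.einsum("ij,ji->i", X, np.linalg.solve(X.T @ X, X.T))

def partial_leverage(X, j):
    """Partial leverage of each observation for column j of X."""
    others = np.delete(X, j, axis=1)
    # Residuals from regressing column j on the remaining columns.
    coef, *_ = np.linalg.lstsq(others, X[:, j], rcond=None)
    r = X[:, j] - others @ coef
    return r ** 2 / np.sum(r ** 2)

rng = np.random.default_rng(4)                 # arbitrary seed, illustrative data only
n = 20
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])

j = 2                                          # variable of interest (arbitrary)
pl = partial_leverage(X, j)
# Change in leverage when variable j is added to the model without it.
gain = hat_diag(X) - hat_diag(np.delete(X, j, axis=1))
print(np.allclose(pl, gain), pl.sum())
```

By construction the partial leverages for a given variable sum to 1 across observations.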
Software implementations
Many programs and statistics packages, including those available for R and Python, provide implementations of leverage.
Notes and References
- Everitt, B. S. (2002). Cambridge Dictionary of Statistics. Cambridge University Press. ISBN 0-521-81099-X.
- Cardinali, C. (June 2013). "Data Assimilation: Observation influence diagnostic of a data assimilation system". Web site.
- "Prove the relation between Mahalanobis distance and Leverage?". https://stats.stackexchange.com/q/200566
- Kim, M. G. (2004). "Sources of high leverage in linear regression model". Journal of Applied Mathematics and Computing, Vol 16, 509–513. arXiv:2006.04024 [math.ST].
- Miller, Rupert G. (September 1974). "An Unbalanced Jackknife". Annals of Statistics. 2 (5): 880–891. doi:10.1214/aos/1176342811. ISSN 0090-5364.
- Hayashi, Fumio (2000). Econometrics. Princeton University Press. p. 21.
- Young, Alwyn (2019). "Channeling Fisher: Randomization Tests and the Statistical Insignificance of Seemingly Significant Experimental Results". The Quarterly Journal of Economics. 134 (2): 567. doi:10.1093/qje/qjy029.
- Chatterjee, Samprit; Hadi, Ali S. (August 1986). "Influential Observations, High Leverage Points, and Outliers in Linear Regression". Statistical Science. 1 (3): 379–393. doi:10.1214/ss/1177013622. ISSN 0883-4237.
- "regression - Influence functions and OLS". Cross Validated. Retrieved 2020-12-06.