Influential observation explained

In statistics, an influential observation is an observation for a statistical calculation whose deletion from the dataset would noticeably change the result of the calculation.[1] In particular, in regression analysis an influential observation is one whose deletion has a large effect on the parameter estimates.

Assessment

Various methods have been proposed for measuring influence.[2] [3] Assume an estimated regression $\mathbf{y} = \mathbf{X}\mathbf{b} + \mathbf{e}$, where $\mathbf{y}$ is an n×1 column vector for the response variable, $\mathbf{X}$ is the n×k design matrix of explanatory variables (including a constant), $\mathbf{e}$ is the n×1 residual vector, and $\mathbf{b}$ is a k×1 vector of estimates of some population parameter $\beta \in \mathbb{R}^{k}$. Also define $\mathbf{H} \equiv \mathbf{X}\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{X}^{T}$, the projection matrix of $\mathbf{X}$. Then we have the following measures of influence:

$\text{DFBETA}_{i} \equiv \mathbf{b} - \mathbf{b}_{(-i)} = \frac{\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{x}_{i}^{T} e_{i}}{1 - h_{ii}}$, where $\mathbf{b}_{(-i)}$ denotes the coefficients estimated with the i-th row $\mathbf{x}_{i}$ of $\mathbf{X}$ deleted, and $h_{ii} = \mathbf{x}_{i}\left(\mathbf{X}^{T}\mathbf{X}\right)^{-1}\mathbf{x}_{i}^{T}$ denotes the i-th value on the main diagonal of $\mathbf{H}$. Thus DFBETA measures the difference in each parameter estimate with and without the i-th observation. There is a DFBETA for each variable and each observation (if there are n observations and k variables there are n·k DFBETAs).[4] The following table shows DFBETAs for the third dataset from Anscombe's quartet (bottom left chart in the figure):
x y DFBETA_intercept DFBETA_slope
10.0 7.46 -0.005 -0.044
8.0 6.77 -0.037 0.019
13.0 12.74 -357.910 525.268
9.0 7.11 -0.033 0
11.0 7.81 0.049 -0.117
14.0 8.84 0.490 -0.667
6.0 6.08 0.027 -0.021
4.0 5.39 0.241 -0.209
12.0 8.15 0.137 -0.231
7.0 6.42 -0.020 0.013
5.0 5.73 0.105 -0.087
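
The DFBETA formula can be checked numerically. The following is a minimal sketch, assuming NumPy and using the x and y columns of the table above; the function names `dfbeta_closed_form` and `dfbeta_by_refit` are illustrative and do not come from any library. It evaluates the closed-form expression and compares it with a brute-force refit that drops one observation at a time. The values it produces are raw coefficient differences b − b(−i); they are not guaranteed to match the tabulated figures, which may have been computed with a different scaling.

```python
import numpy as np

def dfbeta_closed_form(X, y):
    """Row i holds b - b_(-i), from (X'X)^-1 x_i' e_i / (1 - h_ii)."""
    XtX_inv = np.linalg.inv(X.T @ X)
    b = XtX_inv @ X.T @ y                   # full-sample OLS estimates
    e = y - X @ b                           # residuals
    XA = X @ XtX_inv                        # row i is x_i (X'X)^-1
    h = np.sum(XA * X, axis=1)              # leverages h_ii
    return XA * (e / (1.0 - h))[:, None]

def dfbeta_by_refit(X, y):
    """Same quantity, obtained by refitting without observation i."""
    b = np.linalg.lstsq(X, y, rcond=None)[0]
    rows = []
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        b_minus_i = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        rows.append(b - b_minus_i)
    return np.array(rows)

# Third dataset of Anscombe's quartet (x and y columns of the table).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
              6.08, 5.39, 8.15, 6.42, 5.73])
X = np.column_stack([np.ones_like(x), x])   # constant plus one regressor

closed = dfbeta_closed_form(X, y)
refit = dfbeta_by_refit(X, y)
agree = bool(np.allclose(closed, refit))    # do the two routes coincide?
most_influential = int(np.argmax(np.abs(closed[:, 1])))  # by slope change
print(agree, most_influential)
```

The closed form needs only one fit, whereas the refit route solves n separate least-squares problems; the comparison is included only as a sanity check on the algebra.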

Outliers, leverage and influence

An outlier may be defined as a data point that differs markedly from other observations.[5] [6] A high-leverage point is an observation made at extreme values of the independent variables.[7] Both types of atypical observations tend to pull the fitted regression line toward themselves. In Anscombe's quartet, the bottom right image has a point with high leverage and the bottom left image has an outlying point.
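
The difference between an outlier and a high-leverage point can be illustrated numerically. The sketch below assumes NumPy and reuses the third Anscombe dataset tabulated in the Assessment section; the variable names are illustrative. It computes the leverages (the diagonal of the projection matrix H) and the residuals, then reports which observation has the largest absolute residual and which have the highest leverage. In this dataset these turn out to be different observations: leverage depends only on the x values, while the residual depends on how far y lies from the fitted line.

```python
import numpy as np

# Third dataset of Anscombe's quartet (x and y columns of the DFBETA table).
x = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y = np.array([7.46, 6.77, 12.74, 7.11, 7.81, 8.84,
              6.08, 5.39, 8.15, 6.42, 5.73])
X = np.column_stack([np.ones_like(x), x])    # constant plus one regressor

# Projection ("hat") matrix and its diagonal, the leverages h_ii.
H = X @ np.linalg.inv(X.T @ X) @ X.T
h = np.diag(H)

# Ordinary least-squares fit and residuals.
b = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ b

largest_residual = int(np.argmax(np.abs(e)))             # outlying point
highest_leverage = np.flatnonzero(np.isclose(h, h.max()))  # extreme x values

print("largest |residual| at x =", x[largest_residual])
print("highest leverage at x =", x[highest_leverage])
```

A useful check on the computation is that the leverages sum to the number of columns of X, since the trace of a projection matrix equals its rank.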

Notes and References

  1. .
  2. Web site: Winner, Larry . Influence Statistics, Outliers, and Collinearity Diagnostics . March 25, 2002 .
  3. Book: Belsley, David A. . Kuh, Edwin . Welsch, Roy E. . 1980 . Regression Diagnostics: Identifying Influential Data and Sources of Collinearity . New York . Wiley Series in Probability and Mathematical Statistics . 0-471-05856-4 . 11–16 .
  4. Web site: Outliers and DFBETA . May 11, 2013 . https://web.archive.org/web/20130511013229/http://www.albany.edu/faculty/kretheme/PAD705/SupportMat/DFBETA.pdf .
  5. Grubbs, F. E. . February 1969 . Procedures for detecting outlying observations in samples . Technometrics . 11 . 1 . 1–21 . 10.1080/00401706.1969.10490657 . An outlying observation, or "outlier," is one that appears to deviate markedly from other members of the sample in which it occurs.
  6. Book: Maddala, G. S. . Outliers . Introduction to Econometrics . New York . MacMillan . 2nd . 1992 . 978-0-02-374545-4 . 89 . An outlier is an observation that is far removed from the rest of the observations. . https://books.google.com/books?id=nBS3AAAAIAAJ&pg=PA89 .
  7. Book: Everitt, B. S. . 2002 . Cambridge Dictionary of Statistics . Cambridge University Press . 0-521-81099-X .