In statistics, ordinary least squares (OLS) is a type of linear least squares method for choosing the unknown parameters in a linear regression model (with fixed level-one effects of a linear function of a set of explanatory variables) by the principle of least squares: minimizing the sum of the squares of the differences between the observed values of the dependent variable in the input dataset and the values predicted by the (linear) function of the independent variables. Some sources consider OLS to be synonymous with linear regression.[1]
Geometrically, this is seen as the sum of the squared distances, parallel to the axis of the dependent variable, between each data point in the set and the corresponding point on the regression surface—the smaller the differences, the better the model fits the data. The resulting estimator can be expressed by a simple formula, especially in the case of a simple linear regression, in which there is a single regressor on the right side of the regression equation.
The OLS estimator is consistent for the level-one fixed effects when the regressors are exogenous and there is no perfect multicollinearity among them (the rank condition), and consistent for the variance estimate of the residuals when the regressors have finite fourth moments.[2] By the Gauss–Markov theorem, it is optimal in the class of linear unbiased estimators when the errors are homoscedastic and serially uncorrelated. Under these conditions, the method of OLS provides minimum-variance mean-unbiased estimation when the errors have finite variances. Under the additional assumption that the errors are normally distributed with zero mean, OLS is the maximum likelihood estimator and outperforms any non-linear unbiased estimator.
See main article: Linear regression model. Suppose the data consists of $n$ observations $\left\{ x_i, y_i \right\}_{i=1}^{n}$. Each observation $i$ includes a scalar response $y_i$ and a column vector $x_i$ of $p$ regressors, $x_i = \left[ x_{i1}, x_{i2}, \dots, x_{ip} \right]^{\operatorname{T}}$. In a linear regression model, the response variable $y_i$ is a linear function of the regressors:

y_i = \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \varepsilon_i,
or in vector form,
y_i = x_i^{\operatorname{T}} \boldsymbol{\beta} + \varepsilon_i,

where $x_i$ is the column vector of regressors for the $i$-th observation, $\boldsymbol{\beta}$ is a $p \times 1$ vector of unknown parameters, and the scalar $\varepsilon_i$ is the unobserved error of the $i$-th observation, which accounts for the influences upon the response $y_i$ from sources other than the regressors $x_i$. This model can also be written in matrix notation as
y=X\boldsymbol{\beta}+\boldsymbol{\varepsilon},
where $y$ and $\boldsymbol{\varepsilon}$ are $n \times 1$ vectors of the response values and the errors for the $n$ observations, and $X$ is an $n \times p$ matrix of regressors (the design matrix), whose row $i$ is $x_i^{\operatorname{T}}$ and contains the $i$-th observation of all the explanatory variables.
Typically, a constant term is included in the set of regressors $X$, say, by taking $x_{i1} = 1$ for all $i = 1, \dots, n$. The coefficient $\beta_1$ corresponding to this regressor is called the intercept; without it, the fitted hyperplane is forced through the origin $x_i = \vec{0}$.
Regressors do not have to be independent of one another for estimation to be consistent, but perfect multicollinearity renders the parameters unidentifiable. As a concrete example where regressors are not independent, we might suspect the response depends linearly both on a value and on its square, in which case we would include one regressor whose value is the square of another regressor. In that case the model is quadratic in the second regressor, but it is nonetheless still considered a linear model because it remains linear in the parameters ($\boldsymbol{\beta}$).
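To illustrate this point, the following sketch (assuming NumPy is available; the simulated data and variable names are purely illustrative) builds a design matrix containing a constant, a regressor and its square; the model is quadratic in the regressor yet linear in β, so ordinary least squares applies unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2.0, 2.0, size=50)                               # a single observed regressor
y = 1.0 + 0.5 * x - 2.0 * x**2 + rng.normal(scale=0.1, size=50)   # simulated response

# Design matrix with a constant term, x, and x**2: quadratic in x,
# but linear in the parameters beta, so OLS applies unchanged.
X = np.column_stack([np.ones_like(x), x, x**2])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)   # should be close to [1.0, 0.5, -2.0]
```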
Consider an overdetermined system

\sum_{j=1}^{p} x_{ij} \beta_j = y_i, \qquad (i = 1, 2, \dots, n),

of $n$ linear equations in $p$ unknown coefficients $\beta_1, \beta_2, \dots, \beta_p$, with $n > p$. This can be written in matrix form as

X \boldsymbol{\beta} = y,

where

X = \begin{bmatrix} X_{11} & X_{12} & \cdots & X_{1p} \\ X_{21} & X_{22} & \cdots & X_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ X_{n1} & X_{n2} & \cdots & X_{np} \end{bmatrix}, \qquad \boldsymbol{\beta} = \begin{bmatrix} \beta_1 \\ \beta_2 \\ \vdots \\ \beta_p \end{bmatrix}, \qquad y = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}.

(Note: for a linear model as above, not all elements of $X$ contain information on the data points. The first column is populated with ones, $X_{i1} = 1$; only the other columns contain actual data, so here $p$ equals the number of regressors plus one.)
Such a system usually has no exact solution, so the goal is instead to find the coefficients $\boldsymbol{\beta}$ which fit the equations "best", in the sense of solving the quadratic minimization problem

\hat{\boldsymbol{\beta}} = \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}}\, S(\boldsymbol{\beta}),

where the objective function $S$ is given by

S(\boldsymbol{\beta}) = \sum_{i=1}^{n} \left| y_i - \sum_{j=1}^{p} X_{ij} \beta_j \right|^2 = \left\| y - X \boldsymbol{\beta} \right\|^2.
A justification for choosing this criterion is given in Properties below. This minimization problem has a unique solution, provided that the $p$ columns of the matrix $X$ are linearly independent, obtained by solving the so-called normal equations

\left(X^{\operatorname{T}} X\right) \hat{\boldsymbol{\beta}} = X^{\operatorname{T}} y.

The matrix $X^{\operatorname{T}} X$ is known as the normal matrix or Gram matrix, and the matrix $X^{\operatorname{T}} y$ is known as the moment matrix of the regressand by the regressors. Finally, $\hat{\boldsymbol{\beta}}$ is the coefficient vector of the least-squares hyperplane, expressed as

\hat{\boldsymbol{\beta}} = \left(X^{\operatorname{T}} X\right)^{-1} X^{\operatorname{T}} y,

or equivalently, substituting $y = X\boldsymbol{\beta} + \boldsymbol{\varepsilon}$,

\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + \left(X^{\operatorname{T}} X\right)^{-1} X^{\operatorname{T}} \boldsymbol{\varepsilon}.
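A minimal numerical sketch of this closed-form estimator, assuming NumPy and simulated data with illustrative names. Solving the normal equations directly is shown for transparency; in practice a QR- or SVD-based routine such as numpy.linalg.lstsq is numerically preferable.

```python
import numpy as np

def ols_via_normal_equations(X: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Return beta_hat = (X^T X)^{-1} X^T y by solving the normal equations."""
    return np.linalg.solve(X.T @ X, X.T @ y)   # avoids forming the explicit inverse

# Example with simulated data (names are illustrative).
rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.3, size=n)

beta_hat = ols_via_normal_equations(X, y)
print(beta_hat)   # close to beta_true
```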
Suppose b is a "candidate" value for the parameter vector β. The quantity $y_i - x_i^{\operatorname{T}} b$, called the residual for the i-th observation, measures the vertical distance between the data point $(x_i, y_i)$ and the hyperplane $y = x^{\operatorname{T}} b$, and thus assesses the degree of fit between the actual data and the model. The sum of squared residuals (SSR) (also called the error sum of squares (ESS) or residual sum of squares (RSS))[4] is a measure of the overall model fit:
S(b) = \sum_{i=1}^{n} \left( y_i - x_i^{\operatorname{T}} b \right)^2 = (y - Xb)^{\operatorname{T}} (y - Xb).

The value of b which minimizes this sum is called the OLS estimator for β. The function S(b) is quadratic in b with a positive-definite Hessian, so it possesses a unique global minimum at $b = \hat\beta$, given by the explicit formula

\hat\beta = \underset{b \in \mathbb{R}^p}{\operatorname{arg\,min}}\, S(b) = \left(X^{\operatorname{T}} X\right)^{-1} X^{\operatorname{T}} y.
The product $N = X^{\operatorname{T}} X$ is a Gram matrix, and its inverse, $Q = N^{-1}$, is the cofactor matrix of β,[5] [6] [7] closely related to its covariance matrix, $C_\beta$. The matrix $\left(X^{\operatorname{T}} X\right)^{-1} X^{\operatorname{T}} = Q X^{\operatorname{T}}$ is called the Moore–Penrose pseudoinverse matrix of X. This formulation highlights the point that estimation can be carried out if, and only if, there is no perfect multicollinearity between the explanatory variables (which would cause the Gram matrix to have no inverse).
After we have estimated β, the fitted values (or predicted values) from the regression will be

\hat{y} = X \hat\beta = P y,

where $P = X \left(X^{\operatorname{T}} X\right)^{-1} X^{\operatorname{T}}$ is the projection (hat) matrix onto the space spanned by the columns of X. The annihilator matrix $M = I_n - P$ projects onto the orthogonal complement of that space, and the residuals are

\hat\varepsilon = y - \hat{y} = y - X \hat\beta = M y = M (X\beta + \varepsilon) = (M X)\beta + M \varepsilon = M \varepsilon.
Using these residuals we can estimate the value of σ2 using the reduced chi-squared statistic:

s^2 = \frac{\hat\varepsilon^{\operatorname{T}} \hat\varepsilon}{n - p} = \frac{(My)^{\operatorname{T}} My}{n - p} = \frac{y^{\operatorname{T}} M^{\operatorname{T}} M y}{n - p} = \frac{y^{\operatorname{T}} M y}{n - p} = \frac{S(\hat\beta)}{n - p}, \qquad \hat\sigma^2 = \frac{n - p}{n}\, s^2,

where the denominator, n − p, is the statistical degrees of freedom. The first quantity, s2, is the OLS estimate of σ2, while the second, $\hat\sigma^2$, is the maximum likelihood estimate of σ2.
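A short sketch of these variance estimates (assuming NumPy; the simulated design and names are illustrative):

```python
import numpy as np

# Simulated data with illustrative names, fitted by OLS as above.
rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

y_fitted = X @ beta_hat                  # fitted values X * beta_hat
residuals = y - y_fitted                 # estimated residuals
s2 = residuals @ residuals / (n - p)     # unbiased OLS estimate of sigma^2
sigma2_mle = residuals @ residuals / n   # maximum-likelihood estimate of sigma^2
print(s2, sigma2_mle)                    # both close to 0.3**2 = 0.09
```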
It is common to assess the goodness-of-fit of the OLS regression by comparing how much the initial variation in the sample can be reduced by regressing onto X. The coefficient of determination R2 is defined as the ratio of the "explained" variance to the "total" variance of the dependent variable y, in the cases where the total sum of squares decomposes into the regression and residual sums of squares (which holds, for instance, when the regression includes a constant term):

R^2 = \frac{\sum (\hat{y}_i - \overline{y})^2}{\sum (y_i - \overline{y})^2} = \frac{y^{\operatorname{T}} P^{\operatorname{T}} L P y}{y^{\operatorname{T}} L y} = 1 - \frac{y^{\operatorname{T}} M y}{y^{\operatorname{T}} L y} = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}},

where TSS is the total sum of squares for the dependent variable, and $L = I_n - \tfrac{1}{n} J_n$ (with $J_n$ an n × n matrix of ones) is the centering matrix, which is equivalent to regression on a constant: it simply subtracts the mean from a variable.
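A brief sketch of the coefficient of determination under the same kind of illustrative setup (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
residuals = y - X @ np.linalg.solve(X.T @ X, X.T @ y)

tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
rss = residuals @ residuals         # residual sum of squares
r_squared = 1.0 - rss / tss
print(r_squared)
```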
The variance in the prediction of the independent variable as a function of the dependent variable is given in the article Polynomial least squares.
See main article: Simple linear regression. If the data matrix X contains only two variables, a constant and a scalar regressor xi, then this is called the "simple regression model". This case is often considered in introductory statistics classes, as it provides much simpler formulas, suitable even for manual calculation. The parameters are commonly denoted as (α, β):
y_i = \alpha + \beta x_i + \varepsilon_i.
The least squares estimates in this case are given by the simple formulas

\begin{align} \widehat\beta &= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \\[2pt] \widehat\alpha &= \bar{y} - \widehat\beta\,\bar{x}, \end{align}

where $\bar{x}$ and $\bar{y}$ denote the sample means of the regressor and of the response.
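A small worked sketch of these two formulas (assuming NumPy; the toy data are invented purely for illustration):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Slope from the ratio of sample covariance to sample variance, then the intercept.
beta_hat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
alpha_hat = y.mean() - beta_hat * x.mean()
print(alpha_hat, beta_hat)   # intercept ~0.14, slope ~1.96 for this toy data
```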
In the previous section the least squares estimator $\hat\beta$ was obtained as the value that minimizes the sum of squared residuals of the model. However, it is also possible to derive the same estimator from other approaches.
For mathematicians, OLS is an approximate solution to an overdetermined system of linear equations Xβ ≈ y, where β is the unknown. Assuming the system cannot be solved exactly (the number of equations n being much larger than the number of unknowns p), we look for a solution that provides the smallest discrepancy between the right- and left-hand sides, i.e. the solution that satisfies

\hat\beta = \underset{\boldsymbol{\beta}}{\operatorname{arg\,min}} \left\| y - X \boldsymbol{\beta} \right\|,

where ‖·‖ is the standard L2 norm on the n-dimensional Euclidean space Rn.
The gradient equations at this minimum can be written as:
(y - X\hat{\boldsymbol{\beta}})^{\top} X = 0.
A geometrical interpretation of these equations is that the vector of residuals, $y - X\hat{\boldsymbol{\beta}}$, is orthogonal to the column space of X, since the dot product $(y - X\hat{\boldsymbol{\beta}}) \cdot Xv$ is equal to zero for any conformal vector v. This means that $y - X\boldsymbol{\hat\beta}$ is the shortest of all possible vectors $y - X\boldsymbol{\beta}$, that is, the variance of the residuals is the smallest possible.
Introducing $\hat{\boldsymbol{\gamma}}$ and a matrix K with the assumption that the matrix $[X\ K]$ is non-singular and $K^{\top} X = 0$ (cf. orthogonal projections), the residual vector should satisfy $\hat{r} = y - X\hat{\boldsymbol{\beta}} = K \hat{\boldsymbol{\gamma}}$. The equation and solution of linear least squares are thus described as follows:

\begin{align} y &= \begin{bmatrix} X & K \end{bmatrix} \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix}, \\ \Rightarrow \begin{bmatrix} \hat{\boldsymbol{\beta}} \\ \hat{\boldsymbol{\gamma}} \end{bmatrix} &= \begin{bmatrix} X & K \end{bmatrix}^{-1} y = \begin{bmatrix} \left(X^{\top} X\right)^{-1} X^{\top} \\ \left(K^{\top} K\right)^{-1} K^{\top} \end{bmatrix} y. \end{align}
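The orthogonality property can be checked numerically; the sketch below (assuming NumPy, with illustrative simulated data) verifies that the gradient condition X^T(y − Xβ̂) = 0 holds up to rounding error.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# The residual vector is orthogonal to every column of X (up to floating-point error),
# which is exactly the normal-equation condition X^T (y - X beta_hat) = 0.
gradient = X.T @ (y - X @ beta_hat)
print(np.allclose(gradient, 0.0, atol=1e-8))   # True
```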
Another way of looking at it is to consider the regression line to be a weighted average of the lines passing through the combination of any two points in the dataset.[11] Although this way of calculation is more computationally expensive, it provides a better intuition on OLS.
The OLS estimator is identical to the maximum likelihood estimator (MLE) under the normality assumption for the error terms.[proof] This normality assumption has historical importance, as it provided the basis for the early work in linear regression analysis by Yule and Pearson. From the properties of MLE, we can infer that the OLS estimator is asymptotically efficient (in the sense of attaining the Cramér–Rao bound for variance) if the normality assumption is satisfied.
In the i.i.d. case, the OLS estimator can also be viewed as a GMM estimator arising from the moment conditions

\operatorname{E}\left[\, x_i \left( y_i - x_i^{\operatorname{T}} \beta \right) \right] = 0.
Note that the original strict exogeneity assumption implies a far richer set of moment conditions than stated above. In particular, this assumption implies that for any vector-function f, the moment condition $\operatorname{E}[\, f(x_i)\, \varepsilon_i \,] = 0$ will hold. However, it can be shown using the Gauss–Markov theorem that the optimal choice of function is $f(x) = x$, which results in the moment equation posted above.
There are several different frameworks in which the linear regression model can be cast in order to make the OLS technique applicable. Each of these settings produces the same formulas and same results. The only difference is the interpretation and the assumptions which have to be imposed in order for the method to give meaningful results. The choice of the applicable framework depends mostly on the nature of data in hand, and on the inference task which has to be performed.
One of the lines of difference in interpretation is whether to treat the regressors as random variables, or as predefined constants. In the first case (random design) the regressors xi are random and sampled together with the yis from some population, as in an observational study. This approach allows for more natural study of the asymptotic properties of the estimators. In the other interpretation (fixed design), the regressors X are treated as known constants set by a design, and y is sampled conditionally on the values of X as in an experiment. For practical purposes, this distinction is often unimportant, since estimation and inference is carried out while conditioning on X. All results stated in this article are within the random design framework.
The classical model focuses on "finite sample" estimation and inference, meaning that the number of observations n is fixed. This contrasts with the other approaches, which study the asymptotic behavior of OLS as the number of observations is allowed to grow to infinity.
In some applications, especially with cross-sectional data, an additional assumption is imposed: that all observations are independent and identically distributed. This means that all observations are taken from a random sample, which makes all the assumptions listed earlier simpler and easier to interpret. This framework also allows one to state asymptotic results (as the sample size n → ∞), which are understood as a theoretical possibility of fetching new independent observations from the data generating process.
First of all, under the strict exogeneity assumption the OLS estimators $\hat\beta$ and s2 are unbiased, meaning that their expected values coincide with the true values of the parameters:[proof]
\operatorname{E}[\hat\beta \mid X] = \beta, \qquad \operatorname{E}[s^2 \mid X] = \sigma^2.
The variance-covariance matrix (or simply covariance matrix) of $\hat\beta$ is equal to
\operatorname{Var}[\hat\beta \mid X] = \sigma^2 \left(X^{\operatorname{T}} X\right)^{-1} = \sigma^2 Q.
In particular, the standard error of each coefficient $\hat\beta_j$ is equal to the square root of the j-th diagonal element of this matrix; since σ2 is unknown, its estimate s2 is substituted, giving the estimated standard error

\widehat{\operatorname{s.e.}}(\hat{\beta}_j) = \sqrt{ s^2 \left[ \left(X^{\operatorname{T}} X\right)^{-1} \right]_{jj} }.
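A brief sketch computing the estimated covariance matrix and coefficient standard errors under an illustrative simulated setup (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
s2 = residuals @ residuals / (n - p)

# Estimated covariance matrix of beta_hat and the coefficient standard errors.
cov_beta = s2 * np.linalg.inv(X.T @ X)
std_errors = np.sqrt(np.diag(cov_beta))
print(std_errors)
```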
It can also be easily shown that the estimator $\hat\beta$ is uncorrelated with the residuals from the model:
\operatorname{Cov}[\hat\beta, \hat\varepsilon \mid X] = 0.
The Gauss–Markov theorem states that under the spherical errors assumption (that is, the errors should be uncorrelated and homoscedastic) the estimator $\hat\beta$ is efficient in the class of linear unbiased estimators. This is called the best linear unbiased estimator (BLUE). Efficiency should be understood in the sense that if we were to find some other estimator $\tilde\beta$ which is linear in y and unbiased, then

\operatorname{Var}[\tilde\beta \mid X] - \operatorname{Var}[\hat\beta \mid X] \geq 0

in the sense that this difference is a nonnegative-definite matrix.
The properties listed so far are all valid regardless of the underlying distribution of the error terms. However, if we are willing to assume that the normality assumption holds (that is, that $\varepsilon \sim \mathcal{N}(0, \sigma^2 I_n)$), then additional properties of the OLS estimators can be stated.
The estimator $\hat\beta$ is normally distributed, with mean and variance as given before:[14]

\hat\beta\ \sim\ \mathcal{N}\left(\beta,\ \sigma^2 \left(X^{\operatorname{T}} X\right)^{-1}\right).
The estimator s2 will be proportional to a chi-squared distribution:

s^2\ \sim\ \frac{\sigma^2}{n - p} \cdot \chi^2_{n-p}.
Moreover, the estimators $\hat\beta$ and s2 are independent, a fact which comes in useful when constructing the t- and F-tests for the regression.
See main article: Influential observation.
See also: Leverage (statistics).
As was mentioned before, the estimator $\hat\beta$ is linear in y, meaning that it represents a linear combination of the dependent variables yi. The weights in this linear combination are functions of the regressors X, and generally are unequal; observations with high weights are called influential because they have a more pronounced effect on the value of the estimator.
To analyze which observations are influential we remove a specific j-th observation and consider how much the estimated quantities are going to change (similarly to the jackknife method). It can be shown that the change in the OLS estimator for β will be equal to [16]
\hat\beta^{(j)} - \hat\beta = -\frac{1}{1 - h_j} \left(X^{\operatorname{T}} X\right)^{-1} x_j \hat\varepsilon_j,

where $h_j = x_j^{\operatorname{T}} \left(X^{\operatorname{T}} X\right)^{-1} x_j$ is the j-th diagonal element of the hat matrix P, and $x_j$ is the vector of regressors corresponding to the j-th observation. Similarly, the change in the predicted value for the j-th observation resulting from omitting that observation from the dataset will be equal to

\hat{y}_j^{(j)} - \hat{y}_j = x_j^{\operatorname{T}} \hat\beta^{(j)} - x_j^{\operatorname{T}} \hat\beta = -\frac{h_j}{1 - h_j}\, \hat\varepsilon_j.
From the properties of the hat matrix, $0 \le h_j \le 1$, and the leverages sum up to p, so that on average $h_j \approx p/n$. These quantities hj are called the leverages, and observations with high hj are called leverage points. Usually the observations with high leverage ought to be scrutinized more carefully, in case they are erroneous, or outliers, or in some other way atypical of the rest of the dataset.
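A sketch (assuming NumPy; simulated data with illustrative names) that computes the leverages and checks the leave-one-out formula against an explicit refit:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat

XtX_inv = np.linalg.inv(X.T @ X)
h = np.einsum("ij,jk,ik->i", X, XtX_inv, X)   # leverages: diagonal of the hat matrix
print(h.sum())                                # sums to p = 3

# Check the leave-one-out formula against an explicit refit without observation j.
j = 0
delta_formula = -(XtX_inv @ X[j]) * residuals[j] / (1.0 - h[j])
beta_minus, *_ = np.linalg.lstsq(np.delete(X, j, axis=0), np.delete(y, j), rcond=None)
print(np.allclose(beta_minus - beta_hat, delta_formula))   # True
```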
Sometimes the variables and corresponding parameters in the regression can be logically split into two groups, so that the regression takes the form

y = X_1 \beta_1 + X_2 \beta_2 + \varepsilon,

where $X_1$ and $X_2$ have dimensions n × p1 and n × p2, and $\beta_1$, $\beta_2$ are p1 × 1 and p2 × 1 vectors, with p1 + p2 = p.
The Frisch–Waugh–Lovell theorem states that in this regression the residuals $\hat\varepsilon$ and the OLS estimate $\hat\beta_2$ will be numerically identical to the residuals and the OLS estimate for β2 in the following regression:
M_1 y = M_1 X_2 \beta_2 + \eta,

where $M_1 = I_n - X_1 \left(X_1^{\operatorname{T}} X_1\right)^{-1} X_1^{\operatorname{T}}$ is the annihilator matrix for the regressors X1.
The theorem can be used to establish a number of theoretical results. For example, having a regression with a constant and another regressor is equivalent to subtracting the means from the dependent variable and the regressor and then running the regression for the de-meaned variables but without the constant term.
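The theorem can also be verified numerically; in the sketch below (assuming NumPy, with illustrative simulated data) the constant column plays the role of X1 and the remaining columns the role of X2:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Partial out the first group of regressors (here just the constant column),
# then regress the residualized y on the residualized X2.
X1, X2 = X[:, :1], X[:, 1:]
M1 = np.eye(n) - X1 @ np.linalg.inv(X1.T @ X1) @ X1.T   # annihilator matrix of X1
beta2_fwl, *_ = np.linalg.lstsq(M1 @ X2, M1 @ y, rcond=None)
print(np.allclose(beta2_fwl, beta_hat[1:]))              # True, as the theorem asserts
```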
See main article: Ridge regression.
Suppose it is known that the coefficients in the regression satisfy a system of linear equations

A\colon\quad Q^{\operatorname{T}} \beta = c,

where Q is a p × q matrix of full rank and c is a q × 1 vector of known constants, with q < p. In this case, least squares estimation is equivalent to minimizing the sum of squared residuals of the model subject to the constraint A. The constrained least squares (CLS) estimator can be given by an explicit formula:

\hat\beta^{c} = \hat\beta - \left(X^{\operatorname{T}} X\right)^{-1} Q \left( Q^{\operatorname{T}} \left(X^{\operatorname{T}} X\right)^{-1} Q \right)^{-1} \left( Q^{\operatorname{T}} \hat\beta - c \right).
This expression for the constrained estimator is valid as long as the matrix $X^{\operatorname{T}} X$ is invertible. It was assumed from the beginning of this article that this matrix is of full rank, and it was noted that when the rank condition fails, β will not be identifiable. However, it may happen that adding the restriction A makes β identifiable, in which case one would like to find the formula for the estimator. In that case the estimator is equal to
\hat\beta^{c} = R \left(R^{\operatorname{T}} X^{\operatorname{T}} X R\right)^{-1} R^{\operatorname{T}} X^{\operatorname{T}} y + \left( I_p - R \left(R^{\operatorname{T}} X^{\operatorname{T}} X R\right)^{-1} R^{\operatorname{T}} X^{\operatorname{T}} X \right) Q \left(Q^{\operatorname{T}} Q\right)^{-1} c,

where R is a p × (p − q) matrix such that the matrix [Q R] is non-singular and $R^{\operatorname{T}} Q = 0$.
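A sketch of the first (full-rank) constrained formula above, under a single hypothetical restriction β2 + β3 = 1 (assuming NumPy; the restriction and the simulated data are illustrative only):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical single restriction Q^T beta = c, here beta_2 + beta_3 = 1.
Q = np.array([[0.0], [1.0], [1.0]])    # p x q restriction matrix, q = 1
c = np.array([1.0])
XtX_inv = np.linalg.inv(X.T @ X)
correction = XtX_inv @ Q @ np.linalg.inv(Q.T @ XtX_inv @ Q) @ (Q.T @ beta_hat - c)
beta_constrained = beta_hat - correction
print(Q.T @ beta_constrained)          # [1.0]: the restriction holds exactly
```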
The least squares estimators are point estimates of the linear regression model parameters β. However, generally we also want to know how close those estimates might be to the true values of parameters. In other words, we want to construct the interval estimates.
Since we have not made any assumption about the distribution of the error terms εi, it is impossible to infer the distribution of the estimators $\hat\beta$ and $\hat\sigma^2$ in finite samples. Nevertheless, we can apply the central limit theorem to derive their asymptotic properties as the sample size n goes to infinity.
We can show that under the model assumptions, the least squares estimator for β is consistent (that is, $\hat\beta$ converges in probability to β) and asymptotically normal:

\left(\hat\beta - \beta\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\ \sigma^2 Q_{xx}^{-1}\right),

where $Q_{xx} = X^{\operatorname{T}} X$.
See main article: Confidence interval and Prediction interval.
Using this asymptotic distribution, approximate two-sided confidence intervals for the j-th component of the vector $\hat{\beta}$ can be constructed as

\beta_j \in \left[\ \hat\beta_j \pm q^{\mathcal{N}(0,1)}_{1 - \alpha/2} \sqrt{ \hat{\sigma}^2 \left[ Q_{xx}^{-1} \right]_{jj} }\ \right]

at the 1 − α confidence level, where q denotes the quantile function of the standard normal distribution.
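A sketch constructing approximate 95% intervals from this formula (assuming NumPy; the data are simulated for illustration, and the standard normal quantile 1.96 is hard-coded rather than looked up):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / n          # MLE variance estimate, as in the formula
std_err = np.sqrt(sigma2_hat * np.diag(np.linalg.inv(X.T @ X)))

z = 1.959964                                    # 0.975 quantile of the standard normal
for j in range(p):
    print(f"beta_{j + 1}: [{beta_hat[j] - z * std_err[j]:.4f}, {beta_hat[j] + z * std_err[j]:.4f}]")
```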
Similarly, the least squares estimator for σ2 is also consistent and asymptotically normal (provided that the fourth moment of εi exists) with limiting distribution
\left(\hat{\sigma}^2 - \sigma^2\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\ \operatorname{E}\left[\varepsilon_i^4\right] - \sigma^4 \right).
These asymptotic distributions can be used for prediction, testing hypotheses, constructing other estimators, etc. As an example consider the problem of prediction. Suppose $x_0$ is some point within the domain of the distribution of the regressors, and one wants to know what the response variable would have been at that point. The mean response is the quantity $y_0 = x_0^{\operatorname{T}} \beta$, whereas the predicted response is $\hat{y}_0 = x_0^{\operatorname{T}} \hat\beta$. Clearly the predicted response is a random variable; its distribution can be derived from that of $\hat{\beta}$:

\left(\hat{y}_0 - y_0\right)\ \xrightarrow{d}\ \mathcal{N}\left(0,\ \sigma^2\, x_0^{\operatorname{T}} Q_{xx}^{-1} x_0 \right),
which allows confidence intervals for the mean response $y_0$ to be constructed:

y_0 \in \left[\ x_0^{\operatorname{T}} \hat{\beta} \pm q^{\mathcal{N}(0,1)}_{1 - \alpha/2} \sqrt{ \hat\sigma^2\, x_0^{\operatorname{T}} Q_{xx}^{-1} x_0 }\ \right]

at the 1 − α confidence level.
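A sketch of such an interval at a hypothetical prediction point x0 (assuming NumPy; both the data and the point are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
y = X @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.3, size=n)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
residuals = y - X @ beta_hat
sigma2_hat = residuals @ residuals / n

# Hypothetical prediction point (its first entry is the constant regressor).
x0 = np.array([1.0, 0.5, -0.2])
y0_hat = x0 @ beta_hat
se_y0 = np.sqrt(sigma2_hat * x0 @ np.linalg.inv(X.T @ X) @ x0)
z = 1.959964   # 0.975 quantile of the standard normal
print(y0_hat - z * se_y0, y0_hat + z * se_y0)   # approximate 95% interval for the mean response
```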
See main article: Hypothesis testing.
Two hypothesis tests are particularly widely used. First, one wants to know if the estimated regression equation is any better than simply predicting that all values of the response variable equal its sample mean (if not, it is said to have no explanatory power). The null hypothesis of no explanatory value of the estimated regression is tested using an F-test. If the calculated F-value is found to be large enough to exceed its critical value for the pre-chosen level of significance, the null hypothesis is rejected and the alternative hypothesis, that the regression has explanatory power, is accepted. Otherwise, the null hypothesis of no explanatory power is accepted.
Second, for each explanatory variable of interest, one wants to know whether its estimated coefficient differs significantly from zero—that is, whether this particular explanatory variable in fact has explanatory power in predicting the response variable. Here the null hypothesis is that the true coefficient is zero. This hypothesis is tested by computing the coefficient's t-statistic, as the ratio of the coefficient estimate to its standard error. If the t-statistic is larger than a predetermined value, the null hypothesis is rejected and the variable is found to have explanatory power, with its coefficient significantly different from zero. Otherwise, the null hypothesis of a zero value of the true coefficient is accepted.
In addition, the Chow test is used to test whether two subsamples both have the same underlying true coefficient values. The sum of squared residuals of regressions on each of the subsets and on the combined data set are compared by computing an F-statistic; if this exceeds a critical value, the null hypothesis of no difference between the two subsets is rejected; otherwise, it is accepted.
The following data set gives average heights and weights for American women aged 30–39 (source: The World Almanac and Book of Facts, 1975).
| Height (m) | 1.47 | 1.50 | 1.52 | 1.55 | 1.57 | 1.60 | 1.63 | 1.65 | 1.68 | 1.70 | 1.73 | 1.75 | 1.78 | 1.80 | 1.83 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Weight (kg) | 52.21 | 53.12 | 54.48 | 55.84 | 57.20 | 58.57 | 59.93 | 61.29 | 63.11 | 64.47 | 66.28 | 68.10 | 69.92 | 72.19 | 74.46 |
When only one dependent variable is being modeled, a scatterplot will suggest the form and strength of the relationship between the dependent variable and regressors. It might also reveal outliers, heteroscedasticity, and other aspects of the data that may complicate the interpretation of a fitted regression model. The scatterplot suggests that the relationship is strong and can be approximated as a quadratic function. OLS can handle non-linear relationships by introducing the regressor HEIGHT2, the square of height. The regression model then becomes a multiple linear model:
w_i = \beta_1 + \beta_2 h_i + \beta_3 h_i^2 + \varepsilon_i.
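A sketch reproducing this fit from the data table above (assuming NumPy); the estimates should agree with the tabulated output below up to rounding.

```python
import numpy as np

height = np.array([1.47, 1.50, 1.52, 1.55, 1.57, 1.60, 1.63, 1.65,
                   1.68, 1.70, 1.73, 1.75, 1.78, 1.80, 1.83])
weight = np.array([52.21, 53.12, 54.48, 55.84, 57.20, 58.57, 59.93, 61.29,
                   63.11, 64.47, 66.28, 68.10, 69.92, 72.19, 74.46])

# Quadratic-in-height model: w = b1 + b2*h + b3*h^2 + error
X = np.column_stack([np.ones_like(height), height, height**2])
beta_hat, *_ = np.linalg.lstsq(X, weight, rcond=None)
print(beta_hat)   # approximately [128.81, -143.16, 61.96], as in the output table
```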
The output from most popular statistical packages will look similar to this:
| Method | Least squares |
|---|---|
| Dependent variable | WEIGHT |
| Observations | 15 |

| Parameter | Value | Std error | t-statistic | p-value |
|---|---|---|---|---|
| \beta_1 | 128.8128 | 16.3083 | 7.8986 | 0.0000 |
| \beta_2 | –143.1620 | 19.8332 | –7.2183 | 0.0000 |
| \beta_3 | 61.9603 | 6.0084 | 10.3122 | 0.0000 |

| R2 | 0.9989 | S.E. of regression | 0.2516 |
|---|---|---|---|
| Adjusted R2 | 0.9987 | Model sum-of-sq. | 692.61 |
| Log-likelihood | 1.0890 | Residual sum-of-sq. | 0.7595 |
| Durbin–Watson stat. | 2.1013 | Total sum-of-sq. | 693.37 |
| Akaike criterion | 0.2548 | F-statistic | 5471.2 |
| Schwarz criterion | 0.3964 | p-value (F-stat) | 0.0000 |
In this table:

- The Value column gives the least squares estimates of the parameters $\beta_j$.
- The Std error column shows the standard error of each coefficient estimate: $\hat\sigma_j = \left( \hat\sigma^2 \left[ Q_{xx}^{-1} \right]_{jj} \right)^{1/2}$.
- The t-statistic and p-value columns test whether each coefficient might be equal to zero; the t-statistic is calculated simply as $t = \hat\beta_j / \hat\sigma_j$.
- R2 is the coefficient of determination, indicating the goodness-of-fit of the regression.
- Adjusted R2 is a slightly modified version of $R^2$, designed to penalize an excess of regressors which do not add to the explanatory power of the regression: $\overline{R}^2 = 1 - \frac{n-1}{n-p}\left(1 - R^2\right)$.
Ordinary least squares analysis often includes the use of diagnostic plots designed to detect departures of the data from the assumed form of the model. Common diagnostic plots include residuals against the explanatory variables in the model, residuals against explanatory variables not in the model, residuals against the fitted values $\hat{y}$, and residuals against the preceding residual.
An important consideration when carrying out statistical inference using regression models is how the data were sampled. In this example, the data are averages rather than measurements on individual women. The fit of the model is very good, but this does not imply that the weight of an individual woman can be predicted with high accuracy based only on her height.
See main article: Errors-in-variables models.
This example also demonstrates that coefficients determined by these calculations are sensitive to how the data is prepared. The heights were originally given rounded to the nearest inch and have been converted and rounded to the nearest centimetre. Since the conversion factor is one inch to 2.54 cm this is not an exact conversion. The original inches can be recovered by Round(x/0.0254) and then re-converted to metric without rounding. If this is done the results become:
| | Const | Height | Height2 |
|---|---|---|---|
| Converted to metric with rounding | 128.8128 | −143.162 | 61.96033 |
| Converted to metric without rounding | 119.0205 | −131.5076 | 58.5046 |
Using either of these equations to predict the weight of a 5' 6" (1.6764 m) woman gives similar values: 62.94 kg with rounding vs. 62.98 kg without rounding. Thus a seemingly small variation in the data has a real effect on the coefficients but a small effect on the results of the equation.
While this may look innocuous in the middle of the data range it could become significant at the extremes or in the case where the fitted model is used to project outside the data range (extrapolation).
This highlights a common error: this example is an abuse of OLS, which inherently requires that the errors in the independent variable (in this case height) are zero or at least negligible. The initial rounding to the nearest inch plus any actual measurement errors constitute a finite and non-negligible error. As a result, the fitted parameters are not the best estimates they are presumed to be. Though not totally spurious, the error in the estimation will depend upon the relative size of the x and y errors.
We can use the least squares mechanism to figure out the equation of a two-body orbit in polar coordinates. The equation typically used is
r(\theta) = \frac{p}{1 - e \cos(\theta)},

where $r(\theta)$ is the distance of the orbiting body from the focus, $p$ is the semi-latus rectum, and $e$ is the eccentricity. The observations are given in the following table:
| \theta (degrees) | 43 | 45 | 52 | 93 | 108 | 116 |
|---|---|---|---|---|---|---|
| r(\theta) | 4.7126 | 4.5542 | 4.0419 | 2.2187 | 1.8910 | 1.7599 |
We need to find the least-squares estimates of $e$ and $p$ from these data.
First we need to represent e and p in a linear form, so we rewrite the equation for $r(\theta)$ as

\frac{1}{r(\theta)} = \frac{1}{p} - \frac{e}{p} \cos(\theta),

which is linear in the two unknowns $\tfrac{1}{p}$ and $\tfrac{e}{p}$, with $\cos(\theta)$ as the regressor. (If the orientation of the orbit were also unknown, one could expand $\cos(\theta - \theta_0) = \cos(\theta)\cos(\theta_0) + \sin(\theta)\sin(\theta_0)$ and fit coefficients of both $\cos(\theta)$ and $\sin(\theta)$, recovering the phase $\theta_0$ from the ratio of the two fitted coefficients, since $\tan\theta_0 = \sin(\theta_0)/\cos(\theta_0)$.)
Now we can use this linear form to represent our observational data as

A^{\operatorname{T}} A \binom{x}{y} = A^{\operatorname{T}} b,

where $x$ is $\tfrac{1}{p}$ and $y$ is $\tfrac{e}{p}$; $A$ is constructed with its first column being the coefficient of $\tfrac{1}{p}$ and its second column being the coefficient of $\tfrac{e}{p}$, and $b$ is the vector of observed values $\tfrac{1}{r(\theta)}$, so that

A = \begin{bmatrix} 1 & -0.731354 \\ 1 & -0.707107 \\ 1 & -0.615661 \\ 1 & 0.052336 \\ 1 & 0.309017 \\ 1 & 0.438371 \end{bmatrix}, \qquad b = \begin{bmatrix} 0.21220 \\ 0.21958 \\ 0.24741 \\ 0.45071 \\ 0.52883 \\ 0.56820 \end{bmatrix}.
On solving we get
\binom{x}{y}=\binom{0.43478}{0.30435}
so

p = \frac{1}{x} = 2.3000, \qquad e = p \cdot y = 0.70001.
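A sketch reproducing this computation (assuming NumPy; the angles are interpreted as degrees, consistent with the matrix A above):

```python
import numpy as np

theta_deg = np.array([43, 45, 52, 93, 108, 116])
r = np.array([4.7126, 4.5542, 4.0419, 2.2187, 1.8910, 1.7599])

# Linearized model: 1/r = (1/p) - (e/p) * cos(theta)
A = np.column_stack([np.ones_like(r), -np.cos(np.radians(theta_deg))])
b = 1.0 / r
xy, *_ = np.linalg.lstsq(A, b, rcond=None)   # least-squares solution, equivalent to A^T A [x, y]^T = A^T b
x, y = xy
p_hat = 1.0 / x
e_hat = p_hat * y
print(p_hat, e_hat)    # approximately 2.3000 and 0.7000
```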