The purpose of this page is to provide supplementary materials for the ordinary least squares article, reducing the mathematical load of the main article and improving its accessibility, while at the same time retaining completeness of exposition.
Define the i-th residual to be

r_i = y_i - \sum_{j=1}^n X_{ij}\beta_j.

Then the objective S can be rewritten as

S = \sum_{i=1}^m r_i^2.
Given that S is convex, it is minimized when its gradient vector is zero (this follows by definition: if the gradient vector is not zero, there is a direction in which we can move to decrease S further; see maxima and minima). The elements of the gradient vector are the partial derivatives of S with respect to the parameters:
\frac{\partial S}{\partial \beta_j} = 2\sum_{i=1}^m r_i\,\frac{\partial r_i}{\partial \beta_j} \qquad (j=1,2,\dots,n).
The derivatives are
\frac{\partial r_i}{\partial \beta_j} = -X_{ij}.
Substitution of the expressions for the residuals and the derivatives into the gradient equations gives
\frac{\partial S}{\partial \beta_j} = 2\sum_{i=1}^m \left(y_i - \sum_{k=1}^n X_{ik}\beta_k\right)(-X_{ij}) \qquad (j=1,2,\dots,n).
Thus if \widehat\beta minimizes S, we have

2\sum_{i=1}^m \left(y_i - \sum_{k=1}^n X_{ik}\widehat\beta_k\right)(-X_{ij}) = 0 \qquad (j=1,2,\dots,n).
Upon rearrangement, we obtain the normal equations:
\sum_{i=1}^m \sum_{k=1}^n X_{ij}X_{ik}\widehat\beta_k = \sum_{i=1}^m X_{ij}y_i \qquad (j=1,2,\dots,n).
The normal equations are written in matrix notation as
(X^{\rm T}X)\widehat{\boldsymbol{\beta}} = X^{\rm T}y,

where X^{\rm T} is the transpose of X. The solution of the normal equations yields the vector \widehat{\boldsymbol{\beta}} of the optimal parameter values.
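As a concrete illustration, the following minimal sketch (an illustrative addition with made-up data, not part of the derivation) solves the normal equations in NumPy and checks the defining property that the residuals are orthogonal to every column of X, i.e. X^{\rm T}(y - X\widehat\beta) = 0.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                      # m observations, n parameters
X = rng.normal(size=(m, n))       # design matrix
beta_true = np.array([1.5, -2.0, 0.7])
y = X @ beta_true + 0.1 * rng.normal(size=m)

# Solve the normal equations (X^T X) beta_hat = X^T y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Gradient condition: residuals are orthogonal to every column of X
residuals = y - X @ beta_hat
print(X.T @ residuals)            # ~ zero vector, up to floating-point error

# Same answer as NumPy's least-squares routine
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))
```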
The normal equations can be derived directly from a matrix representation of the problem as follows. The objective is to minimize
S(\boldsymbol{\beta}) = \left\| y - X\boldsymbol\beta \right\|^2 = (y - X\boldsymbol\beta)^{\rm T}(y - X\boldsymbol\beta) = y^{\rm T}y - \boldsymbol\beta^{\rm T}X^{\rm T}y - y^{\rm T}X\boldsymbol\beta + \boldsymbol\beta^{\rm T}X^{\rm T}X\boldsymbol\beta.

Here (\boldsymbol\beta^{\rm T}X^{\rm T}y)^{\rm T} = y^{\rm T}X\boldsymbol\beta is a 1×1 matrix, i.e. a scalar, and therefore equal to its own transpose, so that \boldsymbol\beta^{\rm T}X^{\rm T}y = y^{\rm T}X\boldsymbol\beta and the quantity to minimize becomes

S(\boldsymbol{\beta}) = y^{\rm T}y - 2\boldsymbol\beta^{\rm T}X^{\rm T}y + \boldsymbol\beta^{\rm T}X^{\rm T}X\boldsymbol\beta.

Differentiating this with respect to \boldsymbol\beta and equating to zero gives

-X^{\rm T}y + (X^{\rm T}X)\boldsymbol{\beta} = 0,

which is equivalent to the above-given normal equations. A sufficient condition for satisfaction of the second-order conditions for a minimum is that X have full column rank, in which case X^{\rm T}X is positive definite.
When X^{\rm T}X is positive definite, the formula for the minimizing value of \boldsymbol\beta can be derived without the use of derivatives. The quantity

S(\boldsymbol{\beta}) = y^{\rm T}y - 2\boldsymbol\beta^{\rm T}X^{\rm T}y + \boldsymbol\beta^{\rm T}X^{\rm T}X\boldsymbol\beta

can be written as

\langle\boldsymbol\beta, \boldsymbol\beta\rangle - 2\langle\boldsymbol\beta, (X^{\rm T}X)^{-1}X^{\rm T}y\rangle + \langle(X^{\rm T}X)^{-1}X^{\rm T}y, (X^{\rm T}X)^{-1}X^{\rm T}y\rangle + C,

where C depends only on y and X, and \langle\cdot,\cdot\rangle is the inner product defined by

\langle x, y\rangle = x^{\rm T}(X^{\rm T}X)y.

It follows that S(\boldsymbol{\beta}) is equal to

\langle\boldsymbol\beta - (X^{\rm T}X)^{-1}X^{\rm T}y,\ \boldsymbol\beta - (X^{\rm T}X)^{-1}X^{\rm T}y\rangle + C

and is therefore minimized exactly when

\boldsymbol\beta - (X^{\rm T}X)^{-1}X^{\rm T}y = 0.
In general, the coefficients of the matrices X, \boldsymbol{\beta} and y can be complex. By using a Hermitian transpose instead of a simple transpose, it is possible to find a vector \boldsymbol{\widehat{\beta}} which minimizes S(\boldsymbol{\beta}), just as for the real matrix case. In order to get the normal equations we follow a similar path as in the previous derivations:

\displaystyle S(\boldsymbol{\beta}) = \langle y - X\boldsymbol{\beta},\ y - X\boldsymbol{\beta}\rangle = \langle y, y\rangle - \overline{\langle X\boldsymbol{\beta}, y\rangle} - \overline{\langle y, X\boldsymbol{\beta}\rangle} + \langle X\boldsymbol{\beta}, X\boldsymbol{\beta}\rangle = y^{\rm T}\overline{y} - \boldsymbol\beta^\dagger X^\dagger y - y^\dagger X\boldsymbol\beta + \boldsymbol\beta^{\rm T}X^{\rm T}\overline{X}\,\overline{\boldsymbol\beta},

where \dagger stands for the Hermitian transpose.
We should now take derivatives of S(\boldsymbol{\beta}) with respect to each of the coefficients \beta_j, but first we separate real and imaginary parts to deal with the conjugate factors in the expression above. For the \beta_j we have

\beta_j = \beta_j^R + i\beta_j^I

and the derivatives change into

\frac{\partial S}{\partial \beta_j} = \frac{\partial S}{\partial \beta_j^R}\frac{\partial \beta_j^R}{\partial \beta_j} + \frac{\partial S}{\partial \beta_j^I}\frac{\partial \beta_j^I}{\partial \beta_j} = \frac{\partial S}{\partial \beta_j^R} - i\,\frac{\partial S}{\partial \beta_j^I} \qquad (j=1,2,3,\ldots,n).
After rewriting S(\boldsymbol{\beta}) in the summation form and writing \beta_j explicitly, the two partial derivatives become

\begin{align}
\frac{\partial S}{\partial \beta_j^R} ={}& -\sum_{i=1}^m\left(\overline{X}_{ij}y_i + \overline{y}_i X_{ij}\right) + 2\sum_{i=1}^m X_{ij}\overline{X}_{ij}\beta_j^R + \sum_{i=1}^m\sum_{k\neq j}^n\left(X_{ij}\overline{X}_{ik}\overline{\beta}_k + \beta_k X_{ik}\overline{X}_{ij}\right),\\[8pt]
-i\,\frac{\partial S}{\partial \beta_j^I} ={}& \sum_{i=1}^m\left(\overline{X}_{ij}y_i - \overline{y}_i X_{ij}\right) - 2i\sum_{i=1}^m X_{ij}\overline{X}_{ij}\beta_j^I + \sum_{i=1}^m\sum_{k\neq j}^n\left(X_{ij}\overline{X}_{ik}\overline{\beta}_k - \beta_k X_{ik}\overline{X}_{ij}\right),
\end{align}
which, after adding together and comparing to zero (the minimization condition for \boldsymbol{\widehat{\beta}}), yields

\sum_{i=1}^m X_{ij}\overline{y}_i = \sum_{i=1}^m \sum_{k=1}^n X_{ij}\overline{X}_{ik}\overline{\widehat{\beta}}_k \qquad (j=1,2,3,\ldots,n).

In matrix form:

\mathbf{X}^{\rm T}\overline{\mathbf{y}} = \mathbf{X}^{\rm T}\overline{\mathbf{X}}\,\overline{\boldsymbol{\widehat{\beta}}}, \quad \text{or equivalently} \quad \mathbf{X}^\dagger\mathbf{y} = \big(\mathbf{X}^\dagger\mathbf{X}\big)\boldsymbol{\widehat{\beta}}.
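To illustrate the complex case, here is a small sketch (illustrative only; the dimensions and data are made up) that solves the Hermitian normal equations X^\dagger X\widehat\beta = X^\dagger y for a complex design matrix and confirms the result agrees with NumPy's general least-squares solver, which also accepts complex inputs.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 40, 3
# Complex design matrix and observations
X = rng.normal(size=(m, n)) + 1j * rng.normal(size=(m, n))
beta_true = np.array([1.0 - 2.0j, 0.5 + 0.5j, -1.0j])
y = X @ beta_true + 0.05 * (rng.normal(size=m) + 1j * rng.normal(size=m))

# Hermitian (conjugate) transpose replaces the simple transpose
Xh = X.conj().T
beta_hat = np.linalg.solve(Xh @ X, Xh @ y)

# Agrees with the generic least-squares solver on complex data
print(np.allclose(beta_hat, np.linalg.lstsq(X, y, rcond=None)[0]))
```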
Using matrix notation, the sum of squared residuals is given by
S(\beta) = (y - X\beta)^{\rm T}(y - X\beta).
Since this is a quadratic expression, the vector which gives the global minimum may be found via matrix calculus by differentiating with respect to the vector \beta and setting the result equal to zero:

0 = \frac{dS}{d\beta}(\widehat\beta) = \frac{d}{d\beta}\left.\left(y^{\rm T}y - \beta^{\rm T}X^{\rm T}y - y^{\rm T}X\beta + \beta^{\rm T}X^{\rm T}X\beta\right)\right|_{\beta=\widehat\beta} = -2X^{\rm T}y + 2X^{\rm T}X\widehat\beta
By assumption, the matrix X has full column rank, and therefore X^{\rm T}X is invertible and the least squares estimator for \beta is given by

\widehat\beta = (X^{\rm T}X)^{-1}X^{\rm T}y
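The closed form (X^{\rm T}X)^{-1}X^{\rm T}y can be evaluated directly, although in numerical work one usually avoids forming the explicit inverse. The sketch below (an illustrative addition with made-up data) compares the textbook formula with np.linalg.lstsq, which solves the same problem through an orthogonal factorization, and checks the full-column-rank assumption.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n = 100, 4
X = rng.normal(size=(m, n))              # full column rank with probability 1
y = X @ np.array([2.0, -1.0, 0.5, 3.0]) + rng.normal(size=m)

beta_explicit = np.linalg.inv(X.T @ X) @ X.T @ y        # textbook formula
beta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]       # factorization-based solver

print(np.allclose(beta_explicit, beta_lstsq))           # True
print(np.linalg.matrix_rank(X) == n)                    # full column rank holds
```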
To show that \widehat\beta is unbiased, plug y = X\beta + \varepsilon into the formula for \widehat\beta and use the law of total expectation:

\begin{align}
\operatorname{E}[\widehat\beta] &= \operatorname{E}\big[(X^{\rm T}X)^{-1}X^{\rm T}(X\beta + \varepsilon)\big]\\
&= \beta + \operatorname{E}\big[(X^{\rm T}X)^{-1}X^{\rm T}\varepsilon\big]\\
&= \beta + \operatorname{E}\big[\operatorname{E}[(X^{\rm T}X)^{-1}X^{\rm T}\varepsilon \mid X]\big]\\
&= \beta + \operatorname{E}\big[(X^{\rm T}X)^{-1}X^{\rm T}\operatorname{E}[\varepsilon \mid X]\big]\\
&= \beta,
\end{align}

where \operatorname{E}[\varepsilon \mid X] = 0 by the assumptions of the model. Since the expected value of \widehat{\beta} equals the parameter it estimates, \beta, it is an unbiased estimator of \beta.
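Unbiasedness is easy to check by simulation. The following sketch (a hypothetical setup, not part of the proof) holds X fixed, draws many error vectors satisfying E[ε | X] = 0, and compares the average of the resulting estimates with the true β.

```python
import numpy as np

rng = np.random.default_rng(3)
m, n_params, n_reps = 30, 2, 20000
X = np.column_stack([np.ones(m), rng.normal(size=m)])   # fixed design
beta = np.array([1.0, 2.0])
XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)               # (X^T X)^{-1} X^T

estimates = np.empty((n_reps, n_params))
for r in range(n_reps):
    eps = rng.normal(scale=0.5, size=m)                  # E[eps | X] = 0
    y = X @ beta + eps
    estimates[r] = XtX_inv_Xt @ y

print(estimates.mean(axis=0))   # close to [1.0, 2.0]
```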
For the variance, let the covariance matrix of \varepsilon be \operatorname{E}[\varepsilon\varepsilon^{\rm T}] = \sigma^2 I, where I is the m \times m identity matrix, and treat X as a non-random (constant) matrix. Then,
\begin{align}
\operatorname{E}\big[(\widehat\beta - \beta)(\widehat\beta - \beta)^{\rm T}\big] &= \operatorname{E}\big[((X^{\rm T}X)^{-1}X^{\rm T}\varepsilon)((X^{\rm T}X)^{-1}X^{\rm T}\varepsilon)^{\rm T}\big]\\
&= \operatorname{E}\big[(X^{\rm T}X)^{-1}X^{\rm T}\varepsilon\varepsilon^{\rm T}X(X^{\rm T}X)^{-1}\big]\\
&= (X^{\rm T}X)^{-1}X^{\rm T}\operatorname{E}[\varepsilon\varepsilon^{\rm T}]X(X^{\rm T}X)^{-1}\\
&= (X^{\rm T}X)^{-1}X^{\rm T}\sigma^2 X(X^{\rm T}X)^{-1}\\
&= \sigma^2(X^{\rm T}X)^{-1}X^{\rm T}X(X^{\rm T}X)^{-1}\\
&= \sigma^2(X^{\rm T}X)^{-1},
\end{align}

where we used the fact that \widehat{\beta} - \beta is just an affine transformation of \varepsilon by the matrix (X^{\rm T}X)^{-1}X^{\rm T}.
For a simple linear regression model, where \beta = [\beta_0, \beta_1]^{\rm T} (\beta_0 is the y-intercept and \beta_1 is the slope), one obtains
\begin{align}
\sigma^2(X^{\rm T}X)^{-1} &= \sigma^2\left(\begin{pmatrix}1 & 1 & \cdots \\ x_1 & x_2 & \cdots\end{pmatrix}\begin{pmatrix}1 & x_1\\ 1 & x_2\\ \vdots & \vdots\end{pmatrix}\right)^{-1}\\[6pt]
&= \sigma^2\left(\sum_{i=1}^m\begin{pmatrix}1 & x_i\\ x_i & x_i^2\end{pmatrix}\right)^{-1}\\[6pt]
&= \sigma^2\begin{pmatrix}m & \sum x_i\\ \sum x_i & \sum x_i^2\end{pmatrix}^{-1}\\[6pt]
&= \sigma^2\cdot\frac{1}{m\sum x_i^2 - \left(\sum x_i\right)^2}\begin{pmatrix}\sum x_i^2 & -\sum x_i\\ -\sum x_i & m\end{pmatrix}\\[6pt]
&= \sigma^2\cdot\frac{1}{m\sum(x_i - \bar{x})^2}\begin{pmatrix}\sum x_i^2 & -\sum x_i\\ -\sum x_i & m\end{pmatrix}\\[8pt]
\operatorname{Var}(\widehat\beta_1) &= \frac{\sigma^2}{\sum(x_i - \bar{x})^2}.
\end{align}
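For the simple regression case, the slope entry of σ²(X^{\rm T}X)^{-1} can be compared both with the closed-form expression σ²/Σ(x_i − x̄)² and with the sampling variance observed in a simulation. A small sketch under made-up values:

```python
import numpy as np

rng = np.random.default_rng(4)
m, sigma = 25, 1.5
x = rng.uniform(0, 10, size=m)
X = np.column_stack([np.ones(m), x])
beta = np.array([0.5, 2.0])

# Theoretical variance of the slope estimator, two equivalent ways
var_matrix = sigma**2 * np.linalg.inv(X.T @ X)
var_slope_formula = sigma**2 / np.sum((x - x.mean())**2)
print(np.isclose(var_matrix[1, 1], var_slope_formula))   # True

# Monte Carlo check with X held fixed
slopes = []
for _ in range(20000):
    y = X @ beta + rng.normal(scale=sigma, size=m)
    slopes.append(np.linalg.lstsq(X, y, rcond=None)[0][1])
print(np.var(slopes), var_slope_formula)                  # close to each other
```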
Now consider the estimator \widehat\sigma^2 = \tfrac{1}{n}(y - X\widehat\beta)'(y - X\widehat\beta) = \tfrac{1}{n}y'My, where M = I - X(X'X)^{-1}X'. Plugging y = X\beta + \varepsilon into this expression and using the fact that MX = 0 (M projects onto the space orthogonal to the columns of X) gives

\widehat\sigma^2 = \tfrac{1}{n}y'My = \tfrac{1}{n}(X\beta + \varepsilon)'M(X\beta + \varepsilon) = \tfrac{1}{n}\varepsilon'M\varepsilon
Now we can recognize \varepsilon'M\varepsilon as a 1×1 matrix; such a matrix is equal to its own trace. This is useful because, by the properties of the trace operator, \operatorname{tr}(AB) = \operatorname{tr}(BA), and we can use this to separate the disturbance \varepsilon from the matrix M, which is a function of the regressors X:
\operatorname{E}[\widehat\sigma^2] = \tfrac{1}{n}\operatorname{E}\big[\operatorname{tr}(\varepsilon'M\varepsilon)\big] = \tfrac{1}{n}\operatorname{E}\big[\operatorname{tr}(M\varepsilon\varepsilon')\big] = \tfrac{1}{n}\operatorname{tr}\big(\operatorname{E}[M\varepsilon\varepsilon']\big)
Using the law of iterated expectations, this can be written as
\operatorname{E}[\widehat\sigma^2] = \tfrac{1}{n}\operatorname{tr}\big(\operatorname{E}\big[M\operatorname{E}[\varepsilon\varepsilon' \mid X]\big]\big) = \tfrac{1}{n}\operatorname{tr}\big(\operatorname{E}[\sigma^2 M I]\big) = \tfrac{1}{n}\sigma^2\operatorname{E}[\operatorname{tr} M]
Recall that M = I - P, where P is the orthogonal projection onto the linear space spanned by the columns of the matrix X. By the properties of a projection matrix, it has p = rank(X) eigenvalues equal to 1, and all its other eigenvalues are equal to 0. The trace of a matrix is equal to the sum of its eigenvalues, thus tr(P) = p and tr(M) = n - p. Therefore,
\operatorname{E}[\widehat\sigma^2] = \frac{n-p}{n}\sigma^2
Since the expected value of \widehat\sigma^2 does not equal the parameter being estimated, \sigma^2, the estimator \widehat\sigma^2 is biased. Note that the derivation above used only \operatorname{E}[\varepsilon\varepsilon' \mid X] = \sigma^2 I, so this result holds regardless of the distribution of the errors.
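The bias factor (n − p)/n can be seen in a quick simulation. The sketch below (illustrative values) deliberately uses uniform rather than normal errors and shows that the average of σ̂² = ε̂'ε̂/n is still close to (n − p)/n · σ² rather than σ².

```python
import numpy as np

rng = np.random.default_rng(5)
n, p = 20, 3
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
# Uniform errors on [-a, a] have variance a^2 / 3
a = 2.0
sigma2 = a**2 / 3

sig2_hats = []
for _ in range(50000):
    eps = rng.uniform(-a, a, size=n)
    y = X @ beta + eps
    resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
    sig2_hats.append(resid @ resid / n)

print(np.mean(sig2_hats))          # close to (n - p)/n * sigma2, not sigma2
print((n - p) / n * sigma2)
```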
Turning to the consistency and asymptotic normality of \widehat\beta, note that the estimator \widehat\beta can be written as
\widehat\beta = \big(\tfrac{1}{n}X'X\big)^{-1}\tfrac{1}{n}X'y = \beta + \big(\tfrac{1}{n}X'X\big)^{-1}\tfrac{1}{n}X'\varepsilon = \beta + \left(\frac{1}{n}\sum_{i=1}^n x_ix_i'\right)^{-1}\left(\frac{1}{n}\sum_{i=1}^n x_i\varepsilon_i\right)
By the law of large numbers,

\frac{1}{n}\sum_{i=1}^n x_ix_i' \xrightarrow{p} \operatorname{E}[x_ix_i'] = \frac{Q_{xx}}{n}, \qquad \frac{1}{n}\sum_{i=1}^n x_i\varepsilon_i \xrightarrow{p} \operatorname{E}[x_i\varepsilon_i] = 0.
By Slutsky's theorem and the continuous mapping theorem, these results can be combined to establish the consistency of the estimator \widehat\beta:

\widehat\beta \xrightarrow{p} \beta + nQ_{xx}^{-1}\cdot 0 = \beta
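Consistency can be illustrated by letting the sample size grow. In the sketch below (a made-up data-generating process whose regressors and errors satisfy E[x_iε_i] = 0), the worst-case coefficient error shrinks as n increases.

```python
import numpy as np

rng = np.random.default_rng(6)
beta = np.array([1.0, -0.5, 2.0])

for n in [10**2, 10**3, 10**4, 10**5]:
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    eps = rng.standard_t(df=5, size=n)           # zero-mean, non-normal errors
    y = X @ beta + eps
    beta_hat = np.linalg.lstsq(X, y, rcond=None)[0]
    print(n, np.abs(beta_hat - beta).max())      # error shrinks as n grows
```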
The central limit theorem tells us that

\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i\varepsilon_i \xrightarrow{d} \mathcal{N}(0, V), \qquad \text{where } V = \operatorname{Var}[x_i\varepsilon_i] = \operatorname{E}\big[\varepsilon_i^2 x_ix_i'\big] = \operatorname{E}\big[\operatorname{E}[\varepsilon_i^2 \mid x_i]\,x_ix_i'\big] = \sigma^2\frac{Q_{xx}}{n}
Applying Slutsky's theorem again, we have

\sqrt{n}(\widehat\beta - \beta) = \left(\frac{1}{n}\sum_{i=1}^n x_ix_i'\right)^{-1}\frac{1}{\sqrt{n}}\sum_{i=1}^n x_i\varepsilon_i \xrightarrow{d} nQ_{xx}^{-1}\cdot\mathcal{N}\!\left(0, \sigma^2\frac{Q_{xx}}{n}\right) = \mathcal{N}\big(0, \sigma^2 nQ_{xx}^{-1}\big)
Maximum likelihood estimation is a generic technique for estimating the unknown parameters in a statistical model by constructing a log-likelihood function corresponding to the joint distribution of the data, then maximizing this function over all possible parameter values. In order to apply this method, we have to make an assumption about the distribution of y given X so that the log-likelihood function can be constructed. The connection of maximum likelihood estimation to OLS arises when this distribution is modeled as a multivariate normal.
Specifically, assume that the errors ε have a multivariate normal distribution with mean 0 and variance matrix σ2I. Then the distribution of y conditionally on X is

y \mid X \sim \mathcal{N}(X\beta, \sigma^2 I),

and the log-likelihood function of the data is
\begin{align}
\mathcal{L}(\beta,\sigma^2 \mid X) &= \ln\!\left(\frac{1}{(2\pi)^{n/2}(\sigma^2)^{n/2}}\,e^{-\frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta)}\right)\\[6pt]
&= -\frac{n}{2}\ln 2\pi - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}(y-X\beta)'(y-X\beta)
\end{align}
Differentiating this expression with respect to \beta and \sigma^2 gives the first-order conditions for a maximum:

\begin{align}
\frac{\partial\mathcal{L}}{\partial\beta} &= -\frac{1}{2\sigma^2}\big({-2}X'y + 2X'X\beta\big) = 0 \quad\Rightarrow\quad \widehat\beta = (X'X)^{-1}X'y,\\[6pt]
\frac{\partial\mathcal{L}}{\partial\sigma^2} &= -\frac{n}{2}\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}(y - X\beta)'(y - X\beta) = 0 \quad\Rightarrow\quad \widehat\sigma^2 = \frac{1}{n}(y - X\widehat\beta)'(y - X\widehat\beta),
\end{align}

so the maximum likelihood estimators coincide with the OLS estimator of \beta and with the variance estimator \widehat\sigma^2 considered above.
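Under the normality assumption, maximizing the log-likelihood numerically should reproduce the closed-form answers. The sketch below (illustrative data; it uses scipy.optimize.minimize on the negative log-likelihood, which is one way but not the only way to do this) recovers β̂ = (X'X)^{-1}X'y and σ̂² = (1/n)(y − Xβ̂)'(y − Xβ̂) up to optimizer tolerance.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(7)
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.3, size=n)

def neg_log_likelihood(params):
    beta, log_sigma2 = params[:p], params[p]
    sigma2 = np.exp(log_sigma2)          # enforce sigma^2 > 0
    resid = y - X @ beta
    return (0.5 * n * np.log(2 * np.pi) + 0.5 * n * log_sigma2
            + 0.5 * resid @ resid / sigma2)

res = minimize(neg_log_likelihood, x0=np.zeros(p + 1), method="BFGS")
beta_ml, sigma2_ml = res.x[:p], np.exp(res.x[p])

beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]
sigma2_ols = np.sum((y - X @ beta_ols)**2) / n
print(np.allclose(beta_ml, beta_ols, atol=1e-3))
print(np.isclose(sigma2_ml, sigma2_ols, atol=1e-3))
```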
Since we have assumed in this section that the distribution of the error terms is known to be normal, it becomes possible to derive the explicit expressions for the distributions of the estimators \widehat\beta and \widehat\sigma^2. Writing

\widehat\beta = (X'X)^{-1}X'y = (X'X)^{-1}X'(X\beta + \varepsilon) = \beta + (X'X)^{-1}X'\mathcal{N}(0, \sigma^2 I),

the linear transformation properties of the multivariate normal distribution give

\widehat\beta \mid X \sim \mathcal{N}\big(\beta, \sigma^2(X'X)^{-1}\big).
Similarly, the distribution of \widehat\sigma^2 follows from

\begin{align}
\widehat\sigma^2 &= \tfrac{1}{n}\big(y - X(X'X)^{-1}X'y\big)'\big(y - X(X'X)^{-1}X'y\big)\\[5pt]
&= \tfrac{1}{n}(My)'My\\[5pt]
&= \tfrac{1}{n}(X\beta + \varepsilon)'M(X\beta + \varepsilon)\\[5pt]
&= \tfrac{1}{n}\varepsilon'M\varepsilon,
\end{align}

where M = I - X(X'X)^{-1}X' is the symmetric, idempotent matrix that projects onto the space orthogonal to the columns of X. As noted above, M has rank n - p, so by the properties of the chi-squared distribution

\tfrac{n}{\sigma^2}\widehat\sigma^2 \mid X = (\varepsilon/\sigma)'M(\varepsilon/\sigma) \sim \chi^2_{n-p}
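A quick Monte Carlo sketch (made-up dimensions, not part of the proof) can be used to compare draws of nσ̂²/σ² with the χ²_{n−p} distribution through their mean, variance, and a Kolmogorov–Smirnov test.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
n, p, sigma = 15, 4, 2.0
X = rng.normal(size=(n, p))
M = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)   # annihilator matrix

draws = []
for _ in range(20000):
    eps = rng.normal(scale=sigma, size=n)
    sigma2_hat = eps @ M @ eps / n
    draws.append(n * sigma2_hat / sigma**2)
draws = np.array(draws)

print(draws.mean(), n - p)            # mean of chi2_{n-p} is n - p
print(draws.var(), 2 * (n - p))       # variance is 2(n - p)
print(stats.kstest(draws, "chi2", args=(n - p,)).pvalue)   # typically not small
```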
Moreover, the estimators \widehat\beta and \widehat\sigma^2 turn out to be independent (conditional on X), a fact which is fundamental for the construction of the classical t- and F-tests. The independence can be seen as follows: the estimator \widehat\beta represents the coefficients of the vector decomposition of

\widehat{y} = X\widehat\beta = Py = X\beta + P\varepsilon

by the basis of the columns of X; as such, \widehat\beta is a function of P\varepsilon. At the same time, the estimator \widehat\sigma^2 is the squared norm of the vector M\varepsilon divided by n, and thus is a function of M\varepsilon. The random variables (P\varepsilon, M\varepsilon) are jointly normal as a linear transformation of \varepsilon, and they are uncorrelated because PM = 0. By the properties of the multivariate normal distribution, this means that P\varepsilon and M\varepsilon are independent, and therefore the estimators \widehat\beta and \widehat\sigma^2 are independent as well.
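This independence can also be eyeballed in simulation: across repeated normal error draws with X held fixed, the sample correlation between any coefficient estimate and σ̂² should be near zero. A small sketch with made-up values:

```python
import numpy as np

rng = np.random.default_rng(9)
n, p, sigma = 30, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([0.3, 1.0, -1.5])
solve_mat = np.linalg.solve(X.T @ X, X.T)     # maps y to beta_hat

slopes, sig2_hats = [], []
for _ in range(30000):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    b = solve_mat @ y
    resid = y - X @ b
    slopes.append(b[1])
    sig2_hats.append(resid @ resid / n)

print(np.corrcoef(slopes, sig2_hats)[0, 1])   # close to 0
```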
Finally, consider the simple linear regression model y_i = \alpha + \beta x_i + \varepsilon_i. We look for \widehat{\alpha} and \widehat{\beta} that minimize the sum of squared errors (SSE):

\min_{\widehat{\alpha},\,\widehat{\beta}}\operatorname{SSE}\big(\widehat{\alpha},\widehat{\beta}\big) \equiv \min_{\widehat{\alpha},\,\widehat{\beta}}\sum_{i=1}^n\left(y_i - \widehat{\alpha} - \widehat{\beta}x_i\right)^2
To find a minimum, take the partial derivatives with respect to \widehat{\alpha} and \widehat{\beta}. Setting the derivative with respect to \widehat{\alpha} to zero gives

\begin{align}
&\frac{\partial}{\partial\widehat{\alpha}}\left(\operatorname{SSE}\big(\widehat{\alpha},\widehat{\beta}\big)\right) = -2\sum_{i=1}^n\left(y_i - \widehat{\alpha} - \widehat{\beta}x_i\right) = 0\\[4pt]
\Rightarrow\ &\sum_{i=1}^n y_i = n\widehat{\alpha} + \widehat{\beta}\sum_{i=1}^n x_i \quad\Rightarrow\quad \widehat{\alpha} = \bar{y} - \widehat{\beta}\bar{x}.
\end{align}
Before taking the partial derivative with respect to \widehat{\beta}, substitute this expression for \widehat{\alpha}:
\min_{\widehat{\alpha},\,\widehat{\beta}}\sum_{i=1}^n\left[y_i - \big(\bar{y} - \widehat{\beta}\bar{x}\big) - \widehat{\beta}x_i\right]^2 = \min_{\widehat{\alpha},\,\widehat{\beta}}\sum_{i=1}^n\left[\big(y_i - \bar{y}\big) - \widehat{\beta}\big(x_i - \bar{x}\big)\right]^2
Now, take the derivative with respect to \widehat{\beta} and set it to zero:

\begin{align}
&\frac{\partial}{\partial\widehat{\beta}}\left(\sum_{i=1}^n\left[\big(y_i - \bar{y}\big) - \widehat{\beta}\big(x_i - \bar{x}\big)\right]^2\right) = -2\sum_{i=1}^n\big(x_i - \bar{x}\big)\left[\big(y_i - \bar{y}\big) - \widehat{\beta}\big(x_i - \bar{x}\big)\right] = 0\\[4pt]
\Rightarrow\ &\widehat{\beta} = \frac{\sum_{i=1}^n\big(x_i - \bar{x}\big)\big(y_i - \bar{y}\big)}{\sum_{i=1}^n\big(x_i - \bar{x}\big)^2}.
\end{align}
And finally, substitute \widehat{\beta} back to determine \widehat{\alpha}:

\widehat{\alpha} = \bar{y} - \widehat{\beta}\bar{x}
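The resulting pair of formulas is easy to verify numerically. The sketch below (made-up data, illustrative only) computes β̂ from the centered sums and α̂ = ȳ − β̂x̄, and compares them with np.polyfit, which fits the same line.

```python
import numpy as np

rng = np.random.default_rng(10)
n = 200
x = rng.uniform(-5, 5, size=n)
y = 1.2 + 3.4 * x + rng.normal(scale=0.8, size=n)

x_bar, y_bar = x.mean(), y.mean()
beta_hat = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar)**2)
alpha_hat = y_bar - beta_hat * x_bar

slope, intercept = np.polyfit(x, y, 1)        # highest power first
print(np.isclose(beta_hat, slope), np.isclose(alpha_hat, intercept))
```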