In statistics, econometrics, epidemiology and related disciplines, the method of instrumental variables (IV) is used to estimate causal relationships when controlled experiments are not feasible or when a treatment is not successfully delivered to every unit in a randomized experiment.[1] Intuitively, IVs are used when an explanatory variable of interest is correlated with the error term (endogenous), in which case ordinary least squares and ANOVA give biased results. A valid instrument induces changes in the explanatory variable (is correlated with the endogenous variable) but has no independent effect on the dependent variable and is not correlated with the error term, allowing a researcher to uncover the causal effect of the explanatory variable on the dependent variable.
Instrumental variable methods allow for consistent estimation when the explanatory variables (covariates) are correlated with the error terms in a regression model. Such correlation may occur when changes in the dependent variable change the value of at least one of the covariates ("reverse" causation), when there are omitted variables that affect both the dependent and independent variables, or when the covariates are subject to non-random measurement error.
Explanatory variables that suffer from one or more of these issues in the context of a regression are sometimes referred to as endogenous. In this situation, ordinary least squares produces biased and inconsistent estimates.[2] However, if an instrument is available, consistent estimates may still be obtained. An instrument is a variable that does not itself belong in the explanatory equation but is correlated with the endogenous explanatory variables, conditionally on the value of other covariates.
In linear models, there are two main requirements for using IVs: the instrument must be correlated with the endogenous explanatory variables, conditionally on the other covariates (if this correlation is strong, the instrument is said to have a strong first stage, while a weak correlation may provide misleading inferences about parameter estimates and standard errors[3][4]), and the instrument cannot be correlated with the error term in the explanatory equation, conditionally on the other covariates (the exclusion restriction).
Informally, in attempting to estimate the causal effect of some variable X ("covariate" or "explanatory variable") on another Y ("dependent variable"), an instrument is a third variable Z which affects Y only through its effect on X.
For example, suppose a researcher wishes to estimate the causal effect of smoking (X) on general health (Y).[5] Correlation between smoking and health does not imply that smoking causes poor health because other variables, such as depression, may affect both health and smoking, or because health may affect smoking. It is not possible to conduct controlled experiments on smoking status in the general population. The researcher may attempt to estimate the causal effect of smoking on health from observational data by using the tax rate for tobacco products (Z) as an instrument for smoking. The tax rate for tobacco products is a reasonable choice for an instrument because the researcher assumes that it can only be correlated with health through its effect on smoking. If the researcher then finds tobacco taxes and state of health to be correlated, this may be viewed as evidence that smoking causes changes in health.
The first use of an instrumental variable occurred in a 1928 book by Philip G. Wright, best known for his excellent description of the production, transport and sale of vegetable and animal oils in the early 1900s in the United States.[6][7] In 1945, Olav Reiersøl applied the same approach in the context of errors-in-variables models in his dissertation, giving the method its name.[8]
Wright attempted to determine the supply and demand for butter using panel data on prices and quantities sold in the United States. The idea was that a regression analysis could produce a demand or supply curve because they are formed by the path between prices and quantities demanded or supplied. The problem was that the observational data did not form a demand or supply curve as such, but rather a cloud of point observations that took different shapes under varying market conditions. It seemed that making deductions from the data remained elusive.
The problem was that price affected both supply and demand so that a function describing only one of the two could not be constructed directly from the observational data. Wright correctly concluded that he needed a variable that correlated with either demand or supply but not both – that is, an instrumental variable.
After much deliberation, Wright decided to use regional rainfall as his instrumental variable: he concluded that rainfall affected grass production and hence milk production and ultimately butter supply, but not butter demand. In this way he was able to construct a regression equation with only the instrumental variable of price and supply.[9]
Formal definitions of instrumental variables, using counterfactuals and graphical criteria, were given by Judea Pearl in 2000. Angrist and Krueger (2001) present a survey of the history and uses of instrumental variable techniques.[10] Notions of causality in econometrics, and their relationship with instrumental variables and other methods, are discussed by Heckman (2008).[11]
While the ideas behind IV extend to a broad class of models, a very common context for IV is in linear regression. Traditionally,[12] an instrumental variable is defined as a variable Z that is correlated with the independent variable X and uncorrelated with the "error term" U in the linear equation
Y = X\beta + U.

Here Y is a vector of dependent variables, X is a matrix of covariates (usually including a column of ones for the constant term), \beta is the parameter vector of interest, and U is the error term. Ordinary least squares chooses the estimate \widehat{\beta} so that the fitted residuals \widehat{U} are uncorrelated with the regressors, \operatorname{cov}(X,\widehat{U}) = 0: minimizing the sum of squared errors,

\min_{\beta}\,(Y - X\beta)'(Y - X\beta),

yields the first-order condition

X'(Y - X\widehat{\beta}) = X'\widehat{U} = 0.

If the true model is believed to have \operatorname{cov}(X,U) \neq 0, for example because an omitted variable affects both X and Y separately, then this OLS procedure does not yield the causal effect of X on Y; OLS simply picks the parameter that makes the resulting residuals appear uncorrelated with X.
Consider for simplicity the single-variable case. Suppose we are considering a regression with one variable and a constant (perhaps no other covariates are necessary, or perhaps we have partialed out any other relevant covariates):
y = \alpha + \beta x + u
In this case, the coefficient on the regressor of interest is given by
\widehat{\beta} = \frac{\operatorname{cov}(x,y)}{\operatorname{var}(x)}.

Substituting for y gives

\begin{align}
\widehat{\beta} &= \frac{\operatorname{cov}(x,y)}{\operatorname{var}(x)}
= \frac{\operatorname{cov}(x,\alpha+\beta x+u)}{\operatorname{var}(x)} \\[6pt]
&= \frac{\operatorname{cov}(x,\alpha+\beta x)}{\operatorname{var}(x)}
+ \frac{\operatorname{cov}(x,u)}{\operatorname{var}(x)}
= \beta^* + \frac{\operatorname{cov}(x,u)}{\operatorname{var}(x)},
\end{align}

where \beta^* is what the estimated coefficient would be if x were uncorrelated with u; in that case \beta^* equals the causal parameter \beta. If instead \operatorname{cov}(x,u) \neq 0 in the underlying model, OLS gives a coefficient \widehat{\beta} that is contaminated by the bias term and does not recover the causal effect of x on y. IV helps to fix this problem by identifying the parameter \beta not based on whether x is uncorrelated with u, but based on whether another variable z is. If theory suggests that z is related to x (the first stage) but uncorrelated with u (the exclusion restriction), then IV may identify the causal parameter where OLS fails.
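To make the bias term concrete, here is a minimal simulation sketch in Python; the data-generating process, coefficients, and variable names are illustrative assumptions, not taken from the source. It compares the OLS slope cov(x,y)/var(x) with the simple IV (Wald) ratio cov(z,y)/cov(z,x) when x is endogenous.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Illustrative data-generating process: u confounds x and y,
# while the instrument z shifts x but is independent of u.
z = rng.normal(size=n)
u = rng.normal(size=n)
x = 0.8 * z + u + rng.normal(size=n)   # cov(x, u) != 0, so x is endogenous
beta = 2.0
y = 0.5 + beta * x + u                 # structural equation, true beta = 2

c_xy = np.cov(x, y)
beta_ols = c_xy[0, 1] / c_xy[0, 0]     # OLS slope: picks up cov(x, u)/var(x)

# IV (Wald) slope: cov(z, y) / cov(z, x) is consistent because cov(z, u) = 0
beta_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]

print(f"OLS estimate: {beta_ols:.3f}")   # noticeably above 2 (positive bias)
print(f"IV estimate:  {beta_iv:.3f}")    # close to 2
```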
IV techniques have been developed for a much broader class of non-linear models. General definitions of instrumental variables, using counterfactual and graphical formalism, were given by Pearl (2000; p. 248).[13] The graphical definition requires that Z satisfy the following conditions:
(Z \perp\perp Y)_{G_{\overline{X}}} \quad and \quad (Z \not\perp\perp X)_G

where \perp\perp stands for d-separation and G_{\overline{X}} stands for the graph in which all arrows entering X are cut off.
The counterfactual definition requires that Z satisfies
(Z \perp\perp Y_x) \quad and \quad (Z \not\perp\perp X)

where Y_x stands for the value that Y would attain had X been x, and \perp\perp stands for statistical independence.
If there are additional covariates W then the above definitions are modified so that Z qualifies as an instrument if the given criteria hold conditional on W.
The essence of Pearl's definition is:
These conditions do not rely on the specific functional form of the equations and are therefore applicable to nonlinear equations, where U can be non-additive (see Non-parametric analysis). They are also applicable to a system of multiple equations, in which X (and other factors) affect Y through several intermediate variables. An instrumental variable need not be a cause of X; a proxy of such a cause may also be used, if it satisfies conditions 1–5.[13] The exclusion restriction (condition 4) is redundant; it follows from conditions 2 and 3.
Since U is unobserved, the requirement that Z be independent of U cannot be inferred from data and must instead be determined from the model structure, i.e., the data-generating process. Causal graphs are a representation of this structure, and the graphical definition given above can be used to quickly determine whether a variable Z qualifies as an instrumental variable given a set of covariates W. To see how, consider the following example.
Suppose that we wish to estimate the effect of a university tutoring program on grade point average (GPA). The relationship between attending the tutoring program and GPA may be confounded by a number of factors: students who attend the tutoring program may care more about their grades, or may be struggling with their work. This confounding is depicted in Figures 1–3 on the right through the bidirected arc between Tutoring Program and GPA. If students are assigned to dormitories at random, the proximity of the student's dorm to the tutoring program is a natural candidate for being an instrumental variable.
However, what if the tutoring program is located in the college library? In that case, Proximity may also cause students to spend more time at the library, which in turn improves their GPA (see Figure 1). Using the causal graph depicted in Figure 2, we see that Proximity does not qualify as an instrumental variable because it is connected to GPA through the path Proximity → Library Hours → GPA in G_{\overline{X}}. However, if we control for Library Hours by adding it as a covariate, then Proximity becomes an instrumental variable, since Proximity is separated from GPA given Library Hours in G_{\overline{X}}.
Now, suppose that we notice that a student's "natural ability" affects his or her number of hours in the library as well as his or her GPA, as in Figure 3. Using the causal graph, we see that Library Hours is a collider and conditioning on it opens the path Proximity → Library Hours ↔ GPA. As a result, Proximity cannot be used as an instrumental variable.
Finally, suppose that Library Hours does not actually affect GPA because students who do not study in the library simply study elsewhere, as in Figure 4. In this case, controlling for Library Hours still opens a spurious path from Proximity to GPA. However, if we do not control for Library Hours and remove it as a covariate, then Proximity can again be used as an instrumental variable.
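The collider logic can be checked numerically. Below is a stripped-down simulation sketch of the Figure 4 structure (the tutoring variable is omitted, and all names and coefficients are our own illustrative assumptions): Proximity and GPA are uncorrelated on their own, but controlling for the collider Library Hours manufactures a spurious association.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Illustrative structural model: proximity is randomized; ability is
# unobserved and affects both library_hours and gpa; library_hours is a
# collider on the path proximity -> library_hours <- ability -> gpa.
proximity = rng.normal(size=n)
ability = rng.normal(size=n)
library_hours = proximity + ability + rng.normal(size=n)
gpa = 0.5 * ability + rng.normal(size=n)   # no direct proximity effect here

# Unconditionally, proximity and gpa are uncorrelated...
print(np.corrcoef(proximity, gpa)[0, 1])   # approx. 0

# ...but conditioning on the collider opens a spurious path. Residualize
# both variables on library_hours (i.e., control for it) and re-check:
def residualize(v, w):
    """Return residuals from the least-squares projection of v on [1, w]."""
    W = np.column_stack([np.ones_like(w), w])
    return v - W @ np.linalg.lstsq(W, v, rcond=None)[0]

r_prox = residualize(proximity, library_hours)
r_gpa = residualize(gpa, library_hours)
print(np.corrcoef(r_prox, r_gpa)[0, 1])    # clearly nonzero (negative)
```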
We now revisit and expand upon the mechanics of IV in greater detail. Suppose the data are generated by a process of the form
y_i = X_i\beta + e_i,

where y_i is the i-th value of the dependent variable, X_i is a vector of the i-th values of the independent variables, e_i is an unobserved error term representing all causes of y_i other than X_i, and \beta is an unknown parameter vector. The parameter vector \beta is the causal effect on y_i of a one-unit change in each element of X_i, holding all other causes of y_i constant. The econometric goal is to estimate \beta.
Suppose also that a regression model of nominally the same form is proposed. Given a random sample of T observations from this process, the ordinary least squares estimator is
\widehat{\beta}_{OLS} = (X^T X)^{-1} X^T y = (X^T X)^{-1} X^T (X\beta + e) = \beta + (X^T X)^{-1} X^T e
where y and e denote column vectors of length T and X is the T × K matrix of regressors. This equation is similar to the equation involving \operatorname{cov}(X,y) in the section above; it is the matrix version of that equation. When X and e are uncorrelated, the final term vanishes in the limit and the estimator is consistent; when they are correlated, OLS is biased and inconsistent for \beta.
To recover the underlying parameter \beta, we introduce a set of variables Z that is highly correlated with each endogenous component of X but (in our underlying model) is not correlated with e.
Suppose that the relationship between each endogenous component x_i and the instruments is given by

x_i = Z_i\gamma + v_i.
The most common IV specification uses the following estimator:
\widehat{\beta}_{IV} = (Z^T X)^{-1} Z^T y
This specification approaches the true parameter as the sample gets large, so long as Z^T e = 0 in the true model:

\widehat{\beta}_{IV} = (Z^T X)^{-1} Z^T y = (Z^T X)^{-1} Z^T X\beta + (Z^T X)^{-1} Z^T e \to \beta.

As long as Z^T e = 0 in the underlying process which generates the data, the appropriate use of the IV estimator will identify this parameter. This works because IV solves for the unique parameter that satisfies Z^T e = 0, and therefore hones in on the true underlying parameter as the sample size grows.
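A minimal numerical sketch of the just-identified estimator follows; the data-generating process, names, and coefficients are illustrative assumptions of ours, not from the source.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 50_000

# Illustrative just-identified setup: one endogenous regressor plus a
# constant, and one excluded instrument plus the constant (K = 2 each).
z1 = rng.normal(size=T)
e = rng.normal(size=T)
x1 = 1.5 * z1 + 0.7 * e + rng.normal(size=T)   # endogenous: correlated with e
y = 1.0 + 2.0 * x1 + e                          # true beta = (1.0, 2.0)

X = np.column_stack([np.ones(T), x1])
Z = np.column_stack([np.ones(T), z1])

beta_iv = np.linalg.solve(Z.T @ X, Z.T @ y)     # (Z'X)^{-1} Z'y
beta_ols = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y, biased here

print("IV: ", beta_iv)    # approx. [1.0, 2.0]
print("OLS:", beta_ols)   # slope pushed away from 2 by cov(x, e)
```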
Now an extension: suppose that there are more instruments than there are covariates in the equation of interest, so that Z is a T × M matrix with M > K. This is often called the over-identified case. In this case, the generalized method of moments (GMM) can be used. The GMM IV estimator is
\widehat{\beta}_{GMM} = (X^T P_Z X)^{-1} X^T P_Z y,

where P_Z refers to the projection matrix P_Z = Z(Z^T Z)^{-1} Z^T.
This expression collapses to the first when the number of instruments is equal to the number of covariates in the equation of interest. The over-identified IV is therefore a generalization of the just-identified IV.
Developing the \widehat{\beta}_{GMM} expression:

\widehat{\beta}_{GMM} = (X^T Z (Z^T Z)^{-1} Z^T X)^{-1} X^T Z (Z^T Z)^{-1} Z^T y
In the just-identified case, we have as many instruments as covariates, so that the dimension of X is the same as that of Z. Hence, X^T Z, Z^T Z and Z^T X are all square matrices of the same dimension. We can expand the inverse using the fact that (AB)^{-1} = B^{-1}A^{-1} for any invertible matrices A and B:
\begin{align}
\widehat{\beta}_{GMM} &= (Z^T X)^{-1}(Z^T Z)(X^T Z)^{-1} X^T Z (Z^T Z)^{-1} Z^T y \\
&= (Z^T X)^{-1}(Z^T Z)(Z^T Z)^{-1} Z^T y \\
&= (Z^T X)^{-1} Z^T y \\
&= \widehat{\beta}_{IV}
\end{align}
There is an equivalent under-identified estimator for the case where m < k. Since the parameters are the solutions to a set of linear equations, an under-identified model using the set of equations Z^T v = 0 does not have a unique solution.
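A sketch of the over-identified GMM estimator on assumed, illustrative data; note that P_Z X is formed without materializing the T × T projection matrix.

```python
import numpy as np

rng = np.random.default_rng(3)
T = 50_000

# Illustrative over-identified setup: two excluded instruments (M = 3
# columns of Z with the constant) for one endogenous regressor (K = 2).
z1, z2 = rng.normal(size=T), rng.normal(size=T)
e = rng.normal(size=T)
x1 = z1 + 0.5 * z2 + 0.7 * e + rng.normal(size=T)
y = 1.0 + 2.0 * x1 + e

X = np.column_stack([np.ones(T), x1])
Z = np.column_stack([np.ones(T), z1, z2])

# beta_GMM = (X' P_Z X)^{-1} X' P_Z y with P_Z = Z (Z'Z)^{-1} Z'.
PZX = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)     # P_Z X, computed column-wise
beta_gmm = np.linalg.solve(PZX.T @ X, PZX.T @ y)

print(beta_gmm)   # approx. [1.0, 2.0]
```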
One computational method which can be used to calculate IV estimates is two-stage least squares (2SLS or TSLS). In the first stage, each explanatory variable that is an endogenous covariate in the equation of interest is regressed on all of the exogenous variables in the model, including both exogenous covariates in the equation of interest and the excluded instruments. The predicted values from these regressions are obtained:
Stage 1: Regress each column of X on Z (X = Z\delta + \text{errors}):

\widehat{\delta} = (Z^T Z)^{-1} Z^T X,

and save the predicted values:

\widehat{X} = Z\widehat{\delta} = Z(Z^T Z)^{-1} Z^T X = P_Z X.
In the second stage, the regression of interest is estimated as usual, except that in this stage each endogenous covariate is replaced with the predicted values from the first stage:
Stage 2: Regress Y on the predicted values from the first stage:
Y = \widehat{X}\beta + \text{noise},

which gives

\widehat{\beta}_{2SLS} = \left(X^T P_Z X\right)^{-1} X^T P_Z Y.
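The two stages can be traced in a few lines. The data below are illustrative assumptions of ours, and the final comparison checks numerically that the two-stage route coincides with the one-shot formula (X^T P_Z X)^{-1} X^T P_Z Y.

```python
import numpy as np

rng = np.random.default_rng(4)
T = 50_000

# Illustrative data: one endogenous regressor, two excluded instruments.
z1, z2 = rng.normal(size=T), rng.normal(size=T)
e = rng.normal(size=T)
x1 = z1 + 0.5 * z2 + 0.7 * e + rng.normal(size=T)
y = 1.0 + 2.0 * x1 + e
X = np.column_stack([np.ones(T), x1])
Z = np.column_stack([np.ones(T), z1, z2])

# Stage 1: regress each column of X on Z, save fitted values X_hat = P_Z X.
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)

# Stage 2: OLS of y on the stage-1 fitted values.
beta_2sls = np.linalg.solve(X_hat.T @ X_hat, X_hat.T @ y)

# One-shot formula: X_hat' X = X' P_Z X and X_hat' y = X' P_Z y.
beta_direct = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)

print(beta_2sls, beta_direct)   # both approx. [1.0, 2.0], equal up to rounding
```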
This method is only valid in linear models. For categorical endogenous covariates, one might be tempted to use a different first stage than ordinary least squares, such as a probit model for the first stage followed by OLS for the second. This is commonly known in the econometric literature as the forbidden regression,[15] because second-stage IV parameter estimates are consistent only in special cases.[16]
The usual OLS estimator is (\widehat{X}^T\widehat{X})^{-1}\widehat{X}^T Y. Replacing \widehat{X} = P_Z X and noting that P_Z is a symmetric and idempotent matrix, so that P_Z^T P_Z = P_Z P_Z = P_Z,

\widehat{\beta}_{2SLS} = (\widehat{X}^T\widehat{X})^{-1}\widehat{X}^T Y = \left(X^T P_Z^T P_Z X\right)^{-1} X^T P_Z^T Y = \left(X^T P_Z X\right)^{-1} X^T P_Z Y.
The resulting estimator of \beta is numerically identical to the expression displayed above. A small correction must be made to the sum of squared residuals in the second-stage fitted regression in order that the covariance matrix of \beta is calculated correctly.
When the form of the structural equations is unknown, an instrumental variable Z can still be defined through the equations

x = g(z,u)
y = f(x,u)

where f and g are two arbitrary functions and Z is independent of U. Unlike linear models, however, measurements of Z, X and Y do not allow for the identification of the average causal effect of X on Y, denoted ACE:

ACE = \Pr(y \mid \operatorname{do}(x)) = \operatorname{E}_u[f(x,u)].

Balke and Pearl (1997) derived tight bounds on the ACE and showed that these can provide valuable information on the sign and size of the effect.[14]
In linear analysis, there is no test to falsify the assumption that Z is instrumental relative to the pair (X, Y). This is not the case when X is discrete. Pearl (2000) has shown that, for all f and g, the following constraint, called the "Instrumental Inequality", must hold whenever Z satisfies the two equations above:

\max_x \sum_y \left[\max_z \Pr(y,x \mid z)\right] \le 1.
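A small sketch of how the Instrumental Inequality can be checked on discrete data; the probability table below is made up purely for illustration.

```python
import numpy as np

# p[z, x, y] holds the conditional probabilities Pr(y, x | z) for binary
# Z, X, Y; the numbers are illustrative, not from any real dataset.
p = np.array([
    # z = 0:      y=0   y=1
    [[0.40, 0.20],    # x = 0
     [0.30, 0.10]],   # x = 1
    # z = 1:
    [[0.10, 0.10],
     [0.20, 0.60]],
])

# Each slice p[z] must be a proper joint distribution over (x, y).
assert np.allclose(p.sum(axis=(1, 2)), 1.0)

# max over x of  sum over y of  max over z of  Pr(y, x | z)
stat = p.max(axis=0).sum(axis=1).max()
print(stat)   # if this exceeds 1, Z cannot be a valid instrument
```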
The exposition above assumes that the causal effect of interest does not vary across observations, that is, that \beta is a constant. Generally, different subjects respond in different ways to changes in the "treatment" x, so the average effect in the population of a change in x on y may differ from the effect in a given subpopulation.
The standard IV estimator can recover local average treatment effects (LATE) rather than average treatment effects (ATE). Imbens and Angrist (1994) demonstrate that the linear IV estimate can be interpreted under weak conditions as a weighted average of local average treatment effects, where the weights depend on the elasticity of the endogenous regressor to changes in the instrumental variables. Roughly, that means that the effect of a variable is only revealed for the subpopulations affected by the observed changes in the instruments, and that subpopulations which respond most to changes in the instruments will have the largest effects on the magnitude of the IV estimate.
For example, if a researcher uses presence of a land-grant college as an instrument for college education in an earnings regression, she identifies the effect of college on earnings in the subpopulation which would obtain a college degree if a college is present but which would not obtain a degree if a college is not present. This empirical approach does not, without further assumptions, tell the researcher anything about the effect of college among people who would either always or never get a college degree regardless of whether a local college exists.
As Bound, Jaeger, and Baker (1995) note, a problem is caused by the selection of "weak" instruments, instruments that are poor predictors of the endogenous question predictor in the first-stage equation.[19] In this case, the prediction of the question predictor by the instrument will be poor and the predicted values will have very little variation. Consequently, they are unlikely to have much success in predicting the ultimate outcome when they are used to replace the question predictor in the second-stage equation.
In the context of the smoking and health example discussed above, tobacco taxes are weak instruments for smoking if smoking status is largely unresponsive to changes in taxes. If higher taxes do not induce people to quit smoking (or not start smoking), then variation in tax rates tells us nothing about the effect of smoking on health. If taxes affect health through channels other than through their effect on smoking, then the instruments are invalid and the instrumental variables approach may yield misleading results. For example, places and times with relatively health-conscious populations may both implement high tobacco taxes and exhibit better health even holding smoking rates constant, so we would observe a correlation between health and tobacco taxes even if it were the case that smoking has no effect on health. In this case, we would be mistaken to infer a causal effect of smoking on health from the observed correlation between tobacco taxes and health.
The strength of the instruments can be directly assessed because both the endogenous covariates and the instruments are observable.[20] A common rule of thumb for models with one endogenous regressor is: the F-statistic against the null that the excluded instruments are irrelevant in the first-stage regression should be larger than 10.
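The rule of thumb can be illustrated with a hand-rolled first-stage F-statistic; the data-generating process below is an illustrative assumption with a deliberately weak instrument.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 5_000

# Regress the endogenous variable x on a constant and the excluded
# instrument z, then form the F-statistic against the null that the
# instrument's coefficient is zero.
u = rng.normal(size=n)
z = rng.normal(size=n)
x = 0.05 * z + u + rng.normal(size=n)   # weak first stage by construction

def rss(y, W):
    """Residual sum of squares from OLS of y on W."""
    resid = y - W @ np.linalg.lstsq(W, y, rcond=None)[0]
    return resid @ resid

W_restricted = np.ones((n, 1))                # constant only
W_full = np.column_stack([np.ones(n), z])     # constant + instrument

q = 1                                          # number of excluded instruments
k = W_full.shape[1]
F = ((rss(x, W_restricted) - rss(x, W_full)) / q) / (rss(x, W_full) / (n - k))
print(F)   # with this weak first stage, F usually falls below 10
```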
When the covariates are exogenous, the small-sample properties of the OLS estimator can be derived in a straightforward manner by calculating moments of the estimator conditional on X. When some of the covariates are endogenous so that instrumental variables estimation is implemented, simple expressions for the moments of the estimator cannot be so obtained. Generally, instrumental variables estimators only have desirable asymptotic, not finite sample, properties, and inference is based on asymptotic approximations to the sampling distribution of the estimator. Even when the instruments are uncorrelated with the error in the equation of interest and when the instruments are not weak, the finite sample properties of the instrumental variables estimator may be poor. For example, exactly identified models produce finite sample estimators with no moments, so the estimator can be said to be neither biased nor unbiased, the nominal size of test statistics may be substantially distorted, and the estimates may commonly be far away from the true value of the parameter.[21]
The assumption that the instruments are not correlated with the error term in the equation of interest is not testable in exactly identified models. If the model is overidentified, there is information available which may be used to test this assumption. The most common test of these overidentifying restrictions, called the Sargan–Hansen test, is based on the observation that the residuals should be uncorrelated with the set of exogenous variables if the instruments are truly exogenous.[22] The Sargan–Hansen test statistic can be calculated as
TR^2 (the number of observations multiplied by the coefficient of determination) from the OLS regression of the residuals onto the set of exogenous variables. This statistic is asymptotically chi-squared with m − k degrees of freedom under the null that the error term is uncorrelated with the instruments.
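A sketch of the Sargan statistic on simulated data (all numbers and names are illustrative assumptions): compute the 2SLS residuals, regress them on all exogenous variables, and compare TR^2 with a chi-squared distribution with m − k degrees of freedom.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)
T = 5_000

# Over-identified model: two instruments (m = 3 columns of Z with the
# constant) for one endogenous regressor (k = 2 columns of X).
z1, z2 = rng.normal(size=T), rng.normal(size=T)
e = rng.normal(size=T)
x1 = z1 + 0.5 * z2 + 0.7 * e + rng.normal(size=T)
y = 1.0 + 2.0 * x1 + e
X = np.column_stack([np.ones(T), x1])
Z = np.column_stack([np.ones(T), z1, z2])

# 2SLS residuals.
X_hat = Z @ np.linalg.solve(Z.T @ Z, Z.T @ X)
beta = np.linalg.solve(X_hat.T @ X, X_hat.T @ y)
resid = y - X @ beta

# Regress the residuals on all exogenous variables and form T * R^2.
fitted = Z @ np.linalg.lstsq(Z, resid, rcond=None)[0]
r2 = fitted.var() / resid.var()
sargan = T * r2
df = Z.shape[1] - X.shape[1]               # m - k overidentifying restrictions
print(sargan, stats.chi2.sf(sargan, df))   # large p-value: no evidence against validity
```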