Two-step M-estimators deal with M-estimation problems that require a preliminary estimation step to obtain the parameter of interest. Two-step M-estimation differs from the usual M-estimation problem because the asymptotic distribution of the second-step estimator generally depends on the first-step estimator. Accounting for this change in the asymptotic distribution is important for valid inference.
The class of two-step M-estimators includes Heckman's sample selection estimator,[1] weighted non-linear least squares, and ordinary least squares with generated regressors.[2]
To fix ideas, let $\{W_i\}_{i=1}^{n} \subseteq \mathbb{R}^{d}$ be an i.i.d. sample, and let $\Theta \subseteq \mathbb{R}^{p}$ and $\Gamma \subseteq \mathbb{R}^{q}$ be the parameter spaces for the parameter of interest and the nuisance parameter, respectively. Given a function $m(\cdot,\cdot,\cdot) : \mathbb{R}^{d} \times \Theta \times \Gamma \to \mathbb{R}$, the two-step M-estimator $\hat\theta$ is defined as

$$\hat\theta := \underset{\theta \in \Theta}{\operatorname{argmax}}\ \frac{1}{n}\sum_{i=1}^{n} m(W_i, \theta, \hat\gamma)$$
where $\hat\gamma$ is an estimate of the nuisance parameter $\gamma \in \Gamma$ obtained in a preliminary first step.
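As a concrete illustration, the following sketch carries out the two steps for ordinary least squares with a generated regressor, one of the estimators listed above. The data-generating process, variable names, and the use of NumPy/SciPy are illustrative assumptions, not part of the general theory; each step maximizes a sample-average objective (here by minimizing the squared error, i.e. maximizing $m = -(\text{residual})^2$).

import numpy as np
from scipy.optimize import minimize

# Illustrative data (assumed for this sketch only)
rng = np.random.default_rng(0)
n = 1000
z = rng.normal(size=n)
x = 1.0 + 2.0 * z + rng.normal(size=n)                  # first-step relationship E[x|z] = 1 + 2z
y = 0.5 + 1.5 * (1.0 + 2.0 * z) + rng.normal(size=n)    # outcome depends on E[x|z]

# First step: estimate the nuisance parameter gamma = (a, b) in E[x|z] = a + b*z
# by minimizing a sample-average squared-error objective (an M-estimation problem).
def first_step_obj(gamma):
    a, b = gamma
    return np.mean((x - a - b * z) ** 2)

gamma_hat = minimize(first_step_obj, x0=np.zeros(2)).x

# Second step: plug the generated regressor xhat(gamma_hat) into the second-step
# objective and optimize over theta = (c, d).
xhat = gamma_hat[0] + gamma_hat[1] * z

def second_step_obj(theta):
    c, d = theta
    return np.mean((y - c - d * xhat) ** 2)

theta_hat = minimize(second_step_obj, x0=np.zeros(2)).x
print("first-step estimate :", gamma_hat)
print("second-step estimate:", theta_hat)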
Consistency of two-step M-estimators can be verified by checking consistency conditions for usual M-estimators, although some modification might be necessary. In practice, the important condition to check is the identification condition.[2] If
$\hat\gamma \to_{p} \gamma^*$ for some non-random vector $\gamma^*$, then the identification condition is that $E[m(W_1,\theta,\gamma^*)]$ has a unique maximizer over $\Theta$.
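For instance (an illustrative case, not taken from the cited references), in least squares with a generated regressor one has $m(W,\theta,\gamma^*) = -\big(Y - \theta_1 - \theta_2\, g(X,\gamma^*)\big)^{2}$ with $W=(Y,X)$, and $E[m(W_1,\theta,\gamma^*)]$ has a unique maximizer over $\Theta = \mathbb{R}^{2}$ as long as $g(X,\gamma^*)$ is not almost surely constant, so the identification condition holds.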
Under regularity conditions, two-step M-estimators have asymptotic normality. An important point to note is that the asymptotic variance of a two-step M-estimator is generally not the same as that of the usual M-estimator in which the first step estimation is not necessary.[3] This fact is intuitive because
$\hat\gamma$ is a random object and its variability should propagate into the estimation of $\theta$. However, there is a special case in which the asymptotic variance of the two-step M-estimator takes the same form as if no first-step estimation were needed. This special case occurs when

$$E\left[\frac{\partial^{2}}{\partial\theta\,\partial\gamma}\, m(W_1,\theta_0,\gamma^*)\right]=0,$$

where $\theta_0$ is the true value of $\theta$ and $\gamma^*$ is the probability limit of $\hat\gamma$. To interpret this condition, first note that under regularity conditions $E\left[\frac{\partial}{\partial\theta}\, m(W_1,\theta_0,\gamma^*)\right]=0$, since $\theta_0$ maximizes $E[m(W_1,\theta,\gamma^*)]$. The condition above therefore means that a small perturbation of $\gamma$ has no effect on the first-order condition, so in large samples the variability of $\hat\gamma$ does not affect the maximizer of the objective function, which explains why the asymptotic variance is unchanged.
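For example (a standard illustration sketched here under the assumption that the conditional mean is correctly specified and the weight function is strictly positive), consider weighted non-linear least squares, one of the estimators mentioned above, with

$$m(W,\theta,\gamma) = -\frac{\big(Y-\mu(X,\theta)\big)^{2}}{h(X,\gamma)},$$

where $W=(Y,X)$, $\mu$ is the conditional mean function and $h$ is the weight function. Differentiating gives

$$\frac{\partial^{2}}{\partial\theta\,\partial\gamma}\, m(W,\theta,\gamma) = -\frac{2\,\big(Y-\mu(X,\theta)\big)\,\nabla_{\theta}\mu(X,\theta)\,\nabla_{\gamma}h(X,\gamma)^{T}}{h(X,\gamma)^{2}}.$$

If $E[Y \mid X]=\mu(X,\theta_0)$, then $E[\,Y-\mu(X,\theta_0) \mid X\,]=0$, so by the law of iterated expectations the expectation of this cross derivative at $(\theta_0,\gamma^*)$ is zero, and the first-step estimation of the weights does not change the asymptotic variance.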
When the first step is a maximum likelihood estimator, under some assumptions the two-step M-estimator is asymptotically more efficient (i.e. has smaller asymptotic variance) than the M-estimator with the first-step parameter known. Consistency and asymptotic normality of the estimator follow from the general result on two-step M-estimators.[4]
Let $\{(V_i, W_i, Z_i)\}_{i=1}^{n}$ be a random sample and define the second-step M-estimator $\widehat{\theta}$ as

$$\widehat{\theta} := \underset{\theta\in\Theta}{\operatorname{argmax}} \sum_{i=1}^{n} m(v_i, w_i, z_i; \theta, \widehat{\gamma})$$

where $\widehat{\gamma}$ is the maximum likelihood estimator computed in the first step:

$$\widehat{\gamma} := \underset{\gamma\in\Gamma}{\operatorname{argmax}} \sum_{i=1}^{n} \log f(v_i \mid z_i, \gamma)$$
where f is the conditional density of V given Z. Now, suppose that given Z, V is conditionally independent of W. This is called the conditional independence assumption or selection on observables.[5] Intuitively, this condition means that Z is a good predictor of V so that once conditioned on Z, V has no systematic dependence on W. Under the conditional independence assumption, the asymptotic variance of the two-step estimator is:
$$E[\nabla_{\theta} s(\theta_0,\gamma_0)]^{-1}\, E[g(\theta_0,\gamma_0)\, g(\theta_0,\gamma_0)^{T}]\, E[\nabla_{\theta} s(\theta_0,\gamma_0)]^{-1}$$
where
$$\begin{align} g(\theta,\gamma) &:= s(\theta,\gamma) - E[s(\theta,\gamma)\, d(\gamma)^{T}]\, E[d(\gamma)\, d(\gamma)^{T}]^{-1} d(\gamma)\\ s(\theta,\gamma) &:= \nabla_{\theta}\, m(V,W,Z; \theta,\gamma)\\ d(\gamma) &:= \nabla_{\gamma} \log f(V \mid Z, \gamma) \end{align}$$
and $\nabla$ denotes the partial derivative with respect to a row vector. In the case where $\gamma_0$ is known, the asymptotic variance is
$$E[\nabla_{\theta} s(\theta_0,\gamma_0)]^{-1}\, E[s(\theta_0,\gamma_0)\, s(\theta_0,\gamma_0)^{T}]\, E[\nabla_{\theta} s(\theta_0,\gamma_0)]^{-1}$$
and therefore, unless $E[s(\theta_0,\gamma_0)\, d(\gamma_0)^{T}]=0$, the two-step M-estimator is more efficient than the M-estimator with the first-step parameter known.
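A minimal sketch (illustrative only) of how sample analogues of the two variance formulas above might be assembled: the score matrices s (the $n \times p$ matrix of second-step scores $s_i$), d (the $n \times q$ matrix of first-step scores $d_i$) and the Hessian estimate A (an estimate of $E[\nabla_{\theta} s(\theta_0,\gamma_0)]$) are assumed to have been computed elsewhere and are placeholders here.

import numpy as np

def two_step_avar(s, d, A):
    # Sandwich variance with the first-step MLE correction:
    # g_i = s_i - E[s d^T] E[d d^T]^{-1} d_i,  Avar = A^{-1} E[g g^T] A^{-1}.
    n = s.shape[0]
    E_sd = s.T @ d / n                         # sample analogue of E[s d^T]  (p x q)
    E_dd = d.T @ d / n                         # sample analogue of E[d d^T]  (q x q)
    g = s - d @ np.linalg.solve(E_dd, E_sd.T)  # rows: g_i = s_i - E[s d^T] E[d d^T]^{-1} d_i
    E_gg = g.T @ g / n                         # sample analogue of E[g g^T]  (p x p)
    A_inv = np.linalg.inv(A)
    return A_inv @ E_gg @ A_inv.T              # divide by n to approximate Var(theta_hat)

def known_gamma_avar(s, A):
    # Variance when the first-step parameter is known: E[g g^T] is replaced by E[s s^T].
    n = s.shape[0]
    A_inv = np.linalg.inv(A)
    return A_inv @ (s.T @ s / n) @ A_inv.T

Since $E[g\, g^{T}] = E[s\, s^{T}] - E[s\, d^{T}]\, E[d\, d^{T}]^{-1} E[d\, s^{T}]$, the corrected variance is never larger than the uncorrected one, which is the efficiency gain described above.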