In econometrics and statistics, the generalized method of moments (GMM) is a generic method for estimating parameters in statistical models. Usually it is applied in the context of semiparametric models, where the parameter of interest is finite-dimensional, whereas the full shape of the data's distribution function may not be known, and therefore maximum likelihood estimation is not applicable.
The method requires that a certain number of moment conditions be specified for the model. These moment conditions are functions of the model parameters and the data, such that their expectation is zero at the parameters' true values. The GMM method then minimizes a certain norm of the sample averages of the moment conditions, and can therefore be thought of as a special case of minimum-distance estimation.[1]
The GMM estimators are known to be consistent, asymptotically normal, and most efficient in the class of all estimators that do not use any extra information aside from that contained in the moment conditions. GMM was advocated by Lars Peter Hansen in 1982 as a generalization of the method of moments,[2] introduced by Karl Pearson in 1894. However, these estimators are mathematically equivalent to those based on "orthogonality conditions" (Sargan, 1958, 1959) or "unbiased estimating equations" (Huber, 1967; Wang et al., 1997).
Suppose the available data consist of T observations, where each observation Yt is an n-dimensional multivariate random variable. We assume that the data come from a certain statistical model, defined up to an unknown parameter θ ∈ Θ. The goal of the estimation problem is to find the “true” value of this parameter, θ0, or at least a reasonably close estimate.
A general assumption of GMM is that the data Yt be generated by a weakly stationary ergodic stochastic process. (The case of independent and identically distributed (iid) variables Yt is a special case of this condition.)
In order to apply GMM, we need to have "moment conditions", that is, we need to know a vector-valued function g(Y,θ) such that
m(\theta_0) \equiv \operatorname{E}[g(Y_t,\theta_0)] = 0,
where E denotes expectation and Y_t is a generic observation.
The basic idea behind GMM is to replace the theoretical expected value E[⋅] with its empirical analog, the sample average:
\hat{m}(\theta) \equiv \frac{1}{T}\sum_{t=1}^{T} g(Y_t,\theta)
By the law of large numbers, \hat{m}(\theta) \approx \operatorname{E}[g(Y_t,\theta)] = m(\theta) for large values of T, and thus we expect that \hat{m}(\theta_0) \approx m(\theta_0) = 0. The generalized method of moments looks for a number \hat\theta which would make \hat{m}(\hat\theta) as close to zero as possible. Mathematically, this is equivalent to minimizing a certain norm of \hat{m}(\theta) (the norm of m, denoted ||m||, measures the distance between m and zero). The properties of the resulting estimator will depend on the particular choice of the norm function, and therefore the theory of GMM considers an entire family of norms, defined as
\|\hat{m}(\theta)\|_W^2 = \hat{m}(\theta)^T \, W \, \hat{m}(\theta),
where W is a positive semi-definite weighting matrix and m^T denotes the transpose of m.
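In practice the weighting matrix is computed from the available data and is denoted \hat{W}, so the GMM estimator can be written as
\hat\theta = \operatorname{arg\,min}_{\theta\in\Theta} \left(\frac{1}{T}\sum_{t=1}^{T} g(Y_t,\theta)\right)^T \hat{W} \left(\frac{1}{T}\sum_{t=1}^{T} g(Y_t,\theta)\right)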
Under suitable conditions this estimator is consistent, asymptotically normal, and, with the right choice of weighting matrix, also asymptotically efficient.
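As a concrete illustration, the sketch below evaluates the sample moments and the quadratic-form objective for linear moment conditions g(Y_t, β) = z_t (y_t − x_t^T β). This linear instrumental-variables setting, the simulated data, the identity weighting matrix, and the function names `sample_moments`, `gmm_objective` and `linear_gmm` are all assumptions made for the example, not part of the text above. The setting is chosen only because the objective is then quadratic in β, so its minimizer has a closed form and no numerical optimizer is needed.

```python
import numpy as np

def sample_moments(beta, y, X, Z):
    """Sample analog m_hat(beta) = (1/T) * sum_t z_t * (y_t - x_t' beta)."""
    return Z.T @ (y - X @ beta) / len(y)

def gmm_objective(beta, y, X, Z, W):
    """Quadratic form m_hat(beta)' W m_hat(beta)."""
    m = sample_moments(beta, y, X, Z)
    return float(m @ W @ m)

def linear_gmm(y, X, Z, W):
    """Minimizer of the objective when the moments are linear in beta.

    The first-order condition Szx' W (Szy - Szx beta) = 0 is solved directly
    (W is assumed symmetric).
    """
    Szx = Z.T @ X / len(y)
    Szy = Z.T @ y / len(y)
    return np.linalg.solve(Szx.T @ W @ Szx, Szx.T @ W @ Szy)

# Simulated data (hypothetical): one endogenous regressor, three instruments.
rng = np.random.default_rng(0)
T = 5000
Z = rng.normal(size=(T, 3))
u = rng.normal(size=T)                      # structural error
x = Z @ np.array([1.0, 0.5, -0.5]) + 0.8 * u + rng.normal(size=T)
X = x[:, None]
beta_true = np.array([2.0])
y = X @ beta_true + u

W = np.eye(3)                               # an arbitrary valid weighting matrix
beta_hat = linear_gmm(y, X, Z, W)
print("estimate:", beta_hat)
print("objective at estimate:", gmm_objective(beta_hat, y, X, Z, W))
```

With three moment conditions and one parameter the objective cannot in general be driven exactly to zero, which is why a norm of the sample moments is minimized instead.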
Consistency is a statistical property of an estimator stating that, given a sufficient number of observations, the estimator will converge in probability to the true value of the parameter:
\hat\theta \xrightarrow{p} \theta_0 \quad \text{as } T \to \infty.
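Sufficient conditions for a GMM estimator to be consistent are as follows:
1. \hat{W}_T \xrightarrow{p} W, where W is a positive semi-definite matrix,
2. W\operatorname{E}[g(Y_t,\theta)] = 0 only for \theta = \theta_0,
3. the set of possible parameters \Theta \subset \mathbb{R}^k is compact,
4. g(Y,\theta) is continuous at each \theta with probability one,
5. \operatorname{E}[\sup_{\theta\in\Theta} \lVert g(Y,\theta) \rVert] < \infty.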
The second condition here (the so-called global identification condition) is often particularly hard to verify. There exist simpler necessary but not sufficient conditions, which may be used to detect a non-identification problem:
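Order condition: the dimension of the moment function m(\theta) should be at least as large as the dimension of the parameter vector \theta.
Local identification: if g(Y,\theta) is continuously differentiable in a neighborhood of \theta_0, then the matrix W\operatorname{E}[\nabla_\theta g(Y_t,\theta_0)] must have full column rank.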
In practice applied econometricians often simply assume that global identification holds, without actually proving it.[3]
Asymptotic normality is a useful property, as it allows us to construct confidence bands for the estimator, and conduct different tests. Before we can make a statement about the asymptotic distribution of the GMM estimator, we need to define two auxiliary matrices:
G = \operatorname{E}[\nabla_\theta g(Y_t,\theta_0)], \qquad \Omega = \operatorname{E}[g(Y_t,\theta_0)\, g(Y_t,\theta_0)^T]
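Then, under the conditions listed below, the GMM estimator is asymptotically normal with limiting distribution
\sqrt{T}(\hat\theta - \theta_0) \xrightarrow{d} \mathcal{N}[0, (G^T W G)^{-1} G^T W \Omega W^T G (G^T W^T G)^{-1}].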
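Conditions:
1. \hat\theta is consistent (see the conditions for consistency above),
2. the set of possible parameters \Theta \subset \mathbb{R}^k is compact,
3. g(Y,\theta) is continuously differentiable in some neighborhood N of \theta_0 with probability one,
4. \operatorname{E}[\lVert g(Y_t,\theta) \rVert^2] < \infty,
5. \operatorname{E}[\sup_{\theta\in N} \lVert \nabla_\theta g(Y_t,\theta) \rVert] < \infty,
6. the matrix G^T W G is nonsingular.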
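The limiting variance in the asymptotic-normality result has a "sandwich" form that is straightforward to evaluate once G, W and Ω are available. The sketch below does so for made-up numerical values with three moment conditions and one parameter; the function name `gmm_asymptotic_variance`, the matrices and the sample size are hypothetical, and in practice G and Ω would have to be replaced by sample estimates.

```python
import numpy as np

def gmm_asymptotic_variance(G, W, Omega):
    """Sandwich form (G'WG)^{-1} G'W Omega W'G (G'W'G)^{-1}."""
    bread_left = np.linalg.inv(G.T @ W @ G)
    bread_right = np.linalg.inv(G.T @ W.T @ G)
    meat = G.T @ W @ Omega @ W.T @ G
    return bread_left @ meat @ bread_right

# Hypothetical values: k = 3 moment conditions, one parameter.
G = np.array([[1.0], [0.5], [-0.5]])        # expected Jacobian of the moments
Omega = np.array([[2.0, 0.3, 0.0],
                  [0.3, 1.0, 0.2],
                  [0.0, 0.2, 1.5]])         # covariance of the moments
W = np.eye(3)                               # identity weighting matrix

V = gmm_asymptotic_variance(G, W, Omega)

# Since sqrt(T) * (theta_hat - theta_0) has limiting variance V, an approximate
# standard error for a sample of size T is sqrt(diag(V) / T).
T = 500
std_err = np.sqrt(np.diag(V) / T)
print(V, std_err)
```

A confidence interval for each component of θ can then be formed from the estimate plus or minus a normal quantile times the corresponding standard error.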
So far we have said nothing about the choice of matrix W, except that it must be positive semi-definite. In fact any such matrix will produce a consistent and asymptotically normal GMM estimator; the only difference will be in the asymptotic variance of that estimator. It can be shown that taking
W \propto \Omega^{-1}
will result in the most efficient estimator in the class of all generalized method of moments estimators based on these moment conditions.
In this case the formula for the asymptotic distribution of the GMM estimator simplifies to
\sqrt{T}(\hat\theta - \theta_0) \xrightarrow{d} \mathcal{N}[0, (G^T \Omega^{-1} G)^{-1}]
The proof that such a choice of weighting matrix is indeed locally optimal is often adopted with slight modifications when establishing efficiency of other estimators. As a rule of thumb, a weighting matrix inches closer to optimality when it turns into an expression closer to the Cramér–Rao bound.
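Proof. We will consider the difference between the asymptotic variance with arbitrary W and the asymptotic variance with W = \Omega^{-1}. If this difference can be factored into a symmetric product of the form CC^T for some matrix C, then it is guaranteed to be nonnegative-definite, and thus W = \Omega^{-1} is optimal by definition.
\begin{align}
V(W) - V(\Omega^{-1}) &= (G^T W G)^{-1} G^T W \Omega W G (G^T W G)^{-1} - (G^T \Omega^{-1} G)^{-1} \\
&= (G^T W G)^{-1} \left( G^T W \Omega W G - G^T W G (G^T \Omega^{-1} G)^{-1} G^T W G \right) (G^T W G)^{-1} \\
&= (G^T W G)^{-1} G^T W \Omega^{1/2} \left( I - \Omega^{-1/2} G (G^T \Omega^{-1} G)^{-1} G^T \Omega^{-1/2} \right) \Omega^{1/2} W G (G^T W G)^{-1} \\
&= A (I - B) A^T,
\end{align}
where we introduced the matrices A = (G^T W G)^{-1} G^T W \Omega^{1/2} and B = \Omega^{-1/2} G (G^T \Omega^{-1} G)^{-1} G^T \Omega^{-1/2} in order to slightly simplify notation; I is an identity matrix. We can see that the matrix B here is symmetric and idempotent: B^2 = B. This means that I - B is symmetric and idempotent as well: I - B = (I - B)(I - B)^T. Thus we can continue to factor the previous expression as
A (I - B) A^T = A (I - B)(I - B)^T A^T = (A(I - B)) (A(I - B))^T \geq 0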
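The algebra in the proof can be checked numerically. The sketch below draws an arbitrary full-rank G, a positive-definite Ω and a symmetric positive-definite W (all randomly generated, purely hypothetical inputs), builds the matrices A and B from the proof, and verifies that B is idempotent, that the variance difference equals A(I − B)A^T, and that this difference has no negative eigenvalues. It is a sanity check under these assumptions, not a substitute for the proof.

```python
import numpy as np

def sym_sqrt(M):
    """Symmetric square root of a symmetric positive-definite matrix."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

rng = np.random.default_rng(1)
k, l = 4, 2                                  # moment conditions, parameters
G = rng.normal(size=(k, l))
R = rng.normal(size=(k, k))
Omega = R @ R.T + k * np.eye(k)              # symmetric positive definite
S = rng.normal(size=(k, k))
W = S @ S.T + np.eye(k)                      # arbitrary symmetric weighting matrix

Om_half = sym_sqrt(Omega)
Om_inv = np.linalg.inv(Omega)
Om_inv_half = np.linalg.inv(Om_half)
GWG_inv = np.linalg.inv(G.T @ W @ G)

A = GWG_inv @ G.T @ W @ Om_half
B = Om_inv_half @ G @ np.linalg.inv(G.T @ Om_inv @ G) @ G.T @ Om_inv_half

V_W = GWG_inv @ G.T @ W @ Omega @ W @ G @ GWG_inv    # variance with arbitrary W
V_opt = np.linalg.inv(G.T @ Om_inv @ G)              # variance with W = Omega^{-1}
diff = V_W - V_opt

b_is_idempotent = np.allclose(B @ B, B)
factorization_holds = np.allclose(diff, A @ (np.eye(k) - B) @ A.T)
min_eigenvalue = np.linalg.eigvalsh((diff + diff.T) / 2).min()
print(b_is_idempotent, factorization_holds, min_eigenvalue)
```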
One difficulty with implementing the outlined method is that we cannot take W = \Omega^{-1} because, by the definition of matrix Ω, we need to know the value of θ0 in order to compute this matrix, and θ0 is precisely the quantity we do not know and are trying to estimate in the first place. In the case of Yt being iid we can estimate W as
\hat{W}_T(\hat\theta) = \left(\frac{1}{T}\sum_{t=1}^{T} g(Y_t,\hat\theta)\, g(Y_t,\hat\theta)^T\right)^{-1}.
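Several approaches exist to deal with this issue, the first one being the most popular:
Two-step feasible GMM: first take W = I (the identity matrix) or some other positive-definite matrix and compute a preliminary estimate of \theta, which is consistent although not efficient; then use that estimate to compute \hat{W}_T, which converges in probability to \Omega^{-1}, and re-estimate \theta with this weighting matrix.
Iterated GMM: essentially the same procedure, except that \hat{W}_T is recalculated and \theta re-estimated repeatedly until the estimates converge.
Continuously updating GMM: \hat\theta is estimated simultaneously with the weighting matrix, by minimizing the objective with \hat{W}_T(\theta) evaluated at the same \theta.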
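One common way around the circularity is a two-step procedure: estimate θ with a simple weighting matrix, use that preliminary estimate to build the iid weighting-matrix estimate given above, and then re-estimate. The sketch below does this for linear moment conditions g(Y_t, β) = z_t (y_t − x_t^T β), where each step has a closed form. The model, the simulated heteroskedastic data and the function names are assumptions made for illustration only.

```python
import numpy as np

def linear_gmm(y, X, Z, W):
    """Closed-form GMM estimate for moments g_t = z_t * (y_t - x_t' beta)."""
    Szx = Z.T @ X / len(y)
    Szy = Z.T @ y / len(y)
    return np.linalg.solve(Szx.T @ W @ Szx, Szx.T @ W @ Szy)

def weight_from_estimate(beta, y, X, Z):
    """W_hat = ((1/T) * sum_t g_t g_t')^{-1}, evaluated at the given beta."""
    g = Z * (y - X @ beta)[:, None]          # T x k matrix, row t is g_t
    return np.linalg.inv(g.T @ g / len(y))

def two_step_gmm(y, X, Z):
    beta1 = linear_gmm(y, X, Z, np.eye(Z.shape[1]))   # step 1: W = I
    W2 = weight_from_estimate(beta1, y, X, Z)         # estimate of Omega^{-1}
    beta2 = linear_gmm(y, X, Z, W2)                   # step 2: re-estimate
    return beta1, beta2, W2

# Simulated data (hypothetical): heteroskedastic errors, valid instruments.
rng = np.random.default_rng(2)
T = 5000
Z = rng.normal(size=(T, 3))
e = rng.normal(size=T)
u = e * (0.5 + np.abs(Z[:, 0]))              # E[z u] = 0, variance depends on z
x = Z @ np.array([1.0, 0.5, -0.5]) + 0.8 * e + rng.normal(size=T)
X = x[:, None]
y = X @ np.array([2.0]) + u

beta1, beta2, W2 = two_step_gmm(y, X, Z)
print("first step:", beta1, "second step:", beta2)
```

Repeating the last two lines of `two_step_gmm` until the estimate stops changing would give an iterated variant of the same idea.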
Another important issue in implementing the minimization procedure is that it must search through a (possibly high-dimensional) parameter space Θ and find the value of θ which minimizes the objective function. No generic recommendation for such a procedure exists; it is the subject of its own field, numerical optimization.
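For a one-dimensional parameter, the crudest possible search is a brute-force grid. The sketch below uses it to estimate the rate λ of an exponential distribution from the two moment conditions E[Y − 1/λ] = 0 and E[Y² − 2/λ²] = 0 with an identity weighting matrix. The choice of model, the grid bounds and resolution, and the function name `objective_on_grid` are assumptions of this example; grid search does not scale to high-dimensional Θ, where a dedicated optimizer would normally be used.

```python
import numpy as np

# Simulated data (hypothetical): exponential with rate lam_true.
rng = np.random.default_rng(3)
lam_true = 2.0
y = rng.exponential(scale=1.0 / lam_true, size=20000)

def objective_on_grid(lam_grid, y, W):
    """GMM objective m_hat' W m_hat at every grid point.

    Moment function: g(y, lam) = (y - 1/lam, y^2 - 2/lam^2).
    """
    m = np.stack([y.mean() - 1.0 / lam_grid,
                  (y ** 2).mean() - 2.0 / lam_grid ** 2])   # shape (2, K)
    return np.einsum("ik,ij,jk->k", m, W, m)                # shape (K,)

lam_grid = np.linspace(0.1, 5.0, 4901)       # grid step of 0.001
obj = objective_on_grid(lam_grid, y, np.eye(2))
lam_hat = lam_grid[np.argmin(obj)]
print("grid-search estimate of lambda:", lam_hat)
```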
See main article: Sargan–Hansen test. When the number of moment conditions is greater than the dimension of the parameter vector θ, the model is said to be over-identified. Sargan (1958) proposed tests for over-identifying restrictions based on instrumental variables estimators that are distributed in large samples as chi-square variables with degrees of freedom that depend on the number of over-identifying restrictions. Subsequently, Hansen (1982) applied this test to the mathematically equivalent formulation of GMM estimators. Note, however, that such statistics can be negative in empirical applications where the models are misspecified, and likelihood ratio tests can yield insights since the models are estimated under both null and alternative hypotheses (Bhargava and Sargan, 1983).
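Conceptually we can check whether \hat{m}(\hat\theta) is sufficiently close to zero to suggest that the model fits the data well. GMM replaces the problem of solving the equation \hat{m}(\theta) = 0, which would choose \theta to match the restrictions exactly, with a minimization problem. The minimization can always be carried out, even when no \theta_0 exists such that m(\theta_0) = 0. This is what the J-test checks; it is also called a test for over-identifying restrictions.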
Formally we consider two hypotheses:
H_0: m(\theta_0) = 0
H_1: m(\theta) \neq 0, \ \forall\theta\in\Theta
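Under hypothesis H_0, the following so-called J-statistic is asymptotically chi-squared distributed with k − ℓ degrees of freedom:
J \equiv T \cdot \left(\frac{1}{T}\sum_{t=1}^{T} g(Y_t,\hat\theta)\right)^T \hat{W}_T \left(\frac{1}{T}\sum_{t=1}^{T} g(Y_t,\hat\theta)\right) \xrightarrow{d} \chi^2_{k-\ell} \quad \text{under } H_0,
where \hat\theta is the GMM estimator of the parameter \theta_0, k is the number of moment conditions (the dimension of the vector g), and \ell is the number of estimated parameters (the dimension of the vector \theta). The matrix \hat{W}_T must converge in probability to \Omega^{-1}, the efficient weighting matrix. Note that previously we only required W to be proportional to \Omega^{-1} for the estimator to be efficient; to conduct the J-test, however, W must be exactly equal to \Omega^{-1}, not simply proportional.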
Under the alternative hypothesis H_1, the J-statistic is asymptotically unbounded:
J \xrightarrow{p} \infty \quad \text{under } H_1
To conduct the test we compute the value of J from the data. It is a nonnegative number. We compare it with (for example) the 0.95 quantile of the \chi^2_{k-\ell} distribution:
H_0 is rejected at the 95% confidence level if J > q_{0.95}^{\chi^2_{k-\ell}},
H_0 cannot be rejected at the 95% confidence level if J < q_{0.95}^{\chi^2_{k-\ell}}.
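The test procedure can be sketched for linear moment conditions g(Y_t, β) = z_t (y_t − x_t^T β) with three instruments and one parameter, so that k − ℓ = 2. In that particular case the chi-square distribution has the closed-form distribution function 1 − exp(−x/2), so its 0.95 quantile is −2 ln(0.05) and no statistics library is needed. The model, both simulated data sets (one with valid instruments, one where the third instrument is deliberately correlated with the error) and all function names are assumptions made for this illustration.

```python
import math
import numpy as np

def two_step_linear_gmm(y, X, Z):
    """Two-step GMM for the moments g_t = z_t * (y_t - x_t' beta)."""
    T = len(y)
    Szx, Szy = Z.T @ X / T, Z.T @ y / T

    def solve(W):
        return np.linalg.solve(Szx.T @ W @ Szx, Szx.T @ W @ Szy)

    beta1 = solve(np.eye(Z.shape[1]))        # step 1: W = I
    g1 = Z * (y - X @ beta1)[:, None]        # T x k matrix of moment values
    W_hat = np.linalg.inv(g1.T @ g1 / T)     # estimate of Omega^{-1}
    return solve(W_hat), W_hat

def j_statistic(beta, y, X, Z, W_hat):
    """J = T * m_hat' W_hat m_hat, evaluated at the GMM estimate."""
    m = Z.T @ (y - X @ beta) / len(y)
    return len(y) * float(m @ W_hat @ m)

def simulate(rng, T, invalid):
    """Hypothetical data; if `invalid`, the third instrument enters the error."""
    Z = rng.normal(size=(T, 3))
    e = rng.normal(size=T)
    u = e + (0.5 * Z[:, 2] if invalid else 0.0)
    x = Z @ np.array([1.0, 0.5, -0.5]) + 0.8 * e + rng.normal(size=T)
    X = x[:, None]
    return X @ np.array([2.0]) + u, X, Z

# 0.95 quantile of chi-square with k - l = 2 degrees of freedom.
q95 = -2.0 * math.log(1.0 - 0.95)

rng = np.random.default_rng(4)
results = {}
for invalid in (False, True):
    y, X, Z = simulate(rng, 5000, invalid)
    beta_hat, W_hat = two_step_linear_gmm(y, X, Z)
    J = j_statistic(beta_hat, y, X, Z, W_hat)
    results[invalid] = J
    decision = "reject H0" if J > q95 else "do not reject H0"
    print(f"invalid instrument: {invalid}, J = {J:.2f}, {decision}")
```

For other values of k − ℓ the quantile has no such simple form and would normally be taken from a table or a statistics library.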
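Many other popular estimation techniques can be cast in terms of GMM optimization:
Ordinary least squares (OLS) is equivalent to GMM with moment conditions \operatorname{E}[x_t (y_t - x_t^T \beta)] = 0.
Weighted least squares is equivalent to GMM with moment conditions \operatorname{E}[x_t (y_t - x_t^T \beta) / \sigma^2(x_t)] = 0.
Instrumental variables regression (IV) is equivalent to GMM with moment conditions \operatorname{E}[z_t (y_t - x_t^T \beta)] = 0.
Non-linear least squares (NLLS) is equivalent to GMM with moment conditions \operatorname{E}[\nabla_\beta g(x_t,\beta) \cdot (y_t - g(x_t,\beta))] = 0.
Maximum likelihood estimation (MLE) is equivalent to GMM with moment conditions \operatorname{E}[\nabla_\theta \ln f(x_t,\theta)] = 0.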
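For instance, ordinary least squares can be read as a GMM problem with the moment conditions E[x_t (y_t − x_t^T β)] = 0. There are as many conditions as parameters, so the sample moments can be set exactly to zero and the weighting matrix plays no role. The sketch below, on simulated data with hypothetical coefficients, solves the sample moment equations directly and compares the result with NumPy's least-squares solver.

```python
import numpy as np

# Simulated regression data (hypothetical coefficients).
rng = np.random.default_rng(5)
T = 1000
X = np.column_stack([np.ones(T), rng.normal(size=(T, 2))])
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=T)

def sample_moments(beta):
    """m_hat(beta) = (1/T) * sum_t x_t * (y_t - x_t' beta)."""
    return X.T @ (y - X @ beta) / T

# Exactly identified case: solve m_hat(beta) = 0, i.e. (X'X/T) beta = X'y/T.
beta_gmm = np.linalg.solve(X.T @ X / T, X.T @ y / T)

# Reference: least-squares solution computed by NumPy.
beta_ols = np.linalg.lstsq(X, y, rcond=None)[0]

print(beta_gmm, beta_ols, sample_moments(beta_gmm))
```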
In method of moments, an alternative to the original (non-generalized) Method of Moments (MoM) is described, and references to some applications and a list of theoretical advantages and disadvantages relative to the traditional method are provided. This Bayesian-Like MoM (BL-MoM) is distinct from all the related methods described above, which are subsumed by the GMM.[4] [5] The literature does not contain a direct comparison between the GMM and the BL-MoM in specific applications.