Clustered standard errors (or Liang-Zeger standard errors)[1] are estimates of the standard error of a regression parameter for settings in which observations can be subdivided into smaller groups ("clusters") and in which sampling and/or treatment assignment is correlated within each group.[2][3] Clustered standard errors are widely used in applied econometric settings, including difference-in-differences[4] and experiments.[5]
Analogous to how Huber-White standard errors are consistent in the presence of heteroscedasticity and Newey–West standard errors are consistent in the presence of accurately modeled autocorrelation, clustered standard errors are consistent in the presence of cluster-based sampling or treatment assignment. Clustered standard errors are often justified by possible correlation in model residuals within each cluster; while recent work suggests that this is not the precise justification for clustering,[6] it may be pedagogically useful.
Clustered standard errors are often useful when treatment is assigned at the level of a cluster instead of at the individual level. For example, suppose that an educational researcher wants to discover whether a new teaching technique improves student test scores. She therefore assigns teachers in "treated" classrooms to try this new technique, while leaving "control" classrooms unaffected. When analyzing her results, she may want to keep the data at the student level (for example, to control for student-level observable characteristics). However, when estimating the standard error or confidence interval of her statistical model, she realizes that classical or even heteroscedasticity-robust standard errors are inappropriate because student test scores within each class are not independently distributed. Instead, students in classes with better teachers have especially high test scores (regardless of whether they receive the experimental treatment) while students in classes with worse teachers have especially low test scores. The researcher can cluster her standard errors at the level of a classroom to account for this aspect of her experiment.[7]
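The classroom example can be illustrated with a small Monte Carlo sketch. Everything here is an assumption made for illustration: the number of classrooms, the class size, the unit-variance teacher effect and student noise, and the helper name `simulate_once`. The true treatment effect is set to zero, so the spread of the estimated effect across repeated experiments shows how much sampling variability there really is, and this can be compared with the average "naive" standard error that treats students as independent.

```python
import numpy as np

def simulate_once(rng, n_classes=20, class_size=25, effect=0.0):
    """Simulate one classroom-level experiment (hypothetical setup)."""
    # Treatment is assigned to whole classrooms: the first half are treated.
    treated_class = np.arange(n_classes) < n_classes // 2
    # Every student in a classroom shares the same teacher effect.
    teacher = rng.normal(0.0, 1.0, size=n_classes)
    # Each student also has idiosyncratic noise.
    noise = rng.normal(0.0, 1.0, size=(n_classes, class_size))
    scores = teacher[:, None] + noise + effect * treated_class[:, None]

    y1 = scores[treated_class].ravel()   # students in treated classrooms
    y0 = scores[~treated_class].ravel()  # students in control classrooms
    diff = y1.mean() - y0.mean()
    # Naive standard error: treats all students as independent draws.
    naive_se = np.sqrt(y1.var(ddof=1) / y1.size + y0.var(ddof=1) / y0.size)
    return diff, naive_se

rng = np.random.default_rng(0)
draws = np.array([simulate_once(rng) for _ in range(500)])
empirical_sd = draws[:, 0].std(ddof=1)  # actual spread of the estimate
mean_naive_se = draws[:, 1].mean()      # what the naive formula reports

print(f"sd of estimates across experiments: {empirical_sd:.3f}")
print(f"average naive standard error:       {mean_naive_se:.3f}")
```

Under these assumptions the naive standard error is noticeably smaller than the actual spread of the estimates, because the shared teacher effect means that each classroom contributes far less independent information than its number of students suggests.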
While this example is very specific, similar issues arise in a wide variety of settings. For example, in many panel data settings (such as difference-in-differences) clustering often offers a simple and effective way to account for non-independence between periods within each unit (sometimes referred to as "autocorrelation in residuals"). Another common and logically distinct justification for clustering arises when a full population cannot be randomly sampled, and so instead clusters are sampled and then units are randomized within cluster. In this case, clustered standard errors account for the uncertainty driven by the fact that the researcher does not observe large parts of the population of interest.[8]
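A useful mathematical illustration comes from the case of one-way clustering in an ordinary least squares (OLS) model. Consider a simple model with $n$ observations that are subdivided into $C$ clusters. Let $Y$ be an $n \times 1$ vector of outcomes, $X$ an $n \times m$ matrix of covariates, $\beta$ an $m \times 1$ vector of unknown parameters, and $e$ an $n \times 1$ vector of unexplained residuals:
$$Y = X\beta + e$$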
As is standard with OLS models, we minimize the sum of squared residuals $e$ to obtain an estimate $\hat{\beta}$:
$$\min_{\beta} (Y - X\beta)'(Y - X\beta)$$
$$\Rightarrow X'(Y - X\hat{\beta}) = 0$$
$$\Rightarrow \hat{\beta} = (X'X)^{-1}X'Y$$
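As a minimal numerical sketch of this step, the following NumPy snippet solves the normal equations on simulated data and checks the first-order condition $X'(Y - X\hat{\beta}) = 0$. The data-generating values (sample size, coefficients, noise) are arbitrary assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 200, 3

# Simulated design matrix with an intercept column (illustrative values only).
X = np.column_stack([np.ones(n), rng.normal(size=(n, m - 1))])
beta = np.array([1.0, 2.0, -0.5])
Y = X @ beta + rng.normal(size=n)

# Solve the normal equations X'X b = X'Y rather than forming an explicit inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
resid = Y - X @ beta_hat

# First-order condition: the residuals are orthogonal to every column of X.
print(np.allclose(X.T @ resid, 0.0, atol=1e-8))

# Cross-check against NumPy's least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))
```

Solving the linear system with `np.linalg.solve` is generally preferable to computing $(X'X)^{-1}$ directly, although both give the same estimate in well-conditioned problems.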
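From there, we can derive the classic "sandwich" estimator, taking the variance conditional on $X$:
$$V(\hat{\beta}) = V((X'X)^{-1}X'Y) = V(\beta + (X'X)^{-1}X'e) = V((X'X)^{-1}X'e) = (X'X)^{-1}X'\,E[ee' \mid X]\,X(X'X)^{-1}$$
Denoting $\Omega \equiv E[ee' \mid X]$ yields a potentially more familiar form:
$$V(\hat{\beta}) = (X'X)^{-1}X'\Omega X(X'X)^{-1}$$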
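While one can develop a plug-in estimator by defining $\hat{e} \equiv Y - X\hat{\beta}$ and letting $\hat{\Omega} \equiv \hat{e}\hat{e}'$, this completely flexible estimator will not converge to $V(\hat{\beta})$ as $n \to \infty$. Different types of standard errors solve this problem in different ways, depending on the assumptions that a practitioner deems reasonable. For example, classic homoskedastic standard errors assume that $\Omega$ is diagonal with identical elements $\sigma^2$, which simplifies the expression to $V(\hat{\beta}) = \sigma^2(X'X)^{-1}$. Huber-White standard errors assume that $\Omega$ is diagonal but allow the diagonal values to vary.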
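To make the role of the assumed form of $\Omega$ concrete, here is a hedged sketch that computes two covariance estimates from the same OLS fit: the classical one, $\hat{\sigma}^2(X'X)^{-1}$, and a Huber-White one that plugs the squared residuals into a diagonal $\hat{\Omega}$ (the variant without a finite-sample correction, often called HC0). The function name `ols_covariances`, the degrees-of-freedom choice for $\hat{\sigma}^2$, and the simulated data are assumptions for illustration.

```python
import numpy as np

def ols_covariances(X, Y):
    """Return beta_hat plus classical and Huber-White (HC0) covariance estimates."""
    n, m = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    e_hat = Y - X @ beta_hat

    # Classical: Omega = sigma^2 * I, with sigma^2 estimated from the residuals.
    sigma2_hat = e_hat @ e_hat / (n - m)
    V_classic = sigma2_hat * XtX_inv

    # Huber-White: Omega is diagonal with entries e_i^2, so X' Omega X = sum_i e_i^2 x_i x_i'.
    meat = (X * e_hat[:, None] ** 2).T @ X
    V_white = XtX_inv @ meat @ XtX_inv
    return beta_hat, V_classic, V_white

# Usage on simulated heteroscedastic data (illustrative values only).
rng = np.random.default_rng(0)
x = rng.normal(size=300)
X_demo = np.column_stack([np.ones(300), x])
Y_demo = 1.0 + 2.0 * x + rng.normal(size=300) * (1.0 + np.abs(x))
_, V_c, V_w = ols_covariances(X_demo, Y_demo)
print("classical SEs:  ", np.sqrt(np.diag(V_c)))
print("Huber-White SEs:", np.sqrt(np.diag(V_w)))
```

Both estimates share the same outer $(X'X)^{-1}$ terms; only the middle term changes with the assumption placed on $\Omega$.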
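Clustered standard errors assume that $\Omega$ is block-diagonal according to the clusters in the sample, with unrestricted values in each block but zeros elsewhere. In this case, one can define $X_c$ and $\Omega_c$ as the within-block analogues of $X$ and $\Omega$ and derive the following identity:
$$X'\Omega X = \sum_c X_c'\Omega_c X_c$$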
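The block-diagonal identity can be checked numerically. The sketch below builds an arbitrary block-diagonal $\Omega$ for three clusters of unequal size and compares $X'\Omega X$ with the sum of per-cluster terms; the cluster sizes and random values are assumptions chosen only for the check.

```python
import numpy as np

rng = np.random.default_rng(0)

# Cluster membership for 9 observations in 3 clusters (illustrative).
cluster_ids = np.array([0, 0, 0, 1, 1, 2, 2, 2, 2])
n, m = cluster_ids.size, 2
X = rng.normal(size=(n, m))

# Build a block-diagonal Omega: unrestricted within clusters, zero elsewhere.
Omega = np.zeros((n, n))
blocks = {}
for c in np.unique(cluster_ids):
    idx = np.flatnonzero(cluster_ids == c)
    A = rng.normal(size=(idx.size, idx.size))
    blocks[c] = A @ A.T                  # arbitrary symmetric PSD block
    Omega[np.ix_(idx, idx)] = blocks[c]

# Left-hand side: the full n x n computation.
full = X.T @ Omega @ X

# Right-hand side: sum of within-cluster terms X_c' Omega_c X_c.
by_cluster = sum(
    X[cluster_ids == c].T @ blocks[c] @ X[cluster_ids == c] for c in blocks
)

print(np.allclose(full, by_cluster))
```

The per-cluster form never needs the full $n \times n$ matrix, which is one reason it is convenient when $n$ is large.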
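By constructing plug-in matrices $\hat{\Omega}_c$, one can form an estimator for $V(\hat{\beta})$ that is consistent as the number of clusters $C$ becomes large. The resulting estimator of $V(\hat{\beta})$ is
$$\hat{V}(\hat{\beta}) = (X'X)^{-1}\sum_c X_c'\hat{\Omega}_c X_c(X'X)^{-1}$$
In practice, this estimator is often multiplied by finite-sample correction factors such as $\frac{C}{C-1}$ and $\frac{n-1}{n-k}$, where $k$ is the number of estimated parameters ($m$ in the notation above).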
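A minimal sketch of the one-way cluster-robust estimator follows. It assumes the common plug-in choice $\hat{\Omega}_c = \hat{e}_c\hat{e}_c'$, which the text above does not spell out, and it applies the product $\frac{C}{C-1}\cdot\frac{n-1}{n-k}$ only when the hypothetical `small_sample` flag is set. The function name `cluster_robust_cov` and the simulated data are likewise illustrative assumptions, not a reference implementation.

```python
import numpy as np

def cluster_robust_cov(X, Y, cluster_ids, small_sample=False):
    """One-way cluster-robust covariance of the OLS estimator (illustrative)."""
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    cluster_ids = np.asarray(cluster_ids)
    n, k = X.shape

    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ Y
    e_hat = Y - X @ beta_hat

    # Middle term: sum over clusters of X_c' e_c e_c' X_c (assumed plug-in Omega_c).
    meat = np.zeros((k, k))
    clusters = np.unique(cluster_ids)
    for c in clusters:
        mask = cluster_ids == c
        score = X[mask].T @ e_hat[mask]   # X_c' e_c, a length-k vector
        meat += np.outer(score, score)

    V = XtX_inv @ meat @ XtX_inv
    if small_sample:
        C = clusters.size
        V = V * (C / (C - 1)) * ((n - 1) / (n - k))
    return beta_hat, V

# Usage: 30 clusters of 10 observations with a shared cluster-level shock.
rng = np.random.default_rng(0)
ids = np.repeat(np.arange(30), 10)
x = rng.normal(size=ids.size)
shock = rng.normal(size=30)[ids]
X_demo = np.column_stack([np.ones(ids.size), x])
Y_demo = 1.0 + 2.0 * x + shock + rng.normal(size=ids.size)

beta_hat, V_cluster = cluster_robust_cov(X_demo, Y_demo, ids, small_sample=True)
print("coefficients: ", beta_hat)
print("clustered SEs:", np.sqrt(np.diag(V_cluster)))
```

When every observation forms its own cluster, this expression reduces to the Huber-White estimator, which is a convenient sanity check on an implementation.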