面板数据的双聚类标准误答案

【问题标题】：Double clustered standard errors for panel data面板数据的双聚类标准误
【发布时间】：2012-01-13 11:03:27
【问题描述】：

我在 R（时间和横截面）中有一个面板数据集，并且想计算按二维聚类的标准误差，因为我的残差是双向相关的。谷歌搜索我发现http://thetarzan.wordpress.com/2011/06/11/clustered-standard-errors-in-r/ 它提供了一个功能来做到这一点。这似乎有点特别，所以我想知道是否有一个包已经过测试并且这样做了吗？

我知道 sandwich 会产生 HAC 标准错误，但它不会进行双重聚类（即沿二维）。

【问题讨论】：

标签： r regression standard-error panel-data plm

【解决方案1】：

这是一个老问题。但是看到人们似乎仍在登陆它，我想我会提供一些现代方法来在 R 中进行多路聚类：

选项 1（最快）：`fixest::feols()`

library(fixest)

nlswork = haven::read_dta("http://www.stata-press.com/data/r14/nlswork.dta")

est_feols = feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork)
est_feols

## An important feature of fixest: We can _instantaneously_ compute other
## VCOV matrices / SEs on the fly with summary.fixest(). No need to re-run
## the model!
summary(est_feols, se = 'standard') ## vanilla SEs
summary(est_feols, se = 'hetero') ## robust SEs
summary(est_feols, se = 'twoway') ## diff syntax, but same as original model
summary(est_feols, cluster = c('race', 'year')) ## ditto
summary(est_feols, cluster = ~race^year) ## interacted cluster vars
summary(est_feols, cluster = ~ race + year + idcode)  ## add third cluster var (not in original model call)
## etc.

选项 2（快速）：`lfe::felm()`

library(lfe)

## Note the third "| 0 " slot means we're not using IV

est_felm = felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
summary(est_felm)

选项 3（较慢，但灵活）：`sandwich`

library(sandwich)
library(lmtest)

est_sandwich = lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork) 
coeftest(est_sandwich, vcov = vcovCL, cluster = ~ race + year)

基准测试

Aaaand，只是为了强调关于速度的观点。这是三种不同方法（使用两个固定 FE 和双向聚类）的基准。

est_feols = function() feols(ln_wage ~ age | race + year, cluster = ~race+year, data = nlswork) 
est_felm = function() felm(ln_wage ~ age | race + year | 0 | race + year, data = nlswork)
est_standwich = function() {coeftest(lm(ln_wage ~ age + factor(race) + factor(year), data = nlswork), 
                                     vcov = vcovCL, cluster = ~ race + year)}

microbenchmark(est_feols(), est_felm(), est_standwich(), times = 3)

#> Unit: milliseconds
#>             expr       min        lq      mean    median        uq       max neval cld
#>      est_feols()  11.94122  11.96158  12.55835  11.98193  12.86692  13.75191     3 a  
#>       est_felm()  87.18064  95.89905 100.69589 104.61746 107.45352 110.28957     3  b 
#>  est_standwich() 176.43502 183.95964 188.50271 191.48425 194.53656 197.58886     3   c

【讨论】：

我相信，`plm::vcovDC' 会是另一种选择吗？顺便说一句：模型估计和 vcov 计算的分离（“即时计算其他 VCOV 矩阵/SE [...]。无需重新运行模型！”）有点宽泛许多 R 包中的约定。
是的，还有其他几个（例如 clubSandwich 和 estimatr）。关于您关于估计后 SE 调整的第二点是 R 中的常见约定；我同意 ;-) grantmcdermott.com/better-way-adjust-SEs

【解决方案2】：

对于面板回归，plm 包可以沿两个维度估计集群 SE。

使用M. Petersen’s benchmark results：

require(foreign)
require(plm)
require(lmtest)
test <- read.dta("http://www.kellogg.northwestern.edu/faculty/petersen/htm/papers/se/test_data.dta")

##Double-clustering formula (Thompson, 2011)
vcovDC <- function(x, ...){
    vcovHC(x, cluster="group", ...) + vcovHC(x, cluster="time", ...) - 
        vcovHC(x, method="white1", ...)
}

fpm <- plm(y ~ x, test, model='pooling', index=c('firmid', 'year'))

所以现在你可以获得集群的 SE：

##Clustered by *group*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *time*
> coeftest(fpm, vcov=function(x) vcovHC(x, cluster="time", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.022189  1.3376   0.1811    
x           1.034833   0.031679 32.6666   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

##Clustered by *group* and *time*
> coeftest(fpm, vcov=function(x) vcovDC(x, type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.064580  0.4596   0.6458    
x           1.034833   0.052465 19.7243   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

更多详情见：

Fama-MacBeth and Cluster-Robust (by Firm and Time) Standard Errors in R。

但是，只有当您的数据可以强制转换为 pdata.frame 时，上述方法才有效。如果您有"duplicate couples (time-id)"，它将失败。在这种情况下，您仍然可以聚类，但只能沿着一个维度。

通过仅指定一个索引来欺骗plm 认为您拥有正确的面板数据集：

fpm.tr <- plm(y ~ x, test, model='pooling', index=c('firmid'))

所以现在你可以获得集群的 SE：

##Clustered by *group*
> coeftest(fpm.tr, vcov=function(x) vcovHC(x, cluster="group", type="HC1"))

t test of coefficients:

            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 0.029680   0.066952  0.4433   0.6576    
x           1.034833   0.050550 20.4714   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

您还可以使用此解决方法按更高维度或更高级别（例如industry 或country）进行聚类。但是在这种情况下，您将无法使用group（或time）effects，这是该方法的主要限制。

另一种适用于面板数据和其他类型数据的方法是 multiwayvcov 包。它允许双重聚类，但也允许在更高维度进行聚类。根据软件包的website，这是对 Arai 代码的改进：

观察结果的透明处理因缺失而下降

完整的多路（或 n 路、或 n 维或多维）聚类

使用彼得森数据和cluster.vcov()：

library("lmtest")
library("multiwayvcov")

data(petersen)
m1 <- lm(y ~ x, data = petersen)

coeftest(m1, vcov=function(x) cluster.vcov(x, petersen[ , c("firmid", "year")]))
## 
## t test of coefficients:
## 
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.029680   0.065066  0.4561   0.6483    
## x           1.034833   0.053561 19.3206   <2e-16 ***
## ---
## Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

【讨论】：

【解决方案3】：

Arai 的函数可用于聚类标准错误。他还有另一个用于多维度聚类的版本：

mcl <- function(dat,fm, cluster1, cluster2){
          attach(dat, warn.conflicts = F)
          library(sandwich);library(lmtest)
          cluster12 = paste(cluster1,cluster2, sep="")
          M1  <- length(unique(cluster1))
          M2  <- length(unique(cluster2))   
          M12 <- length(unique(cluster12))
          N   <- length(cluster1)          
          K   <- fm$rank             
          dfc1  <- (M1/(M1-1))*((N-1)/(N-K))  
          dfc2  <- (M2/(M2-1))*((N-1)/(N-K))  
          dfc12 <- (M12/(M12-1))*((N-1)/(N-K))  
          u1j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster1,  sum)) 
          u2j   <- apply(estfun(fm), 2, function(x) tapply(x, cluster2,  sum)) 
          u12j  <- apply(estfun(fm), 2, function(x) tapply(x, cluster12, sum)) 
          vc1   <-  dfc1*sandwich(fm, meat=crossprod(u1j)/N )
          vc2   <-  dfc2*sandwich(fm, meat=crossprod(u2j)/N )
          vc12  <- dfc12*sandwich(fm, meat=crossprod(u12j)/N)
          vcovMCL <- vc1 + vc2 - vc12
          coeftest(fm, vcovMCL)}

参考和使用示例见：

Clustered Standard Errors in R

【讨论】：

【解决方案4】：

Frank Harrell 的包rms（以前叫Design）有一个我在聚类时经常使用的函数：robcov。

例如，参见?robcov 的这一部分。

cluster: a variable indicating groupings. ‘cluster’ may be any type of
      vector (factor, character, integer).  NAs are not allowed.
      Unique values of ‘cluster’ indicate possibly correlated
      groupings of observations. Note the data used in the fit and
      stored in ‘fit$x’ and ‘fit$y’ may have had observations
      containing missing values deleted. It is assumed that if any
      NAs were removed during the original model fitting, an
      ‘naresid’ function exists to restore NAs so that the rows of
      the score matrix coincide with ‘cluster’. If ‘cluster’ is
      omitted, it defaults to the integers 1,2,...,n to obtain the
      "sandwich" robust covariance matrix estimate.

【讨论】：

不幸的是，robcov 仅适用于 ols 对象，但不适用于 lm 对象。你知道类似的功能适用于更主流的lm吗？

选项 1（最快）：fixest::feols()

选项 2（快速）：lfe::felm()

选项 3（较慢，但灵活）：sandwich

基准测试

选项 1（最快）：`fixest::feols()`

选项 2（快速）：`lfe::felm()`

选项 3（较慢，但灵活）：`sandwich`