Between/within standard deviations: answers

【Question title】: Between/within standard deviations
【Posted】: 2013-01-04 22:05:53
【Question description】:

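When working with hierarchical/multilevel/panel data sets, it can be very useful to have a package that returns the within-group and between-group standard deviations of the available variables.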

In Stata, this can be done easily for data like the example shown later with the command

xtsum, i(momid)

I did some searching, but I could not find any R package that does this.

Edit:

Just to fix ideas, an example of a hierarchical data set could be this:

son_id       mom_id      hispanic     mom_smoke     son_birthweigth

  1            1            1            1              3950
  2            1            1            0              3890
  3            1            1            0              3990
  1            2            0            1              4200
  2            2            0            1              4120
  1            3            0            0              2975
  2            3            0            1              2980

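The "multilevel" structure comes from the fact that each mother (higher level) has two or more sons (lower level). Each mother therefore defines a group of observations.

Accordingly, each variable in the data set can vary both between and within mothers, or only between mothers. son_birthweigth varies between mothers, but also within the same mother. By contrast, hispanic is fixed for a given mother.

For example, the within-mother variance of son_birthweigth is: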

# per-mother means (mothers 1, 2 and 3)
    bwt_mean1 <- (3950+3890+3990)/3
    bwt_mean2 <- (4200+4120)/2
    bwt_mean3 <- (2975+2980)/2

# Within-mother variance for birthweigth
    ((3950-bwt_mean1)^2 + (3890-bwt_mean1)^2 + (3990-bwt_mean1)^2 + 
    (4200-bwt_mean2)^2 + (4120-bwt_mean2)^2 + 
    (2975-bwt_mean3)^2 + (2980-bwt_mean3)^2)/(7-1)

while the between-mother variance is:

# overall mean of birthweigth:
# mean <- sum(data$son_birthweigth)/length(data$son_birthweigth)
    mean <- (3950+3890+3990+4200+4120+2975+2980)/7

# between variance:
    ((bwt_mean1-mean)^2 + (bwt_mean2-mean)^2 + (bwt_mean3-mean)^2)/(3-1)
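
The two hand calculations above can be written for arbitrary group sizes. The following is a minimal base-R sketch, assuming the example table is stored in a data frame called dat (a hypothetical name) and using the same denominators as the question, (N - 1) for the within part and (number of mothers - 1) for the between part:

```r
# Example data from the question (column names kept as spelled there)
dat <- data.frame(
  son_id = c(1, 2, 3, 1, 2, 1, 2),
  mom_id = c(1, 1, 1, 2, 2, 3, 3),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

x <- dat$son_birthweigth
g <- dat$mom_id

# Mean of each mother, repeated on every row belonging to that mother
mom_mean_by_row <- ave(x, g)
# One mean per mother
mom_means <- tapply(x, g, mean)

# Within-mother variance: squared deviations from the own-mother mean,
# divided by (number of observations - 1), as in the question
within_var <- sum((x - mom_mean_by_row)^2) / (length(x) - 1)

# Between-mother variance: squared deviations of the mother means from the
# overall mean, divided by (number of mothers - 1), as in the question
between_var <- sum((mom_means - mean(x))^2) / (length(mom_means) - 1)

c(within = within_var, between = between_var)
```

Because ave() returns one group mean per row, the same code works when mothers have different numbers of sons.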

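【Question discussion】:

Do you mean the raw moments, or estimates from a hierarchical model? If the latter, would VarCorr do what you want (from nlme::lme or lme4::lmer)?
I mean the sample moments of the variables' empirical distribution, where the overall standard deviation of each variable can be decomposed into within-cluster and between-cluster components.
@Stezzo Yes, please provide data. It would also be best to give the expected result. It is not clear: do you want to compute the moments of son_birthweigth relative to the other categorical variables?
@agstudy I have added a numerical example to make things clearer. Thanks for any further help.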

Tags: r hierarchical-data stata multi-level

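【Solution 1】:

I don't know what your Stata command is supposed to reproduce, but to answer the second part of the question, about the hierarchical structure: this is easy to do with a list. For example, you can define a structure like this: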

tree = list(
      "var1" = list(
         "panel" = list(type ='p',mean = 1,sd=0)
         ,"cluster" = list(type = 'c',value = c(5,8,10)))
      ,"var2" = list(
          "panel" = list(type ='p',mean = 2,sd=0.5)
         ,"cluster" = list(type="c",value =c(1,2)))
)

Such a list is easy to create with lapply:

tree <- lapply(list('var1','var2'),function(x){ 
  ll <- list(panel= list(type ='p',mean = rnorm(1),sd=0), ## I use symbol here not name
             cluster= list(type = 'c',value = rnorm(3)))  ## R prefer symbols
})
names(tree) <-c('var1','var2')

You can inspect its structure with str:

str(tree)
List of 2
 $ var1:List of 2
  ..$ panel  :List of 3
  .. ..$ type: chr "p"
  .. ..$ mean: num 0.284
  .. ..$ sd  : num 0
  ..$ cluster:List of 2
  .. ..$ type : chr "c"
  .. ..$ value: num [1:3] 0.0722 -0.9413 0.6649
 $ var2:List of 2
  ..$ panel  :List of 3
  .. ..$ type: chr "p"
  .. ..$ mean: num -0.144
  .. ..$ sd  : num 0
  ..$ cluster:List of 2
  .. ..$ type : chr "c"
  .. ..$ value: num [1:3] -0.595 -1.795 -0.439
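
Once the data sit in such a nested list, individual pieces are reached with chained $ and summaries over all variables with sapply. A small sketch using the fixed values from the first tree definition (the name cluster_means is only illustrative):

```r
# Same structure as the hand-written tree above
tree <- list(
  var1 = list(
    panel   = list(type = "p", mean = 1, sd = 0),
    cluster = list(type = "c", value = c(5, 8, 10))
  ),
  var2 = list(
    panel   = list(type = "p", mean = 2, sd = 0.5),
    cluster = list(type = "c", value = c(1, 2))
  )
)

# Reach one element by walking down the levels
tree$var1$cluster$value

# Apply a summary to the cluster values of every variable
cluster_means <- sapply(tree, function(v) mean(v$cluster$value))
cluster_means
```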

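Edit after the OP's clarification

I think the reshape2 package is what you want. I'll demonstrate it here.

The idea is that, to do a multilevel analysis, we need to reshape the data.

First, divide the variables into two groups: identifiers and measured variables. With library(reshape2) loaded, the data are melted into dat.m: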

str(dat.m)
'data.frame':   21 obs. of  4 variables:
 $ son_id  : Factor w/ 3 levels "1","2","3": 1 2 3 1 2 1 2 1 2 3 ...
 $ mom_id  : Factor w/ 3 levels "1","2","3": 1 1 1 2 2 3 3 1 1 1 ...
 $ variable: Factor w/ 3 levels "hispanic","mom_smoke",..: 1 1 1 1 1 1 1 2 2 2 ...
 $ value   : num  1 1 1 0 0 0 0 1 0 0 ..
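
The melt call that produces dat.m does not appear in full above. A minimal sketch of how it could look, assuming the question's table is stored in a data frame called dat (a hypothetical name) with the two identifier columns stored as factors, as the str() output suggests:

```r
library(reshape2)

# Hypothetical reconstruction of the input: the question's table, with the
# identifier columns as factors
dat <- data.frame(
  son_id = factor(c(1, 2, 3, 1, 2, 1, 2)),
  mom_id = factor(c(1, 1, 1, 2, 2, 3, 3)),
  hispanic = c(1, 1, 1, 0, 0, 0, 0),
  mom_smoke = c(1, 0, 0, 1, 1, 0, 1),
  son_birthweigth = c(3950, 3890, 3990, 4200, 4120, 2975, 2980)
)

# Identifiers stay as columns; every other column is stacked into
# a variable/value pair, giving 7 rows x 3 measured variables = 21 rows
dat.m <- melt(dat, id.vars = c("son_id", "mom_id"))
str(dat.m)
```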

Once you have the data in "molten" form, you can "cast" it to rearrange it into the shape you want:

# per-mother means for all variables
 acast(dat.m,variable~mom_id,mean)
                           1    2      3
hispanic           1.0000000    0    0.0
mom_smoke          0.3333333    1    0.5
son_birthweigth 3943.3333333 4160 2977.5
# Within-mother sum of squared deviations for each variable (not yet divided by n - 1)

acast(dat.m,variable~mom_id,function(x) sum((x-mean(x))^2))
                           1    2    3
hispanic           0.0000000    0  0.0
mom_smoke          0.6666667    0  0.5
son_birthweigth 5066.6666667 3200 12.5

## overall mean of each variable
acast(dat.m,variable~.,mean)
                        [,1]
hispanic           0.4285714
mom_smoke          0.5714286
son_birthweigth 3729.2857143

【Discussion】:

Just a quick look: are you sure you want sum((x-mean(x))^2) rather than var(x)?
You are right about the variance, but the overall procedure seems OK.
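
The relation between the two expressions in this exchange can be checked directly. A short base-R sketch on the birth-weight column of the question (the vectors x and g are typed in by hand here): var(x) is the sum of squares divided by (n - 1) within each mother, while pooling the sums of squares over all mothers and dividing by (N - 1) gives the within variance as the question defines it.

```r
x <- c(3950, 3890, 3990, 4200, 4120, 2975, 2980)  # son_birthweigth
g <- c(1, 1, 1, 2, 2, 3, 3)                        # mom_id

# Per-mother sum of squared deviations, as in the acast() call
ss <- tapply(x, g, function(v) sum((v - mean(v))^2))

# Per-mother sample variance is the same quantity divided by (n_i - 1)
n <- tapply(x, g, length)
v <- tapply(x, g, var)
all.equal(as.vector(v), as.vector(ss / (n - 1)))

# Pooled over mothers and divided by (N - 1): the question's within variance
within_var <- sum(ss) / (length(x) - 1)
within_var
```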

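【Solution 2】:

I know this question is four years old, but I recently wanted to do the same thing in R and came up with the following function. It depends on dplyr and tibble. Here df is the data frame, columns is a numeric vector of the column positions to summarize, and individuals is the name of the column identifying the individuals.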

xtsumR <- function(df, columns, individuals) {
  panel <- tibble::tibble()
  for (i in columns) {
    x <- df[[i]]
    # One mean per individual, used for the "between" statistics
    v2 <- aggregate(x, list(df[[individuals]]), "mean")[[2]]
    # Mean of each individual repeated on its own rows; ave() does not
    # assume a fixed number of observations per individual
    groupMean <- ave(x, df[[individuals]])
    # Within deviations, recentred on the overall mean
    varW <- x - groupMean + mean(x)
    v <- tibble::tibble(
      variable = c(names(df)[i], NA, NA),
      variation = c("overall", "between", "within"),
      mean = c(mean(x), NA, NA),
      standard.deviation = c(sd(x), sd(v2), sd(varW)),
      min = c(min(x), min(v2), min(varW)),
      max = c(max(x), max(v2), max(varW))
    )
    panel <- dplyr::bind_rows(panel, v)
  }
  # Round the numeric columns to two decimals
  panel[3:6] <- lapply(panel[3:6], round, 2)
  return(panel)
}
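
The three rows that such a function reports for one variable can be cross-checked by hand. A minimal base-R sketch on the question's birth-weight data, assuming the same convention as the function above (between uses the individual means, within uses the deviations from the individual mean recentred on the overall mean); the name summary_tbl is only illustrative:

```r
x <- c(3950, 3890, 3990, 4200, 4120, 2975, 2980)  # son_birthweigth
g <- c(1, 1, 1, 2, 2, 3, 3)                        # mom_id

between <- as.vector(tapply(x, g, mean))  # one mean per mother
within  <- x - ave(x, g) + mean(x)        # recentred within deviations

summary_tbl <- data.frame(
  variation = c("overall", "between", "within"),
  sd  = c(sd(x), sd(between), sd(within)),
  min = c(min(x), min(between), min(within)),
  max = c(max(x), max(between), max(within))
)
summary_tbl
```

Recentring leaves the mean of the within series equal to the overall mean, so its standard deviation is the square root of the within variance defined in the question.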

【Discussion】: