R: efficient/scalable way to calculate column-wise stats

【Question title】: R: efficient/scalable for calculating column wise stats
【Posted】: 2017-02-01 14:07:48
【Question】:

I need to compute column-wise statistics from the following data:

  > library(dplyr)
  > Input <- data_frame(id=c(1,2,2,3,3,3),status=c(T,T,T,F,F,F),attri1=c(T,T,F,F,F,F), attri2=c(T,T,T,T,T,F))
  > Input
  Source: local data frame [6 x 4]

       id status attri1 attri2
    (dbl)  (lgl)  (lgl)  (lgl)
  1     1   TRUE   TRUE   TRUE
  2     2   TRUE   TRUE   TRUE
  3     2   TRUE  FALSE   TRUE
  4     3  FALSE  FALSE   TRUE
  5     3  FALSE  FALSE   TRUE
  6     3  FALSE  FALSE  FALSE

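The output is produced by the procedure below. sTaT counts the rows where status == T and the corresponding attribute is T. sFaT counts the rows where status == F and the attribute is T. sFaTuId counts the unique ids among the sFaT rows.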

  > Output <- data_frame(Attri=names(Input)[c(-1,-2)],sTaT=0,sFaT=0, sFaTuId=0)
  > for (as in Output$Attri){
         sTaT <- Input %>% filter_(as) %>% filter(status) %>% nrow()
         sFaT <- Input %>% filter_(as) %>% filter(!status) %>% nrow()
         sFaTuId <-  Input %>% filter_(as) %>% filter(!status) %>%
             select(id) %>% unique() %>% nrow()
         Output[Output$Attri==as,]$sTaT <- sTaT
         Output[Output$Attri==as,]$sFaT <- sFaT
         Output[Output$Attri==as,]$sFaTuId <- sFaTuId
         }

  > Output
  Source: local data frame [2 x 4]

     Attri  sTaT  sFaT sFaTuId
     (chr) (dbl) (dbl)   (dbl)
  1 attri1     2     0       0
  2 attri2     3     2       1

However, this procedure is slow when there are many rows and attribute columns. Is there an efficient way to compute this?

【Comments】:

Tags: r dplyr

【Solution 1】:

We can do this by reshaping the dataset to "long" format (gather), grouping by "Attri", and then calling summarise.

library(tidyr)
library(dplyr)
gather(Input, Attri, Val, attri1:attri2) %>%
  group_by(Attri) %>%
  summarise(sTaT = sum(status & Val),
            sFaT = sum(!status & Val),
            sFaTuId = n_distinct(id[!status & Val]))
# A tibble: 2 × 4
#   Attri  sTaT  sFaT sFaTuId
#   <chr> <int> <int>   <int>
#1 attri1     2     0       0
#2 attri2     3     2       1

Another option is melt from data.table.

library(data.table)
melt(setDT(Input), measure = patterns("^attri\\d+"),
     variable.name = "Attri")[, .(sTaT = sum(status & value),
                                  sFaT = sum(!status & value),
                                  sFaTuId = uniqueN(id[!status & value])),
                              by = .(Attri)]
#    Attri sTaT sFaT sFaTuId
#1: attri1    2    0       0
#2: attri2    3    2       1
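
For comparison, the same three counts can be sketched in base R without reshaping, by treating the attribute columns as a logical matrix. This is not part of the answer above. It assumes the attribute columns are all columns other than id and status, and the name attrs is introduced only for illustration.

```r
# Example data from the question, as a plain data.frame
Input <- data.frame(id     = c(1, 2, 2, 3, 3, 3),
                    status = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
                    attri1 = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
                    attri2 = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE))

# Assumption: every column except id and status is a logical attribute column
attrs  <- as.matrix(Input[setdiff(names(Input), c("id", "status"))])
status <- Input$status

# `attrs & status` recycles status down each column of the matrix,
# so each attribute column is ANDed with status row by row
Output <- data.frame(
  Attri   = colnames(attrs),
  sTaT    = colSums(attrs & status),
  sFaT    = colSums(attrs & !status),
  sFaTuId = apply(attrs & !status, 2,
                  function(hit) length(unique(Input$id[hit]))),
  row.names = NULL
)
Output
```

Whether this is faster than the reshaping approaches depends on the data, so it is worth benchmarking on a realistic sample before choosing.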

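【Comments】:

Thanks @akrun for the prompt reply. After running the code, I get the following error: "Error: Input to n_distinct() must be a single variable name from the data set"
@HappyCoding Based on the example, it works for me with dplyr_0.5.0
@HappyCoding I am not sure whether the behaviour of n_distinct has changed. What is your dplyr version? Can you replace n_distinct with uniqueN(id[!status&Val]) from data.table or length(unique(id from base R?
It is 0.4.3; the problem was solved after updating the package. Also, in the original package, n_distinct works on its own but not inside a pipe.
For attri1:attri2, the range can actually extend to attrin, where n is decided at run time. How can this be made more dynamic?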
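
One possible way to handle the last comment, sketched under two assumptions: every attribute column name starts with "attri", and a tidyr version that provides pivot_longer() (1.0.0 or later) is installed. The columns are then selected with the tidyselect helper starts_with() instead of the fixed range attri1:attri2. The variable attr_prefix is hypothetical and exists only for this illustration.

```r
library(dplyr)
library(tidyr)

# Example data from the question
Input <- tibble(id     = c(1, 2, 2, 3, 3, 3),
                status = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE),
                attri1 = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
                attri2 = c(TRUE, TRUE, TRUE, TRUE, TRUE, FALSE))

# Hypothetical: assumes all attribute columns share this prefix
attr_prefix <- "attri"

Output <- Input %>%
  # Reshape every matching column to long format, however many there are
  pivot_longer(cols = starts_with(attr_prefix),
               names_to = "Attri", values_to = "Val") %>%
  group_by(Attri) %>%
  summarise(sTaT    = sum(status & Val),
            sFaT    = sum(!status & Val),
            sFaTuId = n_distinct(id[!status & Val]))
Output
```

If the attribute columns are simply everything other than id and status, cols = -c(id, status) is an alternative selection that does not depend on a naming pattern.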

【Solution 2】:

I found that doParallel could be a potential solution.

library(doParallel)  # also attaches foreach and parallel
no_cores <- max(1, detectCores() - 1)
cl <- makeCluster(no_cores, type = "FORK")  # FORK clusters are not available on Windows
registerDoParallel(cl)

# Compute the three counts for one logical attribute vector
calStats <- function(as, id, status) {
  sTaT <- sum(as & status)                      # status TRUE, attribute TRUE
  sFaT <- sum(as & !status)                     # status FALSE, attribute TRUE
  sFaTuId <- length(unique(id[as & !status]))   # unique ids among the sFaT rows
  data.frame(c(sTaT, sFaT, sFaTuId))
}

# One task per attribute column (columns 3 and 4 in the example data);
# Input[[i]] extracts the column as a vector rather than a one-column data frame
result <- foreach(i = 3:4, .combine = data.frame) %dopar%
  calStats(Input[[i]], Input$id, Input$status)
names(result) <- names(Input)[c(-1, -2)]
result <- t(result)
colnames(result) <- c("sTaT", "sFaT", "sFaTuId")
stopCluster(cl)

【Comments】: