【Question title】: Summary statistics from aggregated groups using data.table
【Posted】: 2019-03-19 03:44:32
【Question】:

I have a dataset with this structure:

library(data.table)    
dt <- data.table(
  record=c(1:20),
  area=rep(LETTERS[1:4], c(4, 6, 3, 7)), 
  score=c(1,1:3,2:3,1,1,1,2,2,1,2,1,1,1,1,1:3),
  cluster=c("X", "Y", "Z")[c(1,1:3,3,2,1,1:3,1,1:3,3,3,3,1:3)]
)

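I want to summarize the data so that, for a given score (e.g. 1), I can identify the most common cluster in each area. I would also like to compute some basic frequencies and percentages, with output that looks like this: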

dt_summary_for_1_score <- data.table(
  area=c("A","B","C","D"),
  cluster_mode=c("X","X","X","Z"),
  cluster_pct = c(100,66.6,100,80),
  cluster_freq = c(2,2,1,4),
  record_freq = c(2,3,1,5)
)

Ideally, I would like a solution that uses data.table. Thanks.

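【Comments】:

  • Have you searched SO? Surely there are examples of using aggregation functions with data.table objects? If you have already searched but had trouble applying the answers, you should cite those examples and explain where the difficulties arose.
  • It's not clear where cluster_pct, cluster_freq, and record_freq come from.
  • They are my desired output. If you subset the data.table so that only scores of 1 remain, these are the values that would result.
  • I also searched SO to try to find an answer. There are some examples that do similar things, but nothing I could rework for my own purposes.
  • What should happen in the case of a tie?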

Tags: r data.table aggregate summarize


【Solution 1】:

I would make use of frank, although a solution based on sort(table(cluster)) is also possible.

dt_summary = 
  dt[ , .N, keyby = .(area, score, cluster)
      ][ , {
        idx = frank(-N, ties.method = 'min') == 1
        NN = sum(N)
        .(
          cluster_mode = cluster[idx],
          cluster_pct = 100*N[idx]/NN,
          cluster_freq = N[idx],
          record_freq = NN
        )
      }, by = .(area, score)]

To get the example with score == 1, we can subset the result:

dt_summary[score == 1]
#    area score cluster_mode cluster_pct cluster_freq record_freq
# 1:    A     1            X   100.00000            2           2
# 2:    B     1            X    66.66667            2           3
# 3:    C     1            X   100.00000            1           1
# 4:    D     1            Z    80.00000            4           5
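
For comparison, here is a minimal sketch of the sort(table(cluster)) alternative mentioned at the start of this answer. It is not part of the original answer: the name dt_tab is made up for illustration, the data is the question's example, and this version keeps only the first (alphabetically earliest) cluster when there is a tie.

```r
library(data.table)

# Example data copied from the question
dt <- data.table(
  record = 1:20,
  area = rep(LETTERS[1:4], c(4, 6, 3, 7)),
  score = c(1, 1:3, 2:3, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1:3),
  cluster = c("X", "Y", "Z")[c(1, 1:3, 3, 2, 1, 1:3, 1, 1:3, 3, 3, 3, 1:3)]
)

# dt_tab is a hypothetical name used only in this sketch
dt_tab <- dt[score == 1, {
  # Tabulate clusters within this area, most frequent first
  tab <- sort(table(cluster), decreasing = TRUE)
  .(
    cluster_mode = names(tab)[1L],         # first entry only, so ties are dropped
    cluster_pct  = 100 * tab[[1L]] / .N,   # .N is the number of records in the group
    cluster_freq = tab[[1L]],
    record_freq  = .N
  )
}, by = area]

print(dt_tab)
```

Subsetting to score == 1 before grouping avoids tabulating scores that are not needed; grouping by .(area, score) instead would cover every score in one pass.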

In the case of ties, dt_summary contains a separate row for each tied cluster. You could try alternatives such as cluster_mode = paste(cluster[idx], collapse = '|') or cluster_mode = list(cluster[idx]) instead.
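
As a rough sketch of the paste() variant (an illustration, not the original answer's code): the names dt_one_row and top_n are made up here, and taking N[idx][1L] is an added assumption so that the count columns also have length 1 and each area/score group yields exactly one row. In the question's data, area C with score 2 has a tie between X and Y.

```r
library(data.table)

# Example data copied from the question
dt <- data.table(
  record = 1:20,
  area = rep(LETTERS[1:4], c(4, 6, 3, 7)),
  score = c(1, 1:3, 2:3, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1:3),
  cluster = c("X", "Y", "Z")[c(1, 1:3, 3, 2, 1, 1:3, 1, 1:3, 3, 3, 3, 1:3)]
)

# dt_one_row and top_n are hypothetical names used only in this sketch
dt_one_row <- dt[, .N, keyby = .(area, score, cluster)
  ][, {
    idx <- frank(-N, ties.method = "min") == 1
    NN <- sum(N)
    # All tied winners share the same count, so keep just one copy
    top_n <- N[idx][1L]
    .(
      cluster_mode = paste(cluster[idx], collapse = "|"),
      cluster_pct  = 100 * top_n / NN,
      cluster_freq = top_n,
      record_freq  = NN
    )
  }, by = .(area, score)]

# The tied group is collapsed into a single row
print(dt_one_row[area == "C" & score == 2])
```

Without the [1L], N[idx] would still have one element per tied cluster, and the collapsed cluster_mode string would simply be recycled across several rows.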

Breaking down the logic:

# Count how many times each cluster shows up with each area/score
dt[ , .N, keyby = .(area, score, cluster)
   ][ , {

    # Rank each cluster's count within each area/score & take the top;
    #   ties.method = 'min' guarantees that if there's
    #   a tie for "winner", _both_ will get rank 1
    #   (by default, ties.method = 'average')
    # Note that it is slightly inefficient to negate N
    #   in order to sort in descending order, especially
    #   if there are a large number of groups. We could
    #   instead negate once, vectorized, by using -.N in
    #   the previous step, or use frankv (a lower-level
    #   version of frank), which has an 'order' argument
    idx = frank(-N, ties.method = 'min') == 1

    # calculate here since it's used twice
    NN = sum(N)

    .(
      # use [idx] to subset and make sure there are
      #   only as many rows on output as there are
      #   top-ranked clusters for this area/score
      cluster_mode = cluster[idx],
      cluster_pct = 100*N[idx]/NN,
      cluster_freq = N[idx],
      record_freq = NN
    )
  }, by = .(area, score)]
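
The comments above about ties.method and frankv can be checked on a small vector. This is only a sketch: the counts in N are invented for illustration, and the variable names are hypothetical.

```r
library(data.table)

# Invented counts with a tie for the largest value
N <- c(1L, 4L, 4L, 2L)

# The default ties.method = "average" gives both winners rank 1.5,
# so comparing against 1 selects nothing
idx_average <- frank(-N) == 1

# ties.method = "min" gives both winners rank 1
idx_negated <- frank(-N, ties.method = "min") == 1

# frankv ranks in descending order via order = -1L, with no negation
idx_frankv <- frankv(N, order = -1L, ties.method = "min") == 1

print(idx_average)
print(idx_negated)
print(idx_frankv)
```

Both the negated frank call and the frankv call mark the two tied maxima, which is the behaviour the summary code relies on.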
