【Question title】: Grouping by one column in R
【Posted】: 2014-10-23 17:25:52
【Question description】:

I have a dataset "df":

> df
      A    B  C
1  tanu  abc 10
2  tanu  def 20
3  tanu  ghi 15
4  tanu  jkl 28
5  tanu  mno 33
6  tanu  pqr 46
7  tanu  stu 83
8  tanu  vwx 15
9   edu  yz1 60
10  edu abc2 85

> group
[1] 3 2 3 2

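I need to find the maximum value of column "C" for each group. The groups are formed within column "A", and each one contains the number of consecutive rows given by the corresponding element of the vector "group":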

Group1:    
    tanu  abc 10
    tanu  def 20
    tanu  ghi 15
Group2:
    tanu  jkl 28
    tanu  mno 33
Group3:
    tanu  pqr 46
    tanu  stu 83
    tanu  vwx 15
Group4:
    edu  yz1 60
    edu abc2 85

I was not able to achieve this with the aggregate or by functions. I would like my output to be:

> out
      A    B  C  
    tanu  def 20 
    tanu  mno 33 
    tanu  stu 83  
    edu  abc2 85

Any help is appreciated. TIA.

【Question comments】:

    Tags: r group-by aggregate


    【Solution 1】:

    Another base R approach, using by and which.max:

    do.call(rbind, 
       by(df, list(rep(seq_along(group), group)), function(g) g[which.max(g$C),]))
    
    #      A    B  C
    # 1 tanu  def 20
    # 2 tanu  mno 33
    # 3 tanu  stu 83
    # 4  edu abc2 85
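
The key piece in the call above is the grouping vector rep(seq_along(group), group), which labels every row with the index of its group. Here is a minimal sketch of what it produces, assuming df and group are as given in the question (rebuilt here so the snippet runs by itself). The names gr and max_by_group are introduced only for illustration.

```r
# Data from the question, rebuilt so the snippet is self-contained
df <- data.frame(
  A = rep(c("tanu", "edu"), times = c(8, 2)),
  B = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yz1", "abc2"),
  C = c(10L, 20L, 15L, 28L, 33L, 46L, 83L, 15L, 60L, 85L),
  stringsAsFactors = FALSE
)
group <- c(3, 2, 3, 2)

# The group sizes must account for every row
stopifnot(sum(group) == nrow(df))

# One group id per row: 1 1 1 2 2 3 3 3 4 4
gr <- rep(seq_along(group), group)

# Maximum of C within each run of rows (values only, not the full rows)
max_by_group <- tapply(df$C, gr, max)
```

tapply returns only the maxima; the by/which.max call is what recovers the full rows.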
    

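    【Comments】:

      【Solution 2】:

      At first, I thought the goal was the maximum value of column C together with the B value selected by the min of group. The following solution is based on that reading.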

      library(data.table)
       res <- setDT(df)[, list(B=B[min(group)], C=max(C)),
                   by=list(gr=rep(seq_along(group), group),A)][,gr:=NULL]
      

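      After looking at @Matthew Plourde's solution, it is clear that I had misread the question (in the example, both readings produce the same result). In that case,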

       res <- setDT(df)[df[, max(C)==C,
                      by=list(rep(seq_along(group), group), A)]$V1]
      
      
       res
       #      A    B  C
       #1: tanu  def 20
       #2: tanu  mno 33
       #3: tanu  stu 83
       #4:  edu abc2 85
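
The max(C)==C filter can also be expressed in base R with ave, which may be useful when data.table is not available. This is a sketch rather than part of the original answer. It groups by gr alone, on the assumption (true for the example data) that no group spans two values of A, and like the filter above it keeps every row that ties for the group maximum. The names gr, keep and out are illustrative.

```r
# Data from the question, rebuilt so the snippet is self-contained
df <- data.frame(
  A = rep(c("tanu", "edu"), times = c(8, 2)),
  B = c("abc", "def", "ghi", "jkl", "mno", "pqr", "stu", "vwx", "yz1", "abc2"),
  C = c(10L, 20L, 15L, 28L, 33L, 46L, 83L, 15L, 60L, 85L),
  stringsAsFactors = FALSE
)
group <- c(3, 2, 3, 2)

# One group id per row
gr <- rep(seq_along(group), group)

# ave() returns, for every row, the maximum of C within that row's group
keep <- df$C == ave(df$C, gr, FUN = max)

# Rows whose C equals their group maximum (ties are all kept)
out <- df[keep, ]
```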
      

      Or using dplyr:

        library(dplyr)
        df %>% 
            group_by(gr=rep(seq_along(group), group), A) %>% 
            filter(C==max(C))%>% 
            ungroup() %>% 
            select(-gr)
         #    A    B  C
         #1 tanu  def 20
         #2 tanu  mno 33
         #3 tanu  stu 83
         #4  edu abc2 85
      

      Data

      df <-  structure(list(A = c("tanu", "tanu", "tanu", "tanu", "tanu", 
      "tanu", "tanu", "tanu", "edu", "edu"), B = c("abc", "def", "ghi", 
      "jkl", "mno", "pqr", "stu", "vwx", "yz1", "abc2"), C = c(10L, 
      20L, 15L, 28L, 33L, 46L, 83L, 15L, 60L, 85L)), .Names = c("A", 
      "B", "C"), class = "data.frame", row.names = c("1", "2", "3", 
      "4", "5", "6", "7", "8", "9", "10"))
      

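      【Comments】:

      • I get an error: could not find function "setDT". I have loaded the library "data.table". Is any other package needed?
      • @abcdef No, it should work with data.table alone. Which version of data.table are you using?
      【Solution 3】:

      I think this works too.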

      s <- sapply(split(df$C, rep.int(seq_along(group), group)), which.max)
      df[s + cumsum(c(0, group[-length(group)])), ]
      #       A    B  C
      # 2  tanu  def 20
      # 5  tanu  mno 33
      # 7  tanu  stu 83
      # 10  edu abc2 85
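
The index arithmetic works because which.max returns a position within each group, while cumsum(c(0, group[-length(group)])) gives the number of rows that precede each group. Adding the two yields row numbers in the full data frame. Here is a sketch of the intermediate values, using only column C from the question; the names gr, offset and rows are illustrative.

```r
group <- c(3, 2, 3, 2)
C <- c(10, 20, 15, 28, 33, 46, 83, 15, 60, 85)

# One group id per row
gr <- rep.int(seq_along(group), group)

# Position of the maximum inside each group (relative to the group's first row)
s <- sapply(split(C, gr), which.max)

# Number of rows before each group starts: 0 3 5 8
offset <- cumsum(c(0, group[-length(group)]))

# Absolute row numbers of the group maxima
rows <- unname(s) + offset
```

Because which.max returns only the first maximum, this approach yields exactly one row per group even when there are ties.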
      

      【Comments】:

        【Solution 4】:

        This may not be the most elegant answer, but it works :)

        A = c("tanu", 
          "tanu",
          "tanu",
          "tanu",  
          "tanu",  
          "tanu",  
          "tanu",  
          "tanu",  
          "edu",  
          "edu")
        
        B = c("abc", 
          "def",
          "ghi",
          "jkl",  
          "mno",  
          "pqr",  
          "stu",  
          "vwx",  
          "yz1",  
          "abc2")
        
        C = c(10,20,15,28,33,46,83,15,60,85)
        df = data.frame(A=A, B=B, C=C)
        group = c(3,2,3,2)
        
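        out = NULL
        line.nb = 1   # first row of the current group
        
        for(i in seq_along(group)){
        
         # rows belonging to group i
         beg = line.nb
         end = line.nb + group[i]-1
         temp = df[beg:end,]
        
         # keep the row(s) whose C equals the group maximum
         res = temp[which(temp[,"C"] ==max(temp[,"C"])), ] 
         out = rbind(out,res)
        
         # move on to the first row of the next group
         line.nb = line.nb+group[i]
        }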
        
        out
        

        【Comments】:
