【问题标题】:Find the most frequent value by row按行查找最频繁的值
【发布时间】:2013-11-27 18:36:32
【问题描述】:

我的问题如下:

我有一个包含多个因子变量的数据集,它们具有相同的类别。我需要找到每一行最常出现的类别。在平局的情况下,可以选择任意值,但如果我能对它有更多的控制权那就太好了。

我的数据集包含一百多个因素。但是,结构是这样的:

df = data.frame(id = 1:3
                var1 = c("red","yellow","green")
                var2 = c("red","yellow","green")
                var3 = c("yellow","orange","green")
                var4 = c("orange","green","yellow"))

df
#   id   var1   var2   var3   var4
# 1  1    red    red yellow orange
# 2  2 yellow yellow orange  green
# 3  3  green  green  green yellow

解决方案应该是数据框中的一个变量,例如 var5,它包含每行最常见的类别。可以是因子也可以是数值向量(以防数据需要先转换成数值向量)

在这种情况下,我想要这个解决方案:

df$var5
# [1] "red"    "yellow" "green" 

任何建议将不胜感激!提前致谢!

【问题讨论】:

    标签: r count mode factors


    【解决方案1】:

    对于内部包,我创建了一个rowMode-函数,您可以在其中选择如何处理关联和缺失值:

    rowMode <- function(x, ties = NULL, include.na = FALSE) {
      # input checks data
      if ( !(is.matrix(x) | is.data.frame(x)) ) {
        stop("Your data is not a matrix or a data.frame.")
      }
      # input checks ties method
      if ( !is.null(ties) && !(ties %in% c("random", "first", "last")) ) {
        stop("Your ties method is not one of 'random', 'first' or 'last'.")
      }
      # set ties method to 'random' if not specified
      if ( is.null(ties) ) ties <- "random"
      
      # create row frequency table
      rft <- table(c(row(x)), unlist(x), useNA = c("no","ifany")[1L + include.na])
      
      # get the mode for each row
      colnames(rft)[max.col(rft, ties.method = ties)]
    }
    

    几种可能的输出(基于不同的参数选项):

    > rowMode(DF[,-1])
     [1] "B" "E" "B" "E" "B" "C" "B" "E" "A" "E"
    > rowMode(DF[,-1], ties = "first")
     [1] "B" "B" "B" "A" "B" "C" "B" "E" "A" "E"
    > rowMode(DF[,-1], ties = "first", include.na = TRUE)
     [1] "B" NA  "B" NA  "B" "C" "B" "E" "A" "E"
    > rowMode(DF[,-1], ties = "last", include.na = TRUE)
     [1] "B" NA  NA  NA  "B" "C" "B" "E" "D" "E"
    > rowMode(DF[,-1], ties = "last")
     [1] "B" "C" "B" "E" "B" "C" "B" "E" "D" "E"
    

    使用过的数据:

    set.seed(2020)
    DF <- data.frame(id = 1:10, matrix(sample(c(LETTERS[1:5], NA_character_), 60, TRUE), ncol = 6))
    

    【讨论】:

      【解决方案2】:

      这是另一个基本 R 选项:

      tab <- table(data.frame(as.vector(row(df[,-1L])), unlist(df[,-1L])))
      colnames(tab)[max.col(tab, "first")]
      

      或另一种data.table 方法:

      melt(as.data.table(df), id.vars="id")[
          order(id, value), ri := rowid(rleid(value))][,
              value[which.max(ri)], id]$V1
      

      计时码:

      library(data.table)
      set.seed(0L)
      nr <- 1e5L
      nc <- 4L
      DF <- data.frame(id=1L:nr, as.data.frame(matrix(sample(letters, nr*nc, TRUE), ncol=nc)))
      DT <- as.data.table(DF)
      
      mtd0 <- function(df) apply(df,1,function(x) names(which.max(table(x))))
      
      Mode <- function(x) {
          ux <- unique(x)
          ux[which.max(tabulate(match(x, ux)))]
      }
      
      mtd_dt <- function(dt) melt(dt, id.vars="id")[, Mode(value), id]$V1
      
      mtd_dt2 <- function(dt) melt(dt, id.vars="id")[
          order(id, value), ri := rowid(rleid(value))][,
              value[which.max(ri)], id]$V1
      
      mtd2 <- function(df) {
          tab <- table(data.frame(as.vector(row(df[,-1L])), unlist(df[,-1L])))
          colnames(tab)[max.col(tab, "first")]
      }
      
      df = data.frame(id = 1:3,
          var1 = c("red","yellow","green"),
          var2 = c("red","yellow","green"),
          var3 = c("yellow","orange","green"),
          var4 = c("orange","green","yellow"))
      
      a0 <- mtd0(df)
      identical(a0, mtd_dt(as.data.table(df)))
      #[1] TRUE
      
      identical(a0, mtd2(df))
      #[1] TRUE
      
      identical(a0, mtd_dt2(as.data.table(df)))
      #[1] TRUE
      
      microbenchmark::microbenchmark(times=1L, mtd0(DF), mtd_dt(DT), mtd_dt2(DT), mtd2(DF))
      

      时间安排:

      Unit: milliseconds
              expr        min         lq       mean     median         uq        max neval
          mtd0(DF) 10083.9941 10083.9941 10083.9941 10083.9941 10083.9941 10083.9941     1
        mtd_dt(DT)  1056.2319  1056.2319  1056.2319  1056.2319  1056.2319  1056.2319     1
       mtd_dt2(DT)   168.6183   168.6183   168.6183   168.6183   168.6183   168.6183     1
          mtd2(DF)   519.2030   519.2030   519.2030   519.2030   519.2030   519.2030     1
      

      【讨论】:

        【解决方案3】:

        如果您的数据非常大,您可能需要考虑使用data.table 包。

        # Generate the data
        nrow <- 10^5
        id <- 1:nrow
        colors <- c("red","yellow","green")
        var1 <- sample(colors, nrow, replace = TRUE)
        var2 <- sample(colors, nrow, replace = TRUE)
        var3 <- sample(colors, nrow, replace = TRUE)
        var4 <- sample(colors, nrow, replace = TRUE)
        
        Mode <- function(x) {
            ux <- unique(x)
            ux[which.max(tabulate(match(x, ux)))]
        }
        

        Chargaff 的解决方案很简单,并且在某些情况下效果很好。使用data.table 可以获得小幅性能提升(~20%)。

        df <- data.frame(cbind(id, var1, var2, var3, var4))
        system.time(apply(df, 1, Mode))
        #   user  system elapsed
        #  1.242   0.018   1.264
        
        library(data.table)
        dt <- data.table(cbind(id, var1, var2, var3, var4))
        system.time(melt(dt, measure = patterns('var'))[, Mode(value1), by = id])
        #   user  system elapsed
        #  1.020   0.012   1.034
        

        【讨论】:

        • 请注意,如果NA 是最频繁的元素,则此Mode 函数将返回NA,而names(which.max(table(x))) 将返回最频繁的非NA 元素
        【解决方案4】:

        类似:

        apply(df,1,function(x) names(which.max(table(x))))
        [1] "red"    "yellow" "green" 
        

        如果出现平局,which.max 取第一个最大值。来自 which.max 帮助页面:

        确定位置,即(第一个)的索引 数值向量的最小值或最大值。

        例如:

        var4 <- c("yellow","green","yellow")
        df <- data.frame(cbind(id, var1, var2, var3, var4))
        
        > df
          id   var1   var2   var3   var4
        1  1    red    red yellow yellow
        2  2 yellow yellow orange  green
        3  3  green  green  green yellow
        
        apply(df,1,function(x) names(which.max(table(x))))
        [1] "red"    "yellow" "green" 
        

        【讨论】:

        • 干得好,比我的干净。没有意识到我可以跳过所有的转换、取消列表等。
        • 非常感谢您提供此解决方案。我只是在我自己的数据上尝试过它,它工作得很好!请您为我澄清一下,这种方法如何解决关系?谢谢!
        • 我编辑了我的答案以用平局说明案例。学习如何使用帮助页面是一个好习惯。很高兴我的解决方案对您有用。
        • 非常感谢 - 我真的很感激。是的,我同意你对帮助页面的价值的看法——下次我一定会检查帮助页面。
        猜你喜欢
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 1970-01-01
        • 2018-10-01
        • 2016-07-05
        • 2021-12-09
        • 1970-01-01
        • 1970-01-01
        相关资源
        最近更新 更多