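【Question title】: How to group_by and summarize multiple variables using regex?
【Posted】: 2020-02-16 12:05:06
【Question】:

I want to use regular expressions to identify the variables to group_by and to summarize my data efficiently. I can't do this one variable at a time, because I have a large number of variables to summarize and the group_by variables need to be passed dynamically each time. data.table accepts grouping variables passed via regex, but not the summary variables. My attempts with the tidyverse have not succeeded so far either. Any help would be appreciated.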

My data:

    tempDF <- structure(list(d1 = c("A", "B", "C", "A", "C"), d2 = c(40L, 50L, 20L, 50L, 20L), 
        d3 = c(20L, 40L, 50L, 40L, 50L), d4 = c(60L, 30L, 30L,60L, 30L), p_A = c(1L, 
        3L, 2L, 3L, 2L), p_B = c(3L, 4L, 3L, 3L, 4L), p_C = c(2L, 1L, 1L,2L, 1L), p4 = c(5L, 
        5L, 4L, 5L, 4L)), class = "data.frame", row.names = c(NA, -5L))

    View(tempDF)    
    lLevels<-c("d1")
    lContinuum<-c("p_A", "p_B", "p_C")


My attempts:

    setDT(tempDF)[ , list(group_means = mean(eval((paste0(lContinuum)))), by=eval((paste0(lLevels))))] 
       group_means by
    1:          NA d1
    Warning message:
    In mean.default(eval((paste0(lContinuum)))) :
      argument is not numeric or logical: returning NA

    But a single variable works:
    setDT(tempDF)[ , list(group_means = mean(p_A)), by=eval((paste0(lLevels)))]                                            
    setDT(tempDF)[ , list(group_means = mean(p_B)), by=eval((paste0(lLevels)))]                                            
    setDT(tempDF)[ , list(group_means = mean(p_C)), by=eval((paste0(lLevels)))]                                            


Expected output:

    tempDF %>%
    group_by(d1) %>%
    summarise(p_A_mean = mean(p_A), p_B_mean = mean(p_B), p_C_mean = mean(p_C))

    # A tibble: 3 x 4
      d1    p_A_mean p_B_mean p_C_mean
      <chr>    <dbl>    <dbl>    <dbl>
    1 A            2      3          2
    2 B            3      4          1
    3 C            2      3.5        1

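【Question comments】:

    Tags: r regex data.table tidyverse summarize


    【Solution 1】:

    Although it looks a bit roundabout, reshaping the data to long form lets you group not only by d1 but also by the many p_A ... p_C variables in the dataset.

    Edit: also added code to keep certain columns (d_cols) via regex.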

    library(tidyverse)
    
    tempDF <- structure(
      list(d1 = c("A", "B", "C", "A", "C"), 
           d2 = c(40L, 50L, 20L, 50L, 20L), 
           d3 = c(20L, 40L, 50L, 40L, 50L), 
           d4 = c(60L, 30L, 30L,60L, 30L),
           d5 = c("AA", "BB", "CC", "AA", "CC"), 
           p_A = c(1L, 3L, 2L, 3L, 2L), 
           p_B = c(3L, 4L, 3L, 3L, 4L), 
           p_C = c(2L, 1L, 1L,2L, 1L), 
           p4 = c(5L, 5L, 4L, 5L, 4L)), 
      class = "data.frame", 
      row.names = c(NA, -5L))
    
    # columns of d to keep, in strings
    d_cols <- str_subset(colnames(tempDF), "d[15]")
    
    tempDF %>% 
      pivot_longer(cols = matches("p_")) %>% 
      group_by(!!!syms(d_cols), name) %>% 
      summarize(mean  = mean(value)) %>% 
      pivot_wider(id_cols = d_cols,
                  values_from = mean,
                  names_prefix = "mean_")
    #> # A tibble: 3 x 5
    #> # Groups:   d1, d5 [3]
    #>   d1    d5    mean_p_A mean_p_B mean_p_C
    #>   <chr> <chr>    <dbl>    <dbl>    <dbl>
    #> 1 A     AA           2      3          2
    #> 2 B     BB           3      4          1
    #> 3 C     CC           2      3.5        1
    

    Created on 2019-10-19 by the reprex package (v0.3.0)
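If reshaping feels like a detour, the same regex-driven grouping and summarizing can probably be done in one step with `across()`. This is only a sketch and assumes dplyr 1.0.0 or later, which the answer above does not use; the anchored patterns `"^d[15]$"` and `"^p_"` and the name `res_across` are illustrative choices, not part of the original answer.

```r
library(dplyr)

# Reduced copy of the example data (only the columns used here)
tempDF <- data.frame(
  d1  = c("A", "B", "C", "A", "C"),
  d5  = c("AA", "BB", "CC", "AA", "CC"),
  p_A = c(1L, 3L, 2L, 3L, 2L),
  p_B = c(3L, 4L, 3L, 3L, 4L),
  p_C = c(2L, 1L, 1L, 2L, 1L),
  p4  = c(5L, 5L, 4L, 5L, 4L)
)

res_across <- tempDF %>%
  # grouping columns selected by regex (assumed pattern)
  group_by(across(matches("^d[15]$"))) %>%
  # summary columns selected by regex; p4 does not match "^p_"
  summarise(across(matches("^p_"), mean, .names = "mean_{.col}"),
            .groups = "drop")

res_across
```

Anchoring the patterns with `^` and `$` keeps a loose pattern such as `"d[15]"` from picking up unrelated columns whose names merely contain `d1` or `d5`.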

    【Comments】:

    • Thank you, @shiro. I would also like to pass the grouping variable d1 dynamically. Any ideas?
    • @Krantz. OK, edited to use non-standard evaluation.
    【Solution 2】:

    I'm sure this could be made more efficient or concise, but it meets the spec:

    summarise_df <- function(df, grouping_var){
    
      # Store the grouping variable's name as a string (strips a leading "df$"):
    
      grouping_vec <- gsub(".*[$]", "", deparse(substitute(grouping_var)))
    
      # Split-apply-combine: return a list with one vector of means per group:
    
      lapply(split(df[, sapply(df, is.numeric)], df[, grouping_vec]),
             function(x){sapply(x, function(y){mean(y)})})
    
    }
    
    # Assumes tempDF is still a plain data.frame (run this before setDT())
    tmp <- do.call(rbind, summarise_df(tempDF, tempDF$d1))
    
    # Build the data frame directly so the means stay numeric
    tmp_df <- data.frame(d1 = row.names(tmp), tmp, row.names = NULL)
    

    With the summary variables passed dynamically as well:

    summarise_df <- function(df, grouping_var, summary_vars){
    
      # Store the grouping variable's name as a string (strips a leading "df$"):
    
      grouping_vec <- gsub(".*[$]", "", deparse(substitute(grouping_var)))
    
      # Split-apply-combine: return a list with one vector of means per group:
    
      lapply(split(df[, summary_vars], df[, grouping_vec]),
             function(x){sapply(x, function(y){mean(y)})})
    
    }
    
    # Assumes tempDF is still a plain data.frame (run this before setDT())
    tmp <- do.call(rbind, summarise_df(tempDF, tempDF$d1, c("p_A", "p_B", "p_C")))
    
    # Build the data frame directly so the means stay numeric
    tmp_df <- data.frame(d1 = row.names(tmp), tmp, row.names = NULL)
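Since the question asks for regex-based selection, a shorter base R variant may also be worth considering. The sketch below is not the answer's method: it swaps in `grep()` to pick columns by pattern and `aggregate()` to compute the group means. The names `grp_cols`, `sum_cols`, and `res_base` and the two patterns are assumptions made for illustration.

```r
# Example data from the question, as a plain data.frame
tempDF <- data.frame(
  d1  = c("A", "B", "C", "A", "C"),
  d2  = c(40L, 50L, 20L, 50L, 20L),
  d3  = c(20L, 40L, 50L, 40L, 50L),
  d4  = c(60L, 30L, 30L, 60L, 30L),
  p_A = c(1L, 3L, 2L, 3L, 2L),
  p_B = c(3L, 4L, 3L, 3L, 4L),
  p_C = c(2L, 1L, 1L, 2L, 1L),
  p4  = c(5L, 5L, 4L, 5L, 4L)
)

# Column names picked by regular expression (assumed patterns)
grp_cols <- grep("^d1$", names(tempDF), value = TRUE)  # grouping variables
sum_cols <- grep("^p_",  names(tempDF), value = TRUE)  # variables to average

# aggregate() takes the grouping columns as a list; a data.frame is a list
res_base <- aggregate(tempDF[sum_cols], by = tempDF[grp_cols], FUN = mean)

res_base
```

Indexing with `tempDF[cols]` rather than `tempDF[, cols]` keeps a data.frame even when a pattern matches a single column.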
    

    【Comments】:

    • Awesome. Thank you.
    • @Krantz No worries!
    【Solution 3】:

    The data.table approach is quite simple:

    library(data.table)
    
    setDT(tempDF)
    
    tempDF[, lapply(.SD, mean),
             by = lLevels,
            .SDcols = lContinuum]
    
       d1 p_A p_B p_C
    1:  A   2 3.0   2
    2:  B   3 4.0   1
    3:  C   2 3.5   1
    

    A similar approach in dplyr is:

    library(dplyr)
    tempDF%>%
      group_by_at(lLevels)%>%
      summarize_at(lContinuum, mean)
    
    # A tibble: 3 x 4
      d1      p_A   p_B   p_C
      <chr> <dbl> <dbl> <dbl>
    1 A         2   3       2
    2 B         3   4       1
    3 C         2   3.5     1
    

    In either case, you can replace lLevels and lContinuum with regular expressions. The dplyr option also allows select helpers such as starts_with() and ends_with(); see:

    https://www.rdocumentation.org/packages/tidyselect/versions/0.2.5/topics/select_helpers
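As a rough illustration of swapping the name vectors for patterns, the sketch below selects columns by regex in both packages. It assumes data.table 1.12.0 or later for `patterns()` inside `.SDcols`; the patterns and the names `DT`, `grp_cols`, `res_dt`, and `res_dplyr` are hypothetical and not from the answer.

```r
library(data.table)
library(dplyr)

# Reduced copy of the example data
tempDF <- data.frame(
  d1  = c("A", "B", "C", "A", "C"),
  d2  = c(40L, 50L, 20L, 50L, 20L),
  p_A = c(1L, 3L, 2L, 3L, 2L),
  p_B = c(3L, 4L, 3L, 3L, 4L),
  p_C = c(2L, 1L, 1L, 2L, 1L),
  p4  = c(5L, 5L, 4L, 5L, 4L)
)

# data.table: resolve the grouping names with grep(), and let patterns()
# pick the summary columns inside .SDcols
DT <- as.data.table(tempDF)
grp_cols <- grep("^d1$", names(DT), value = TRUE)
res_dt <- DT[, lapply(.SD, mean), by = grp_cols, .SDcols = patterns("^p_")]

# dplyr: the *_at verbs accept select helpers wrapped in vars()
res_dplyr <- tempDF %>%
  group_by_at(vars(matches("^d1$"))) %>%
  summarize_at(vars(starts_with("p_")), mean)

res_dt
res_dplyr
```

`as.data.table()` makes a copy, so `tempDF` stays a plain data.frame for the dplyr call, whereas `setDT()` would convert it in place.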

    【Comments】:
