【Question title】: For loop over a group_by when passing in a variable name
【Posted】: 2021-04-26 08:49:28
【Question】:

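I am trying to write an R function that extracts the proportion of positives in a community based on a column value. More specifically, I have a dataset in which each row is a person. To simplify, columns 1-5 hold information about their individual characteristics, column 6 holds the ZIP code, column 7 the phone number they called when reporting positive, column 8 the day of the week, and column 9 the state. The goal is to compute the proportion and number of positives aggregated at the level of ZIP code, phone number, day of week, and state. For any single category, I have successfully used code from https://edwinth.github.io/blog/dplyr-recipes/ to build a group-and-summarize function (below). Given a data frame and a column name, it groups by the distinct values in that column and summarizes the count and proportion of positives.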

group_and_summarize <- function(x, ...) {
  grouping = rlang::quos(...)
  temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(positive, na.rm = TRUE), number = n()) 
  temp = temp %>% filter(!is.na(!!!grouping))
  colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
  colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
  return(temp)
}

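The problem is that this code fails completely when I try to aggregate across multiple columns. I currently have four fields to group by, but I expect about 15 columns once the data is fully collected. My strategy here is to store each result as a separate element of a list for later use. I tried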

output = vector(mode = "list", length = length(aggregate_cols)) #aggregate_cols lists columns needing count and proportion.
    #aggregate_cols = c("ZIP_CODE", "PHONE_NUMBER", "DAY", "STATE")
for(i in 1:length(aggregate_cols)){
output[i] = group_and_summarize(df,aggregate_cols[i])
          }

but I get the following warning messages:

Warning messages:
1: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length
2: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length
3: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length
4: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
  number of items to replace is not a multiple of replacement length

Testing the first value:

> i=1
> group_and_summarize(df,aggregate_cols[i])
# A tibble: 1 x 3
  `aggregate_cols[i]`  proportion number
  <chr>                 <dbl>  <int>
1 ZIP_CODE              0.168   5600

Any ideas on how to fix this? I can't think of a good approach involving map or the apply family of functions, though I am open to those.

Edit:

Reproducible code follows.

group_and_summarize_demo <- function(x, ...) {
  grouping = quos(...)
  temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(am, na.rm = TRUE), number = n()) 
  temp = temp %>% filter(!is.na(!!!grouping))
  colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
  colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
  return(temp)
}

cars_cols = c("gear", "cyl")
output = vector(mode = "list", length = length(cars_cols))
for(i in 1:length(cars_cols)){
  output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
}


> group_and_summarize_demo(mtcars, cyl)
# A tibble: 3 x 3
    cyl cyl_proportion cyl_count
  <dbl>          <dbl>     <int>
1     4          0.727        11
2     6          0.429         7
3     8          0.143        14
> cars_cols = c("gear", "cyl")
> output = vector(mode = "list", length = length(cars_cols))
> for(i in 1:length(cars_cols)){
+   output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
+ }
 Error in UseMethod("group_by_") : 
  no applicable method for 'group_by_' applied to an object of class "function" 
> cars_cols[1]
[1] "gear"
> group_and_summarize_demo(mtcars, cars_cols[1])
# A tibble: 1 x 3
  `cars_cols[1]` `cars_cols[1]_proportion` `cars_cols[1]_count`
  <chr>                              <dbl>                <int>
1 gear                               0.406                   32

I don't understand why this behaves differently from running group_and_summarize_demo(mtcars, cyl); I suspect that understanding this will resolve the error.

【Question comments】:

It would be easier to help if you created a small reproducible example along with the expected output. Read how to give a reproducible example.

Tags: r dplyr lapply

【Answer 1】:

Outside the loop, you pass the name directly to the function:

group_and_summarize_demo(mtcars, cyl)

In your loop, however, you pass the name as a string:

group_and_summarize_demo(mtcars, "cyl") #error

Strings are indeed easier to work with in this setup. To make it work, you should use syms() rather than quos():

group_and_summarize_demo <- function(x, ..., quosure=TRUE) {
  if(quosure)
    grouping = quos(...)
  else
    grouping = syms(...)
  temp = x %>% 
    group_by(!!!grouping) %>% 
    summarise(proportion = mean(am, na.rm = TRUE), number = n()) 
  temp = temp %>% filter(!is.na(!!!grouping))
  colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
  colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
  return(temp)
}

group_and_summarize_demo(mtcars, cyl)
group_and_summarize_demo(mtcars, "cyl", quosure=FALSE)

Obviously, in your final code you should pick one of the two and stick with it.
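
To see why the string version collapses everything into a single row, it can help to compare what each helper produces. This is only a minimal sketch, assuming dplyr and rlang are installed; the object names by_name, by_string and by_symbol are made up for the illustration.

```r
library(dplyr)
library(rlang)

# A bare name captured as a quosure refers to the column itself.
by_name <- mtcars %>% group_by(!!!quos(cyl)) %>% summarise(n = n())

# A string captured as a quosure stays a constant string, so every row
# lands in one group whose key is the literal value "cyl".
by_string <- mtcars %>% group_by(!!!quos("cyl")) %>% summarise(n = n())

# syms() converts the string into a symbol, which refers to the column again.
by_symbol <- mtcars %>% group_by(!!!syms("cyl")) %>% summarise(n = n())

nrow(by_name)    # 3 groups, one per cylinder count
nrow(by_string)  # 1 group
nrow(by_symbol)  # 3 groups
```

The one-row result for the string case matches what the question shows for group_and_summarize_demo(mtcars, cars_cols[1]).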

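Edit:

If you only pass one variable at a time, using the ellipsis is overkill and complicates things. Also, your example doesn't seem to work with multiple variables (group_and_summarize_demo(mtcars, cyl, vs)). You may want to consider the following improvements: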

library(tidyverse)

group_and_summarize_demo <- function(x, gp_col) {
  gp_col = sym(gp_col)
  temp = x %>% 
    group_by(!!gp_col) %>% 
    summarise("{{gp_col}}_proportion" := mean(am, na.rm = TRUE), 
              "{{gp_col}}_count" := n()) %>% 
    filter(!is.na(!!gp_col))
  temp
}

c("gear", "cyl") %>%  
  map(~group_and_summarize_demo(mtcars, .x)) #try map_dfc() also
#> [[1]]
#> # A tibble: 3 x 3
#>    gear gear_proportion gear_count
#>   <dbl>           <dbl>      <int>
#> 1     3           0             15
#> 2     4           0.667         12
#> 3     5           1              5
#> 
#> [[2]]
#> # A tibble: 3 x 3
#>     cyl cyl_proportion cyl_count
#>   <dbl>          <dbl>     <int>
#> 1     4          0.727        11
#> 2     6          0.429         7
#> 3     8          0.143        14

Created on 2021-04-27 by the reprex package (v2.0.0)

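Here I use the := operator together with the name-templating feature of dplyr::summarise(). I also use purrr::map() instead of a for loop; inside it, the current element is referred to as .x.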
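
If you would rather keep the for loop, two changes appear to be needed, as far as I can tell from the question. First, pass mtcars rather than df in the reproducible example: with no data frame called df defined, the name most likely resolves to the function stats::df, which would explain the group_by_ error. Second, assign with [[ rather than [, because a tibble is itself a list of columns and output[i] tries to spread those columns over a single slot, which is what triggers the "number of items to replace" warning. A sketch under those assumptions follows; summarize_by is a hypothetical name, and across(all_of()) is swapped in here as an alternative to sym().

```r
library(dplyr)

# Hypothetical helper: takes the grouping column as a string.
summarize_by <- function(x, gp_col) {
  x %>%
    filter(!is.na(.data[[gp_col]])) %>%      # drop rows with a missing key
    group_by(across(all_of(gp_col))) %>%     # group by the column named in gp_col
    summarise(
      "{gp_col}_proportion" := mean(am, na.rm = TRUE),
      "{gp_col}_count" := n()
    )
}

cars_cols <- c("gear", "cyl")
output <- vector(mode = "list", length = length(cars_cols))
names(output) <- cars_cols

for (i in seq_along(cars_cols)) {
  # `[[` stores the whole tibble as a single list element
  output[[i]] <- summarize_by(mtcars, cars_cols[i])
}

output$cyl
```

Naming the list elements up front makes each result retrievable by column name later, which seems to fit the plan of storing roughly 15 summaries for later use.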

【Comments】: