【Posted】:2021-04-26 08:49:28
【Question】:
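I am trying to write an R function that extracts the proportion of positive cases in a community based on the values of a column. More specifically, I have a dataset in which each row is one person. To simplify, columns 1-5 hold information about their personal characteristics, column 6 holds the ZIP code, column 7 the phone number they called to report a positive result, column 8 the day of the week, and column 9 the state. The goal is to compute the proportion and count of positives aggregated at the level of ZIP code, phone number, day of week, and state. For any single category, I have successfully used code from https://edwinth.github.io/blog/dplyr-recipes/ to build a group-and-summarize function (below). Given a data frame and a column name, it groups by the distinct values in that column and summarizes the count and proportion of positives.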
group_and_summarize <- function(x, ...) {
grouping = rlang::quos(...)
temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(positive, na.rm = TRUE), number = n())
temp = temp %>% filter(!is.na(!!!grouping))
colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
return(temp)
}
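The problem is that this code fails entirely when I try to aggregate over multiple columns. I currently have four fields to group by, but I expect about 15 columns once the data is fully collected. My strategy is to store the result for each column as a separate element of a list for later use. I tried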
output = vector(mode = "list", length = length(aggregate_cols)) #aggregate_cols lists columns needing count and proportion.
#aggregate_cols = c("ZIP_CODE", "PHONE_NUMBER", "DAY", "STATE")
for(i in 1:length(aggregate_cols)){
output[i] = group_and_summarize(df,aggregate_cols[i])
}
but received the following warning messages
Warning messages:
1: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
number of items to replace is not a multiple of replacement length
2: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
number of items to replace is not a multiple of replacement length
3: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
number of items to replace is not a multiple of replacement length
4: In output[i] <- group_and_summarize(df, aggregate_cols[i]) :
number of items to replace is not a multiple of replacement length
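The warning text is consistent with how base R handles single-bracket assignment into a list: a data frame (or tibble) is itself a list of columns, so `output[i] <- ...` tries to fit three columns into one slot and keeps only the first. A minimal base R sketch, using a made-up data frame `result` as a stand-in for the function's return value (the values are illustrative only):

```r
# Hypothetical stand-in for what group_and_summarize() returns.
result <- data.frame(
  key = c("a", "b"),
  proportion = c(0.25, 0.75),
  number = c(4L, 4L),
  stringsAsFactors = FALSE
)

output <- vector(mode = "list", length = 2)

# Single brackets: the 3 columns are spread over 1 slot, so only the
# first column is kept and R warns about the replacement length.
suppressWarnings(output[1] <- result)
is.data.frame(output[[1]])  # FALSE, only the `key` column was stored

# Double brackets: the whole data frame becomes one list element.
output[[2]] <- result
is.data.frame(output[[2]])  # TRUE
```

If this is what is happening in the loop, switching to `output[[i]] <- ...` would remove the warnings, although it would not by itself change what the function groups by.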
Testing the first value:
> i=1
> group_and_summarize(df,aggregate_cols[i])
# A tibble: 1 x 3
`aggregate_cols[i]` proportion number
<chr> <dbl> <int>
1 ZIP_CODE 0.168 5600
Any ideas on how to fix this? I can't think of a good approach involving map or the apply family of functions, though I'm open to those.
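One possible shape for an apply-style version, sketched on `mtcars` with `am` standing in for the `positive` column. The helper name `group_and_summarize_chr` and its `outcome` argument are hypothetical, and the sketch assumes dplyr 1.0 or later (for `across()`, `all_of()`, and the `.groups` argument). It takes the grouping column as a string, so a character vector of column names can be passed to `lapply()` directly:

```r
library(dplyr)

# Hypothetical helper: `col` and `outcome` are column names given as strings.
group_and_summarize_chr <- function(x, col, outcome = "positive") {
  out <- x %>%
    filter(!is.na(.data[[col]])) %>%          # drop rows with a missing group value
    group_by(across(all_of(col))) %>%         # group by the column named in `col`
    summarise(
      proportion = mean(.data[[outcome]], na.rm = TRUE),
      number = n(),
      .groups = "drop"
    )
  # Rename the summary columns after the grouping column.
  names(out)[2:3] <- paste0(col, c("_proportion", "_count"))
  out
}

cols <- c("gear", "cyl")
output <- lapply(cols, group_and_summarize_chr, x = mtcars, outcome = "am")
names(output) <- cols   # one summary table per grouping column

output$cyl
```

Because `lapply()` returns a list with one element per column name, there is no need to pre-allocate `output` or to index into it inside a loop.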
Edit:
Reproducible code follows.
group_and_summarize_demo <- function(x, ...) {
grouping = quos(...)
temp = x %>% group_by(!!!grouping) %>% summarise(proportion = mean(am, na.rm = TRUE), number = n())
temp = temp %>% filter(!is.na(!!!grouping))
colnames(temp)[2] = paste0(colnames(temp)[1], "_proportion")
colnames(temp)[3] = paste0(colnames(temp)[1], "_count")
return(temp)
}
cars_cols = c("gear", "cyl")
output = vector(mode = "list", length = length(cars_cols))
for(i in 1:length(cars_cols)){
output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
}
> group_and_summarize_demo(mtcars, cyl)
# A tibble: 3 x 3
cyl cyl_proportion cyl_count
<dbl> <dbl> <int>
1 4 0.727 11
2 6 0.429 7
3 8 0.143 14
> cars_cols = c("gear", "cyl")
> output = vector(mode = "list", length = length(cars_cols))
> for(i in 1:length(cars_cols)){
+ output[i] = group_and_summarize_demo(df,cars_cols[i]) #group_and_summarize gets count and proportion
+ }
Error in UseMethod("group_by_") :
no applicable method for 'group_by_' applied to an object of class "function"
> cars_cols[1]
[1] "gear"
> group_and_summarize_demo(mtcars, cars_cols[1])
# A tibble: 1 x 3
`cars_cols[1]` `cars_cols[1]_proportion` `cars_cols[1]_count`
<chr> <dbl> <int>
1 gear 0.406 32
I don't understand why this differs from running group_and_summarize_demo(mtcars, cyl); I suspect that understanding this would resolve the error.
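A likely explanation, offered as a sketch rather than a confirmed diagnosis: `quos(...)` captures the expression it is given. A bare `cyl` is looked up as a column of the data, whereas `cars_cols[1]` is not a column, so it evaluates to the constant string `"gear"` and every row lands in a single group. The snippet below reproduces both behaviours on `mtcars`; the variable names `col`, `by_constant`, and `by_column` are illustrative:

```r
library(dplyr)

col <- "cyl"

# `col` is not a column of mtcars, so it is found in the calling
# environment and used as a constant: one group containing every row.
by_constant <- mtcars %>% group_by(col) %>% summarise(number = n())
nrow(by_constant)   # 1

# Converting the string to a symbol makes dplyr look up the column itself.
by_column <- mtcars %>% group_by(!!rlang::sym(col)) %>% summarise(number = n())
nrow(by_column)     # 3, one row per distinct value of cyl

# Separately: the loop above passes `df`, not `mtcars`. If no object named
# `df` was created in the session, R finds the F-distribution density
# function stats::df, which would explain 'object of class "function"'.
is.function(stats::df)  # TRUE
```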
【Comments】:
- It would be easier to help if you created a small reproducible example along with the expected output. Read how to give a reproducible example.