基于R中两组的最大值编码虚拟变量？答案

【问题标题】：Coding dummy variable based on max value of two groups in R?基于R中两组的最大值编码虚拟变量？
【发布时间】：2020-03-23 20:47:12
【问题描述】：

在this question 的基础上，我尝试为每个组的最大值创建相同的虚拟变量，但我不只是按月份列分组，而是希望按月份和年份进行分组，所以我得到了两者的最大值，即 1-2017, 2-2017 等最大值的虚拟变量。

我在这里创建了一个演示数据集以及一些代码，当我尝试按两列分组时显示了不希望的结果。

df<- data.frame(date= seq.Date(from = as.Date('2017-01-01'), by= 14, length.out = 100), var1= rnorm(20, 5, 3))
df$month<- as.numeric(strftime(df$date, "%m"))
df$year <- as.numeric(strftime(df$date, "%Y"))

df <- df %>%
    group_by(year, month) %>%
    mutate(dummy = as.integer(var1 == max(var1))) %>%
    ungroup

当我运行这段代码时，这是我得到的输出。

structure(list(date = structure(c(17167L, 17181L, 17195L, 17209L, 
17223L, 17237L, 17251L, 17265L, 17279L, 17293L, 17307L, 17321L, 
17335L, 17349L, 17363L, 17377L, 17391L, 17405L, 17419L, 17433L, 
17447L, 17461L, 17475L, 17489L, 17503L, 17517L, 17531L, 17545L, 
17559L, 17573L, 17587L, 17601L, 17615L, 17629L, 17643L, 17657L, 
17671L, 17685L, 17699L, 17713L, 17727L, 17741L, 17755L, 17769L, 
17783L, 17797L, 17811L, 17825L, 17839L, 17853L, 17867L, 17881L, 
17895L, 17909L, 17923L, 17937L, 17951L, 17965L, 17979L, 17993L, 
18007L, 18021L, 18035L, 18049L, 18063L, 18077L, 18091L, 18105L, 
18119L, 18133L, 18147L, 18161L, 18175L, 18189L, 18203L, 18217L, 
18231L, 18245L, 18259L, 18273L, 18287L, 18301L, 18315L, 18329L, 
18343L, 18357L, 18371L, 18385L, 18399L, 18413L, 18427L, 18441L, 
18455L, 18469L, 18483L, 18497L, 18511L, 18525L, 18539L, 18553L
), class = "Date"), var1 = c(7.610410421533, 4.39838519593849, 
7.29367017299385, 4.65490632032971, 5.98987372646137, 7.89362434824241, 
7.37466247923975, 3.55735383387344, 2.04726523393801, 7.85753545892588, 
8.9757840439474, 3.03010111198631, 8.12870854721674, 4.94313056382185, 
10.3008527572542, 3.38195982516332, 2.38095744572549, 2.33860273928046, 
4.55108829246478, 10.3054094468083, 7.610410421533, 4.39838519593849, 
7.29367017299385, 4.65490632032971, 5.98987372646137, 7.89362434824241, 
7.37466247923975, 3.55735383387344, 2.04726523393801, 7.85753545892588, 
8.9757840439474, 3.03010111198631, 8.12870854721674, 4.94313056382185, 
10.3008527572542, 3.38195982516332, 2.38095744572549, 2.33860273928046, 
4.55108829246478, 10.3054094468083, 7.610410421533, 4.39838519593849, 
7.29367017299385, 4.65490632032971, 5.98987372646137, 7.89362434824241, 
7.37466247923975, 3.55735383387344, 2.04726523393801, 7.85753545892588, 
8.9757840439474, 3.03010111198631, 8.12870854721674, 4.94313056382185, 
10.3008527572542, 3.38195982516332, 2.38095744572549, 2.33860273928046, 
4.55108829246478, 10.3054094468083, 7.610410421533, 4.39838519593849, 
7.29367017299385, 4.65490632032971, 5.98987372646137, 7.89362434824241, 
7.37466247923975, 3.55735383387344, 2.04726523393801, 7.85753545892588, 
8.9757840439474, 3.03010111198631, 8.12870854721674, 4.94313056382185, 
10.3008527572542, 3.38195982516332, 2.38095744572549, 2.33860273928046, 
4.55108829246478, 10.3054094468083, 7.610410421533, 4.39838519593849, 
7.29367017299385, 4.65490632032971, 5.98987372646137, 7.89362434824241, 
7.37466247923975, 3.55735383387344, 2.04726523393801, 7.85753545892588, 
8.9757840439474, 3.03010111198631, 8.12870854721674, 4.94313056382185, 
10.3008527572542, 3.38195982516332, 2.38095744572549, 2.33860273928046, 
4.55108829246478, 10.3054094468083), month = c(1, 1, 1, 2, 2, 
3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 
12, 12, 12, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 7, 7, 7, 8, 8, 
9, 9, 10, 10, 11, 11, 12, 12, 12, 1, 1, 2, 2, 3, 3, 4, 4, 5, 
5, 6, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10, 11, 11, 12, 12, 12, 1, 
1, 2, 2, 3, 3, 4, 4, 5, 5, 5, 6, 6, 7, 7, 8, 8, 9, 9, 10, 10), 
    year = c(2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 
    2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 
    2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2018, 
    2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
    2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 2018, 
    2018, 2018, 2018, 2018, 2018, 2019, 2019, 2019, 2019, 2019, 
    2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 
    2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 2019, 
    2019, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 
    2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 2020, 
    2020, 2020), dummy = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 
    0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
    1L)), row.names = c(NA, -100L), class = c("tbl_df", "tbl", 
"data.frame"))

【问题讨论】：

这个代码每个月-年最多只有 1 个组合，有什么问题？
我想要每个月-年组合中的最大值。所以我想要一个 1 作为 1-2017 中的最大值，对于 2-2017 等等。
如果我运行你的代码，我就会明白.. 比较 aggregate(var1~month+year,data=df,max) 和 df[df$dummy==1,]
你的虚拟变量用 1s 标记所有最大值
这真的很奇怪，但我只得到 5 行，其中 dummy==1。

标签： r dplyr tidyverse

【解决方案1】：

如果我获取您提供的 data.frame 输出，并执行相同的 mutate 调用，我会得到完全不同的输出：

df %>% group_by(month,year) %>% mutate(ismax=as.integer(var1==max(var1)))
# A tibble: 100 x 6
# Groups:   month, year [46]
   date        var1 month  year dummy ismax
   <date>     <dbl> <dbl> <dbl> <int> <int>
 1 2017-01-01  7.61     1  2017     0     1
 2 2017-01-15  4.40     1  2017     0     0
 3 2017-01-29  7.29     1  2017     0     0
 4 2017-02-12  4.65     2  2017     0     0
 5 2017-02-26  5.99     2  2017     0     1
 6 2017-03-12  7.89     3  2017     0     1
 7 2017-03-26  7.37     3  2017     0     0
 8 2017-04-09  3.56     4  2017     0     1
 9 2017-04-23  2.05     4  2017     0     0
10 2017-05-07  7.86     5  2017     0     0

这些是我的系统设置，是 dplyr 版本太旧还是你加载了另一个冲突的包？

R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: OS X El Capitan 10.11.6

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_0.8.3

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.2       fansi_0.4.0      zeallot_0.1.0    utf8_1.1.4      
 [5] crayon_1.3.4     assertthat_0.2.1 R6_2.4.0         backports_1.1.5 
 [9] magrittr_1.5     pillar_1.4.2     cli_1.1.0        rlang_0.4.1     
[13] vctrs_0.2.0      glue_1.3.1       purrr_0.3.3      compiler_3.6.1  
[17] pkgconfig_2.0.3  tidyselect_0.2.5 tibble_2.1.3

否则你可以试试这个方法：

myfunc = function(x)as.integer(x$var1==max(x$var1))
df %>% group_by(year,month) %>% do(data.frame(. , ismax = myfunc(.)))
# A tibble: 100 x 6
# Groups:   year, month [46]
   date        var1 month  year dummy ismax
   <date>     <dbl> <dbl> <dbl> <int> <int>
 1 2017-01-01  7.61     1  2017     0     1
 2 2017-01-15  4.40     1  2017     0     0
 3 2017-01-29  7.29     1  2017     0     0
 4 2017-02-12  4.65     2  2017     0     0
 5 2017-02-26  5.99     2  2017     0     1
 6 2017-03-12  7.89     3  2017     0     1
 7 2017-03-26  7.37     3  2017     0     0
 8 2017-04-09  3.56     4  2017     0     1
 9 2017-04-23  2.05     4  2017     0     0
10 2017-05-07  7.86     5  2017     0     0
# … with 90 more rows

【讨论】：