Taking the maximum value from a number of columns based on a subset of rows in another column: answers

【Question title】: Taking the maximum value from a number of columns based on a subset of rows in another column
【Posted】: 2015-02-01 13:15:49
【Question】:

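This is my first question on StackOverflow. I'll do my best to keep it concise and clear, and I apologize if it isn't. I'm also new to R. I have searched StackOverflow for an answer to my problem and found bits and pieces that might help, but at the moment I'm not sure which method is best to use, or how to put them together to make it all work.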

I have a dataset like this, called "per1":

   Day  Stat1 Stat2 Stat3
    10  2.12  1.84  2.11
    10  2.09  1.87  2.07
    10  2.08  1.92  2.07
    11  1.90  1.85  1.88
    11  1.87  1.85  1.93
    11  1.86  1.87  1.93

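What I want to do is find the maximum value in each "Stat" column for each day. In other words, the rows over which the maximum is computed in each column are the rows that share the same value in the Day column. The output would look like this: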

Day  MaxStat1  MaxStat2  MaxStat3
10   2.12      1.92      2.11
11   1.90      1.87      1.93

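My idea was to create a loop that determines the number of unique values in the Day column and then uses that to define the rows over which the maximum is computed in each column. But I'm stuck on how to make the max function subset the rows in each column by unique day. What I have so far is rough, and I'm not even sure it follows proper R conventions (again, new to R):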

days <- unique(per1$Day)
stations <- per1[,1:3]
l <- length(days)
for (k in 1:l) {
curr_day <- subset(per1, per1$Day == days[k]) ##this defines the individual day
curr_stn <- stations[curr_day,] ##this is supposed to define the number of rows as the number of rows in curr_day
for(i in 1:stations) {  ##loop over each column
max[i] <- max(stations[curr_day,curr_stn]) ##take the maximum for each column based on the number of rows for each curr_day
}
}

I get:

Error in stations[curr_day, ] : subscript out of bounds

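So I assume this means I haven't defined my arguments correctly. If anyone can help me set up the right format for this loop, that would be greatly appreciated! Any other cleaner/faster approaches are also welcome. (I looked at "mapply", but couldn't figure out how to write a function that defines the number of rows in the Stat columns as the number of rows for each unique day.)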

Thanks for your time.
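For reference, here is a minimal sketch of a loop that does what the question describes. Several things in the attempt above look likely to cause trouble: `per1[,1:3]` selects Day, Stat1 and Stat2 rather than the three Stat columns; `stations[curr_day,]` uses a whole data frame as a row index, which is a plausible source of the "subscript out of bounds" error; `1:stations` is not a column count; and assigning to `max[i]` collides with the name of the built-in `max` function. The names `stat_cols`, `rows` and `loop_max` are introduced here purely for illustration.

```r
# Rebuild the example data from the question
per1 <- data.frame(
  Day   = c(10, 10, 10, 11, 11, 11),
  Stat1 = c(2.12, 2.09, 2.08, 1.90, 1.87, 1.86),
  Stat2 = c(1.84, 1.87, 1.92, 1.85, 1.85, 1.87),
  Stat3 = c(2.11, 2.07, 2.07, 1.88, 1.93, 1.93)
)

days      <- unique(per1$Day)
stat_cols <- setdiff(names(per1), "Day")   # every column except Day

# Pre-allocate the result: one row per day, one Max column per Stat column
loop_max <- data.frame(Day = days)
for (col in stat_cols) {
  loop_max[[paste0("Max", col)]] <- NA_real_
}

for (k in seq_along(days)) {
  rows <- per1$Day == days[k]              # logical index of this day's rows
  for (col in stat_cols) {
    # maximum of one Stat column, restricted to the current day's rows
    loop_max[k, paste0("Max", col)] <- max(per1[rows, col])
  }
}

loop_max
```

The key change is that rows are selected with a logical vector (`per1$Day == days[k]`) instead of a data frame, and the result is stored under a name that does not shadow `max`.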

【Comments】:

Tags: r loops for-loop max subset

【Solution 1】:

This is a simple by-group calculation. The hard part has already been done for us. We can use aggregate.

aggregate(. ~ Day, per1, max)
#   Day Stat1 Stat2 Stat3
# 1  10  2.12  1.92  2.11
# 2  11  1.90  1.87  1.93

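【Discussion】:

I like that this is a built-in R command, with no need to pull in a separate package. Would you mind explaining what the "." and the "~" are for? I assume they somehow indicate the arguments?
I think "." refers to the whole data frame, but I'm not familiar with "~" as a metacharacter.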
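To make the formula in `aggregate(. ~ Day, per1, max)` concrete: `~` builds an R formula, with the columns to summarise on the left and the grouping columns on the right, and in `aggregate` a `.` on the left stands for every column of the data that is not named on the right. The sketch below spells the same request out by hand; the variable names `short` and `long` are only for illustration.

```r
# Example data from the question
per1 <- data.frame(
  Day   = c(10, 10, 10, 11, 11, 11),
  Stat1 = c(2.12, 2.09, 2.08, 1.90, 1.87, 1.86),
  Stat2 = c(1.84, 1.87, 1.92, 1.85, 1.85, 1.87),
  Stat3 = c(2.11, 2.07, 2.07, 1.88, 1.93, 1.93)
)

# Shorthand: "all other columns, grouped by Day"
short <- aggregate(. ~ Day, data = per1, FUN = max)

# The same request with the left-hand side written out explicitly
long <- aggregate(cbind(Stat1, Stat2, Stat3) ~ Day, data = per1, FUN = max)

# The two results should agree
all.equal(short, long)
```

The shorthand is convenient when there are many Stat columns; the explicit form is useful when only some of them should be summarised.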

【Solution 2】:

The best part of R is not having to write loops! Try this:

library(dplyr)
maxdat <- per1 %>%
            group_by(Day) %>%
            summarise_each(funs(max))

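【Discussion】:

That worked, thanks! Would you mind explaining the %>% operator? I assume the group_by, summarise_each, and funs functions are part of the dplyr package?
@abishop It's called a "pipe", and although it is available in dplyr, it originates from its own package called magrittr. It passes the result of one function on to the next, letting you chain functions together without having to save every step in a variable or nest functions inside functions. Pretty cool.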
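As a small illustration of the pipe described above, the sketch below computes the same per-day maximum of one column twice, once as a piped chain and once as nested calls. It assumes dplyr is installed; the names `piped` and `nested` are just for this example.

```r
library(dplyr)

# Example data from the question
per1 <- data.frame(
  Day   = c(10, 10, 10, 11, 11, 11),
  Stat1 = c(2.12, 2.09, 2.08, 1.90, 1.87, 1.86),
  Stat2 = c(1.84, 1.87, 1.92, 1.85, 1.85, 1.87),
  Stat3 = c(2.11, 2.07, 2.07, 1.88, 1.93, 1.93)
)

# Piped: read left to right, each result feeds the next function
piped <- per1 %>%
  group_by(Day) %>%
  summarise(MaxStat1 = max(Stat1))

# Nested: the same calls, read from the inside out
nested <- summarise(group_by(per1, Day), MaxStat1 = max(Stat1))

# Both forms should give the same result
all.equal(piped, nested)
```

In other words, `x %>% f(y)` is another way of writing `f(x, y)`.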

【Solution 3】:

Updating Nick's answer using dplyr:

summarise_each() is deprecated and has been replaced by summarise_all(). See the related dplyr release notes: https://github.com/tidyverse/dplyr/releases/tag/v0.7.0.

per1 <- data.frame(Day = c(10, 10, 10, 11, 11, 11), 
                   stat1 = rnorm(6), 
                   stat2 = runif(6), 
                   stat3 = 1:6)

per1
##   Day      stat1      stat2 stat3
## 1  10  0.5172806 0.14336084     1
## 2  10 -0.5693747 0.10477538     2
## 3  10 -0.3351060 0.77701780     3
## 4  11 -0.1472232 0.28173915     4
## 5  11  0.5093479 0.65901061     5
## 6  11 -1.8770271 0.02960309     6

library(dplyr)
maxdat <- per1 %>%
            group_by(Day) %>%
            summarise_all(max)

maxdat
## # A tibble: 2 x 4
##     Day stat1 stat2 stat3
##   <dbl> <dbl> <dbl> <dbl>
## 1  10.0 0.517 0.777  3.00
## 2  11.0 0.509 0.659  6.00
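In more recent dplyr releases (1.0.0 and later), the `summarise_all()` family is itself superseded by `across()`. A minimal sketch of the same calculation in that style follows, assuming dplyr >= 1.0.0 and using the data from the question; the `.names` pattern is an optional addition that produces the MaxStat1-style column names the question asked for.

```r
library(dplyr)

# Example data from the question
per1 <- data.frame(
  Day   = c(10, 10, 10, 11, 11, 11),
  Stat1 = c(2.12, 2.09, 2.08, 1.90, 1.87, 1.86),
  Stat2 = c(1.84, 1.87, 1.92, 1.85, 1.85, 1.87),
  Stat3 = c(2.11, 2.07, 2.07, 1.88, 1.93, 1.93)
)

maxdat <- per1 %>%
  group_by(Day) %>%
  # everything() covers all non-grouping columns; .names prefixes each with "Max"
  summarise(across(everything(), max, .names = "Max{.col}"))

maxdat
```

Dropping the `.names` argument keeps the original column names, which matches the output of `summarise_all(max)`.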

【Discussion】: