R: How to aggregate some columns while keeping other columns

【Question title】: R: How to aggregate some columns while keeping other columns
【Posted】: 2017-11-21 16:35:22
【Question】:

I am running into a problem similar to the one described here, but none of the solutions I have tried work.

Given a table like this:

Date     Exercise     Category  Weight  Reps  EstMax    RepxWeight  Note
4/2/16   Deadlift     Legs      135     7     166.4685  7x135       easy
4/2/16   Deadlift     Legs      135     7     166.4685  7x135       kinda easy
4/2/16   Deadlift     Legs      135     7     166.4685  7x135       tired
4/2/16   Bench Press  Chest     95      5     110.8175  5x95        hard
4/2/16   Bench Press  Chest     135     2     143.991   2x135       not hard
4/9/16   Bench Press  Chest     135     2     143.991   2x135       a little hard
4/9/16   Bench Press  Chest     135     2     143.991   2x135       super tired
4/18/16  Deadlift     Legs      155     8     196.292   8x155       …
4/18/16  Deadlift     Legs      155     5     180.8075  5x155       bad day
5/8/16   Deadlift     Legs      185     3     203.4815  3x185       good day
5/8/16   Deadlift     Legs      185     3     203.4815  3x185       felt easy
5/8/16   Bench Press  Chest     115     4     130.318   4x115       easy
5/8/16   Bench Press  Chest     115     4     130.318   4x115       hard

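I want to aggregate so that, for each combination of several other columns (e.g. Date and Exercise), I get the row with the max value of a particular column (e.g. EstMax), while also keeping all the other columns of that row. If several entries share the same max value, take the first one.

The expected output looks like this: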

Date     Exercise     Category  Weight  Reps  EstMax    RepxWeight  Note
4/2/16   Deadlift     Legs      135     7     166.4685  7x135       easy
4/2/16   Bench Press  Chest     135     2     143.991   2x135       not hard
4/9/16   Bench Press  Chest     135     2     143.991   2x135       a little hard
4/18/16  Deadlift     Legs      155     8     196.292   8x155       …
5/8/16   Deadlift     Legs      185     3     203.4815  3x185       good day
5/8/16   Bench Press  Chest     115     4     130.318   4x115       hard

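Here are examples of some of the things I have tried; in each case the "extra columns" end up being used as grouping factors for the aggregation, which is not what I want.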

data <- structure(list(Date = structure(c(2L, 2L, 2L, 2L, 2L, 3L, 3L, 
1L, 1L, 4L, 4L, 4L, 4L), .Label = c("4/18/16", "4/2/16", "4/9/16", 
"5/8/16"), class = "factor"), Exercise = structure(c(2L, 2L, 
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("Bench Press", 
"Deadlift"), class = "factor"), Category = structure(c(2L, 2L, 
2L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("Chest", 
"Legs"), class = "factor"), Weight = c(135L, 135L, 135L, 95L, 
135L, 135L, 135L, 155L, 155L, 185L, 185L, 115L, 115L), Reps = c(7L, 
7L, 7L, 5L, 2L, 2L, 2L, 8L, 5L, 3L, 3L, 4L, 4L), EstMax = c(166.4685, 
166.4685, 166.4685, 110.8175, 143.991, 143.991, 143.991, 196.292, 
180.8075, 203.4815, 203.4815, 130.318, 130.318), RepxWeight = structure(c(6L, 
6L, 6L, 5L, 1L, 1L, 1L, 7L, 4L, 2L, 2L, 3L, 3L), .Label = c("2x135", 
"3x185", "4x115", "5x155", "5x95", "7x135", "8x155"), class = "factor"), 
    Note = structure(c(4L, 8L, 11L, 7L, 9L, 2L, 10L, 1L, 3L, 
    6L, 5L, 4L, 7L), .Label = c("…", "a little hard", "bad day", 
    "easy", "felt easy", "good day", "hard", "kinda easy", "not hard", 
    "super tired", "tired"), class = "factor")), .Names = c("Date", 
"Exercise", "Category", "Weight", "Reps", "EstMax", "RepxWeight", 
"Note"), class = "data.frame", row.names = c(NA, -13L))

# base R
aggregate(EstMax ~ Date + Exercise, data = data, FUN = max)
# Date    Exercise   EstMax
# 1  4/2/16 Bench Press 143.9910
# 2  4/9/16 Bench Press 143.9910
# 3  5/8/16 Bench Press 130.3180
# 4 4/18/16    Deadlift 196.2920
# 5  4/2/16    Deadlift 166.4685
# 6  5/8/16    Deadlift 203.4815

aggregate(EstMax ~ Date + Exercise + RepxWeight + Note, data = data, FUN = max)
# Date    Exercise RepxWeight          Note   EstMax
# 1  4/18/16    Deadlift      8x155             … 196.2920
# 2   4/9/16 Bench Press      2x135 a little hard 143.9910
# 3  4/18/16    Deadlift      5x155       bad day 180.8075
# 4   5/8/16 Bench Press      4x115          easy 130.3180
# 5   4/2/16    Deadlift      7x135          easy 166.4685
# 6   5/8/16    Deadlift      3x185     felt easy 203.4815
# 7   5/8/16    Deadlift      3x185      good day 203.4815
# 8   5/8/16 Bench Press      4x115          hard 130.3180
# 9   4/2/16 Bench Press       5x95          hard 110.8175
# 10  4/2/16    Deadlift      7x135    kinda easy 166.4685
# 11  4/2/16 Bench Press      2x135      not hard 143.9910
# 12  4/9/16 Bench Press      2x135   super tired 143.9910
# 13  4/2/16    Deadlift      7x135         tired 166.4685


# data table
library("data.table")
data_dt <- data.table(data)
data_dt[ , max(EstMax), by = c("Date", "Exercise")]
# Date    Exercise       V1
# 1:  4/2/16    Deadlift 166.4685
# 2:  4/2/16 Bench Press 143.9910
# 3:  4/9/16 Bench Press 143.9910
# 4: 4/18/16    Deadlift 196.2920
# 5:  5/8/16    Deadlift 203.4815
# 6:  5/8/16 Bench Press 130.3180

data_dt[, max(EstMax), .(Date, Exercise, Weight, Reps, RepxWeight, Note)]
# Date    Exercise Weight Reps RepxWeight          Note       V1
# 1:  4/2/16    Deadlift    135    7      7x135          easy 166.4685
# 2:  4/2/16    Deadlift    135    7      7x135    kinda easy 166.4685
# 3:  4/2/16    Deadlift    135    7      7x135         tired 166.4685
# 4:  4/2/16 Bench Press     95    5       5x95          hard 110.8175
# 5:  4/2/16 Bench Press    135    2      2x135      not hard 143.9910
# 6:  4/9/16 Bench Press    135    2      2x135 a little hard 143.9910
# 7:  4/9/16 Bench Press    135    2      2x135   super tired 143.9910
# 8: 4/18/16    Deadlift    155    8      8x155             … 196.2920
# 9: 4/18/16    Deadlift    155    5      5x155       bad day 180.8075
# 10:  5/8/16    Deadlift    185    3      3x185      good day 203.4815
# 11:  5/8/16    Deadlift    185    3      3x185     felt easy 203.4815
# 12:  5/8/16 Bench Press    115    4      4x115          easy 130.3180
# 13:  5/8/16 Bench Press    115    4      4x115          hard 130.3180

A base R solution is especially preferred. I have also seen the which.max() function, which might help, but I could not figure out how to apply it here.

Only keep min value for each factor level

How to select the row with the maximum value in each group

aggregating multiple columns in data.table

How to aggregate some columns while keeping other columns in R?

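【Comments】:

Which columns are you aggregating by?
Aggregate by Date and Exercise to get the max EstMax value for each exercise on each day.
It is not a duplicate, because "a base R solution is especially preferred." As I described, the other solutions do not give the result I want.

Tags: r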

【Solution 1】:

I know you are looking for a base R solution, but in the meantime, here is a dplyr one:

library(dplyr)

data %>% 
  group_by(Date, Exercise) %>% 
  slice(which.max(EstMax))

# # A tibble: 6 x 8
# # Groups:   Date, Exercise [6]
#      Date    Exercise Category Weight  Reps   EstMax RepxWeight          Note
#    <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>        <fctr>
# 1 4/18/16    Deadlift     Legs    155     8 196.2920      8x155             …
# 2  4/2/16 Bench Press    Chest    135     2 143.9910      2x135      not hard
# 3  4/2/16    Deadlift     Legs    135     7 166.4685      7x135          easy
# 4  4/9/16 Bench Press    Chest    135     2 143.9910      2x135 a little hard
# 5  5/8/16 Bench Press    Chest    115     4 130.3180      4x115          easy
# 6  5/8/16    Deadlift     Legs    185     3 203.4815      3x185      good day

Edit

data.table is not my strong suit, but for completeness, here is my attempt:

library(data.table)

setDT(data)[, .SD[which.max(EstMax)], by = .(Date, Exercise)]

#       Date    Exercise Category Weight Reps   EstMax RepxWeight          Note
# 1:  4/2/16    Deadlift     Legs    135    7 166.4685      7x135          easy
# 2:  4/2/16 Bench Press    Chest    135    2 143.9910      2x135      not hard
# 3:  4/9/16 Bench Press    Chest    135    2 143.9910      2x135 a little hard
# 4: 4/18/16    Deadlift     Legs    155    8 196.2920      8x155             …
# 5:  5/8/16    Deadlift     Legs    185    3 203.4815      3x185      good day
# 6:  5/8/16 Bench Press    Chest    115    4 130.3180      4x115          easy

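【Comments】:

By far the simplest solution with dplyr; it mirrors the base R solution but is probably more readable.
The data.table approach here works well. However, if I replace which.max with sum(EstMax, na.rm = TRUE), I get a table full of NAs. Any idea what is going on?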
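A possible explanation for the all-NA table mentioned in the last comment: .SD[i] uses i as row numbers within each group. which.max() returns a valid row number, whereas sum() returns a total that is usually larger than the group size, and data.table appears to return a row of NAs for a row number past the end. A minimal sketch on made-up data (dt, grp and val are illustrative names, not from the question):

```r
library(data.table)

# Hypothetical toy data, not the asker's table.
dt <- data.table(
  grp = c("a", "a", "b"),
  val = c(10, 20, 30)
)

# sum() gives 30 for both groups; .SD[30] asks for row 30, which does not exist,
# so each group contributes a row of NA values.
bad <- dt[, .SD[sum(val, na.rm = TRUE)], by = grp]

# To get a per-group total, compute it in j directly instead of indexing .SD.
good <- dt[, .(total = sum(val, na.rm = TRUE)), by = grp]
```

In other words, .SD[...] is for picking rows, while a summary value such as a sum belongs directly in j.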

【Solution 2】:

One (not quite correct) approach, kept here to illustrate the problem with summarizing each numeric column independently:

grpvar <- c("Date", "Exercise", "Category")
merge(
  aggregate(data[,c("Weight", "Reps", "EstMax")], by = data[grpvar], FUN = max),
  aggregate(data[,c("RepxWeight", "Note")], by = data[grpvar], FUN = function(a) a[1]),
  by = grpvar
)
#      Date    Exercise Category Weight Reps   EstMax RepxWeight          Note
# 1 4/18/16    Deadlift     Legs    155    8 196.2920      8x155           ...
# 2  4/2/16 Bench Press    Chest    135    5 143.9910       5x95          hard
# 3  4/2/16    Deadlift     Legs    135    7 166.4685      7x135          easy
# 4  4/9/16 Bench Press    Chest    135    2 143.9910      2x135 a little hard
# 5  5/8/16 Bench Press    Chest    115    4 130.3180      4x115          easy
# 6  5/8/16    Deadlift     Legs    185    3 203.4815      3x185      good day

On 4/2/16, your Bench Press shows a max Weight of 135 and a max Reps of 5, but those two values never appear in the same row of your data.
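The mismatch can be reproduced in isolation. The sketch below uses only the two 4/2/16 Bench Press rows from the question (bp, col_max and best_row are names chosen here for illustration):

```r
# The two 4/2/16 Bench Press rows from the question's data.
bp <- data.frame(
  Weight = c(95L, 135L),
  Reps   = c(5L, 2L),
  EstMax = c(110.8175, 143.991)
)

# Taking each column's max independently combines values from different rows:
# Weight comes from the second row, Reps from the first.
col_max <- sapply(bp, max)

# Selecting one whole row by the column of interest keeps the values consistent.
best_row <- bp[which.max(bp$EstMax), ]
```

Here col_max pairs Weight 135 with Reps 5, while best_row is the actual 135 x 2 set.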

Here is a slightly different (and more correct) approach, using your idea about which.max:

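# Split the rows by Date and Exercise, keep the first max-EstMax row of each group,
# then stack the per-group rows back into one data frame.
do.call(rbind,
        by(data, data[c("Date", "Exercise")],
           function(x) x[which.max(x$EstMax), ])
        )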
#       Date    Exercise Category Weight Reps   EstMax RepxWeight          Note
# 5   4/2/16 Bench Press    Chest    135    2 143.9910      2x135      not hard
# 6   4/9/16 Bench Press    Chest    135    2 143.9910      2x135 a little hard
# 12  5/8/16 Bench Press    Chest    115    4 130.3180      4x115          easy
# 8  4/18/16    Deadlift     Legs    155    8 196.2920      8x155           ...
# 1   4/2/16    Deadlift     Legs    135    7 166.4685      7x135          easy
# 10  5/8/16    Deadlift     Legs    185    3 203.4815      3x185      good day
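If keeping the original row order matters, one possible base R variant is to flag the first max row of each group with ave() and subset on that flag. This is only a sketch on a few rows copied from the question; d, is_first_max and result are illustrative names:

```r
# A handful of rows from the question's data (remaining columns omitted for brevity).
d <- data.frame(
  Date     = c("4/2/16", "4/2/16", "4/2/16", "4/9/16", "4/9/16"),
  Exercise = c("Bench Press", "Bench Press", "Deadlift", "Bench Press", "Bench Press"),
  EstMax   = c(110.8175, 143.991, 166.4685, 143.991, 143.991),
  Note     = c("hard", "not hard", "easy", "a little hard", "super tired")
)

# Within each Date/Exercise group, mark the position returned by which.max(),
# i.e. the first row holding the group's max. ave() stores the flag as 0/1.
is_first_max <- ave(d$EstMax, d$Date, d$Exercise,
                    FUN = function(v) seq_along(v) == which.max(v)) == 1

# Subsetting keeps every column and the original row order.
result <- d[is_first_max, ]
```

Because which.max() returns the first position of the max, ties resolve to the first entry, as the question asks.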

If for some reason an Exercise could appear in more than one Category, you may want the second argument of by to be data[c("Date","Exercise","Category")] instead.

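(You can sort the output with something like x[order(as.Date(x$Date, format="%m/%d/%y")),] ... in fact, you probably want the $Date column to be an actual Date-class column anyway. Note the lowercase %y, since the years in this data have two digits.)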
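A small sketch of that conversion and sort, assuming the month/day/two-digit-year layout used in the question (hence %y rather than %Y); x and x_sorted are illustrative names:

```r
# Three rows in deliberately shuffled order, with dates as strings.
x <- data.frame(
  Date   = c("5/8/16", "4/18/16", "4/2/16"),
  EstMax = c(203.4815, 196.292, 166.4685)
)

# Convert to Date class; %y reads "16" as 2016.
x$Date <- as.Date(x$Date, format = "%m/%d/%y")

# Sort chronologically rather than alphabetically.
x_sorted <- x[order(x$Date), ]
```

Sorting the raw strings would place "4/18/16" before "4/2/16", which is why the conversion matters.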

【Comments】:

【Solution 3】:

Here is another approach with dplyr:

library(dplyr)
library(lubridate)

data %>%
  mutate(Date = mdy(Date)) %>%
  group_by(Date, Exercise) %>%
  arrange(desc(EstMax)) %>%
  slice(1)

Result:

# A tibble: 6 x 8
# Groups:   Date, Exercise [6]
        Date    Exercise Category Weight  Reps   EstMax RepxWeight          Note
      <date>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>        <fctr>
1 2016-04-02 Bench Press    Chest    135     2 143.9910      2x135      not hard
2 2016-04-02    Deadlift     Legs    135     7 166.4685      7x135          easy
3 2016-04-09 Bench Press    Chest    135     2 143.9910      2x135 a little hard
4 2016-04-18    Deadlift     Legs    155     8 196.2920      8x155             …
5 2016-05-08 Bench Press    Chest    115     4 130.3180      4x115          easy
6 2016-05-08    Deadlift     Legs    185     3 203.4815      3x185      good day

Alternatively, you can use sqldf:

library(sqldf)
library(lubridate)

data$Date = mdy(data$Date)

sqldf("select *, max(EstMax) as EstMax2 from data
        group by Date, Exercise
        order by Date, Exercise")

Result:

        Date    Exercise Category Weight Reps   EstMax RepxWeight          Note  EstMax2
1 2016-04-02 Bench Press    Chest    135    2 143.9910      2x135      not hard 143.9910
2 2016-04-02    Deadlift     Legs    135    7 166.4685      7x135          easy 166.4685
3 2016-04-09 Bench Press    Chest    135    2 143.9910      2x135 a little hard 143.9910
4 2016-04-18    Deadlift     Legs    155    8 196.2920      8x155             … 196.2920
5 2016-05-08 Bench Press    Chest    115    4 130.3180      4x115          easy 130.3180
6 2016-05-08    Deadlift     Legs    185    3 203.4815      3x185      good day 203.4815

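【Comments】:

I like the sqldf suggestion; I had not thought of that, thanks.

【Solution 4】:

I know you prefer a base R solution, but dplyr provides a function, top_n, that does exactly what you are asking for.

Use it once to retrieve all instances of the max EstMax: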

library(dplyr)

data %>%
  group_by(Exercise) %>%
  top_n(1, EstMax)

# A tibble: 5 x 8
# Groups:   Exercise [2]
    Date    Exercise Category Weight  Reps   EstMax RepxWeight          Note
  <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>        <fctr>
1 4/2/16 Bench Press    Chest    135     2 143.9910      2x135      not hard
2 4/9/16 Bench Press    Chest    135     2 143.9910      2x135 a little hard
3 4/9/16 Bench Press    Chest    135     2 143.9910      2x135   super tired
4 5/8/16    Deadlift     Legs    185     3 203.4815      3x185      good day
5 5/8/16    Deadlift     Legs    185     3 203.4815      3x185     felt easy

Use it twice to retrieve the first of the max results:

data %>%
  group_by(Exercise) %>%
  top_n(1, EstMax) %>%
  top_n(1)  # no wt supplied, so top_n() selects by the last column (Note)

Selecting by Note
# A tibble: 2 x 8
# Groups:   Exercise [2]
    Date    Exercise Category Weight  Reps   EstMax RepxWeight        Note
  <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>      <fctr>
1 4/9/16 Bench Press    Chest    135     2 143.9910      2x135 super tired
2 5/8/16    Deadlift     Legs    185     3 203.4815      3x185    good day

Note that this takes the first result, not necessarily the earliest date, so you need to arrange by date before using the second top_n:

data %>%
  group_by(Exercise) %>%
  top_n(1, EstMax) %>%
  mutate(Date = as.Date(Date, format = '%m/%d/%y')) %>%  # dates are month/day/year
  arrange(Date) %>%
  top_n(1)

Selecting by Note
# A tibble: 2 x 8
# Groups:   Exercise [2]
        Date    Exercise Category Weight  Reps   EstMax RepxWeight        Note
      <date>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>      <fctr>
1 2016-04-09 Bench Press    Chest    135     2 143.9910      2x135 super tired
2 2016-05-08    Deadlift     Legs    185     3 203.4815      3x185    good day

[edit] I slightly misread the question; here is a solution that gives the output you asked for:

data %>%
  group_by(Date, Exercise) %>%
  top_n(1, EstMax) %>%
  top_n(1)

Selecting by Note
# A tibble: 6 x 8
# Groups:   Date, Exercise [6]
     Date    Exercise Category Weight  Reps   EstMax RepxWeight        Note
   <fctr>      <fctr>   <fctr>  <int> <int>    <dbl>     <fctr>      <fctr>
1  4/2/16    Deadlift     Legs    135     7 166.4685      7x135       tired
2  4/2/16 Bench Press    Chest    135     2 143.9910      2x135    not hard
3  4/9/16 Bench Press    Chest    135     2 143.9910      2x135 super tired
4 4/18/16    Deadlift     Legs    155     8 196.2920      8x155           …
5  5/8/16    Deadlift     Legs    185     3 203.4815      3x185    good day
6  5/8/16 Bench Press    Chest    115     4 130.3180      4x115        hard
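For reference, dplyr 1.0.0 and later mark top_n() as superseded in favour of slice_max(). The sketch below shows how the same idea might look there, on a few rows copied from the question (d and res are illustrative names). with_ties = FALSE keeps exactly one row per group; which of several tied rows is kept is worth checking against the dplyr documentation rather than assuming:

```r
library(dplyr)

# A few rows from the question's data (remaining columns omitted for brevity).
d <- data.frame(
  Exercise = c("Bench Press", "Bench Press", "Deadlift", "Deadlift"),
  EstMax   = c(143.991, 143.991, 196.292, 180.8075),
  Note     = c("a little hard", "super tired", "good", "bad day")
)

# One row per Exercise with the largest EstMax; all other columns are kept.
res <- d %>%
  group_by(Exercise) %>%
  slice_max(EstMax, n = 1, with_ties = FALSE) %>%
  ungroup()
```

Unlike the top_n(1) call above, this never falls back to ranking by the Note column.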

【Comments】:

This does not give the expected output.