How to summarize by group while retrieving values from columns that weren't summarized

【Question title】: How to summarize by group while retrieving values from columns that weren't summarized
【Posted】: 2022-01-12 00:09:25
【Question】:

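I am trying to summarize a data frame while grouping by a variable. My problem is that the summary drops other columns that I still need.

Consider the following data: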

df <- 
  tibble::tribble(
    ~id, ~year, ~my_value,
    1,   2010,  2,
    1,   2013,  2,
    1,   2014,  2,
    2,   2010,  4,
    2,   2012,  3,
    2,   2014,  4,
    2,   2015,  2,
    3,   2015,  3,
    3,   2010,  3,
    3,   2011,  3
  )

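I want to group by id and collapse my_value to a single value per group, using the following rule:

If all values of my_value are identical, simply return the first value, i.e. my_value[1].
Otherwise, return the minimum, i.e. min(my_value).

So I wrote a small function to do this: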

my_func <- function(x) {
  if (var(x) == 0) {
    return(x[1])
  }
  # else:
  min(x)
}

Now I can summarize by id with either dplyr or data.table:

library(dplyr)
library(data.table)

# dplyr
df %>%
  group_by(id) %>%
  summarise(my_min_val = my_func(my_value))
#> # A tibble: 3 x 2
#>      id my_min_val
#>   <dbl>      <dbl>
#> 1     1          2
#> 2     2          2
#> 3     3          3

# data.table
setDT(df)[, .(my_min_val = my_func(my_value)), by = "id"]
#>    id my_min_val
#> 1:  1          2
#> 2:  2          2
#> 3:  3          3

So far so good. My problem is that I lose the year values. I want the year that corresponds to each selected my_value.

My desired output looks like this:

# desired output
desired_output <- 
  tribble(~id, ~my_min_val, ~year,
          1,   2,           2010,  # because for id 1, var(my_value) is 0, and hence my_value[1] corresponds to year 2010
          2,   2,           2015,  # because for id 2, var(my_value) is not 0, and hence min(my_value) (which is 2) corresponds to year 2015
          3,   3,           2015)  # because for id 3, var(my_value) is 0, hence my_value[1] corresponds to year 2015

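I am specifically looking for a data.table solution, because my real data is very large (over 1 million rows) and contains many groups, so efficiency matters. Thanks!

【Comments】:

Related: Extract row corresponding to minimum value of a variable by group

Tags: r dplyr data.table

【Solution 1】:

We can use a condition inside slice. Note that my_func is redefined here to return the position of the row to keep rather than the value: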

library(dplyr)
my_func <- function(x) if(var(x) == 0) 1 else which.min(x)
df %>% 
   group_by(id) %>% 
   slice(my_func(my_value)) %>%
   ungroup

-output

# A tibble: 3 × 3
     id  year my_value
  <dbl> <dbl>    <dbl>
1     1  2010        2
2     2  2015        2
3     3  2015        3

Or with data.table:

library(data.table)
setDT(df)[df[, .I[my_func(my_value)], id]$V1]
   id year my_value
1:  1 2010        2
2:  2 2015        2
3:  3 2015        3
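
The data.table one-liner is easier to follow when split into its two steps. The sketch below is an illustration rather than part of the original answer: the names `dt`, `pick_row`, `idx` and `res` are made up here, and the renaming at the end is only an assumption about how to match the column layout asked for in the question.

```r
library(data.table)

# Same data as in the question
dt <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  year     = c(2010, 2013, 2014, 2010, 2012, 2014, 2015, 2015, 2010, 2011),
  my_value = c(2, 2, 2, 4, 3, 4, 2, 3, 3, 3)
)

# Same logic as my_func in the answer, under a hypothetical name:
# position (within the group) of the row to keep
pick_row <- function(x) if (var(x) == 0) 1 else which.min(x)

# Step 1: .I holds row numbers of the whole table, so .I[pick_row(...)]
# yields one global row number per id (returned in the column V1)
idx <- dt[, .I[pick_row(my_value)], by = id]$V1

# Step 2: subset the table by those row numbers
res <- dt[idx]

# Optional: rename and reorder to id, my_min_val, year
setnames(res, "my_value", "my_min_val")
setcolorder(res, c("id", "my_min_val", "year"))
print(res)
```

Subsetting by row numbers keeps every other column of the selected rows, which is why year survives without being mentioned in the grouping step.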

Or with slice_min and with_ties = FALSE:

df %>%
    group_by(id) %>% 
    slice_min(n = 1, order_by = my_value, with_ties = FALSE)  %>%
    ungroup

-output

# A tibble: 3 × 3
     id  year my_value
  <dbl> <dbl>    <dbl>
1     1  2010        2
2     2  2015        2
3     3  2015        3
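
On why the variant without an explicit condition gives the same rows: which.min() returns the position of the first minimum, so for a group whose values are all equal it already returns 1. A small check of that claim in data.table, assuming the sample data from the question; `dt`, `idx_rule` and `idx_plain` are illustrative names only:

```r
library(data.table)

# Same data as in the question
dt <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  year     = c(2010, 2013, 2014, 2010, 2012, 2014, 2015, 2015, 2010, 2011),
  my_value = c(2, 2, 2, 4, 3, 4, 2, 3, 3, 3)
)

# Row numbers chosen with the explicit "all equal -> first row" branch
idx_rule <- dt[, .I[if (var(my_value) == 0) 1 else which.min(my_value)],
               by = id]$V1

# Row numbers chosen with which.min() alone
idx_plain <- dt[, .I[which.min(my_value)], by = id]$V1

# Both select rows 1, 7 and 8 for this data
stopifnot(identical(idx_rule, idx_plain))
print(dt[idx_plain])
```

This only shows agreement on the sample data; the general argument rests on which.min() picking the first of several tied minima.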

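【Discussion】:

Thanks. Getting the first element when var(my_value) == 0 (rather than just min()) matters to me, because when all elements in a group are identical I need the year that corresponds to the first element.
@Emman How about the update?
Yes! The first one seems perfect. I'm not sure about the second option. Does the second approach follow the same condition?
Wow, you're awesome. Thanks a lot.
var() is limited, yes, I know. I think I'll go with data.table::uniqueN(x) == 1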
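
One known limitation of var(), possibly the one meant in the last comment, is that it returns NA for a group with a single row, so `if (var(x) == 0)` stops with an error. A minimal sketch of the uniqueN() variant, assuming the index-returning form of the function from the answer; the name `pick_row` and the extra single-row id 4 are made up for illustration:

```r
library(data.table)

# Data from the question plus one hypothetical single-row group (id 4)
dt <- data.table(
  id       = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4),
  year     = c(2010, 2013, 2014, 2010, 2012, 2014, 2015, 2015, 2010, 2011, 2020),
  my_value = c(2, 2, 2, 4, 3, 4, 2, 3, 3, 3, 7)
)

# var() of a single value is NA, which cannot be used in if()
is.na(var(7))

# Hypothetical helper: first row if the group has one distinct value,
# otherwise the row holding the minimum
pick_row <- function(x) if (uniqueN(x) == 1L) 1L else which.min(x)

res <- dt[dt[, .I[pick_row(my_value)], by = id]$V1]
print(res)
```

uniqueN() counts distinct values directly, so it avoids both the single-row NA and any reliance on a floating-point variance being exactly zero.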