Get rows from a column per group based on a condition

【Question title】: Get rows from a column per group based on a condition
【Posted】: 2020-11-30 06:34:33
【Question】:

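I have a data.frame, reproduced in the dput output further down.

The basic requirement is to find, within each group, the mean of the first "n" entries of "Value" that come after a certain date.

For example, the user supplies:

Certain Date = Failure Date

n = 4

So for A, the mean is (60+70+80+100)/4 = 77.5; NAs are ignored.

For B, the mean is (80+90+100)/3 = 90. Note that for B, n = 4 cannot be reached, because only 3 values remain after the row where FailureDate == ValueDate.

Here is the dput: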

structure(list(Name = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", 
"B"), class = "factor"), FailureDate = structure(c(1L, 1L, 1L, 
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
2L), .Label = c("1/5/2020", "1/7/2020"), class = "factor"), ValueDate = structure(c(1L, 
3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 2L, 1L, 3L, 4L, 5L, 6L, 7L, 
8L, 9L, 10L, 2L), .Label = c("1/1/2020", "1/10/2020", "1/2/2020", 
"1/3/2020", "1/4/2020", "1/5/2020", "1/6/2020", "1/7/2020", "1/8/2020", 
"1/9/2020"), class = "factor"), Value = c(10L, 20L, 30L, 40L, 
NA, 60L, 70L, 80L, NA, 100L, 10L, 20L, 30L, 40L, 50L, 60L, 70L, 
80L, 90L, 100L)), class = "data.frame", row.names = c(NA, -20L
))

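【Comments】:

Tags: r dataframe tidyverse

【Solution 1】:

After grouping by "Name", we can build an index with cumsum that flags the rows after the failure date, extract those "Value" elements, and take the mean of the first n of them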

library(dplyr)
n <- 4
df1 %>%
   type.convert(as.is = TRUE) %>% 
   group_by(Name) %>% 
   summarise(Ave = mean(head(na.omit(Value[lag(cumsum(FailureDate == ValueDate),
        default = 0) > 0]), n), na.rm = TRUE))
# A tibble: 2 x 2
#  Name    Ave
#  <chr> <dbl>
#1 A      77.5
#2 B      90
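
To make the index in Solution 1 easier to follow, here is a minimal base R sketch of the same idea for group A only. The variable names (`value_date`, `hits`, `after_failure`, `picked`, `ave_a`) are illustrative and not part of the original answer, and the manual shift stands in for `dplyr::lag(..., default = 0)`.

```r
# Group A from the question, with dates kept as character strings.
value_date   <- c("1/1/2020", "1/2/2020", "1/3/2020", "1/4/2020", "1/5/2020",
                  "1/6/2020", "1/7/2020", "1/8/2020", "1/9/2020", "1/10/2020")
failure_date <- "1/5/2020"
value        <- c(10, 20, 30, 40, NA, 60, 70, 80, NA, 100)
n <- 4

# cumsum() turns the single TRUE at the failure row into a running count:
# 0 before the failure row, 1 from the failure row onwards.
hits <- cumsum(value_date == failure_date)

# Shifting by one row (the effect of lag with default = 0) excludes the
# failure row itself, so only strictly later rows are selected.
after_failure <- c(0, head(hits, -1)) > 0

# Drop NAs first, then keep at most n values and average them.
picked <- head(as.numeric(na.omit(value[after_failure])), n)
ave_a  <- mean(picked)
```

Dropping the NAs before calling `head()` matters here: taking the first n rows first would pick up the NA on 1/9/2020 and leave only three usable values for A.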

【Comments】:

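@akrun Great. The only thing to note here is that if n = 3, the mean for each group should use only the first 3 values after FailureDate == ValueDate. Your solution averages all rows after the condition FailureDate == ValueDate is met.
If the OP means only 4 values, then it should be head. From the example I wasn't sure, because 4 is the maximum in both groups. It may well be what you suggest.
@user11397513 Thanks, I updated the answer with head.
Perfect. I had gotten this far: test1 %>% group_by(Name) %>% filter(as.Date(ValueDate) > as.Date(last(FailureDate))) %>% mutate(mean = mean(head(Value, 4)))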

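【Solution 2】:

You can convert the factor dates to Date objects and then compute the mean of the first "n" numbers after FailureDate in each group. Note that the "n" numbers should exclude NA, which is why tidyr::drop_na() is used here.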

library(dplyr)

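df %>%
  # parse both date columns (stored as factors) using the month/day/year format
  mutate(across(contains("Date"), as.Date, "%m/%d/%Y")) %>%
  tidyr::drop_na(Value) %>% 
  group_by(Name) %>%
  # first 4 values strictly after the failure date; [1:4] pads with NA if fewer exist
  summarise(mean = mean(Value[ValueDate > FailureDate][1:4], na.rm = TRUE))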

# # A tibble: 2 x 2
#   Name   mean
#   <fct> <dbl>
# 1 A      77.5
# 2 B      90
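
Why Solution 2 converts to Date before using `>` can be seen from a small sketch. The names `as_text` and `as_date` are illustrative only. It assumes the month/day/year format used in that answer.

```r
# Compared as plain strings, "1/10/2020" sorts before "1/5/2020",
# because the comparison is character by character ("1" < "5").
as_text <- "1/10/2020" > "1/5/2020"

# Parsed as dates, the comparison is chronological.
as_date <- as.Date("1/10/2020", format = "%m/%d/%Y") >
  as.Date("1/5/2020", format = "%m/%d/%Y")
```

Here `as_text` is FALSE while `as_date` is TRUE, so an ordering test on the raw strings would silently drop the 1/10/2020 row.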

【Comments】:

【Solution 3】:

You can try this:

library(dplyr)

n <- 4

df %>%
  mutate(condition = as.character(FailureDate) == as.character(ValueDate)) %>%
  group_by(Name) %>%
  mutate(condition = cumsum(condition)) %>%
  filter(condition == 1) %>%
  slice(-1) %>%
  filter(!is.na(Value)) %>%
  slice(1:n) %>%
  summarise(mean_col = mean(Value))

> df

# A tibble: 2 x 2
  Name  mean_col
  <fct>    <dbl>
1 A         77.5
2 B         90

【Comments】: