dplyr: conditionally filter on only unique items

[Question title]: dplyr conditionally filter on only unique items
[Posted]: 2021-03-01 20:49:46
[Question]:

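I have a data frame with the columns ID, Age, and Score. I want to filter it down to unique IDs whose score is below a certain value. For IDs that have multiple scores below the threshold, I only want to keep the oldest one (i.e. the one with the maximum Age).

Here is how I tried to implement it, but it is slow because of the loop. Is there a way to speed this up with dplyr or a similar library?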

# flag the rows whose score is at or below the threshold
idx <- df$Score <= threshold
# keep only the flagged rows
df <- df[idx, ]
# find the unique IDs
unique_ids <- unique(df$ID)
# pre-allocate the result with the same columns as df
unique_items <- data.frame(matrix(ncol = ncol(df), nrow = length(unique_ids)))
colnames(unique_items) <- colnames(df)
# loop through each unique ID (seq_along is safe when there are no IDs)
for (i in seq_along(unique_ids)) {
  # find all the rows that match that unique ID
  my.list <- df[df$ID == unique_ids[i], ]
  # find the index of the oldest row for this ID
  oldest_idx <- which.max(my.list$Age)
  # assign it to the result data frame
  unique_items[i, ] <- my.list[oldest_idx, ]
}

[Question comments]:

Tags: r filter dplyr

[Solution 1]:

We can also use

df %>%
  filter(Score < 5) %>%
  group_by(ID) %>%
  slice(which.min(Age))

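[Comments]:

Thanks. This seems to be the cleanest solution. Just for my own edification, slice(which.min(Age)) will return the oldest (maximum) Age for any unique ID, correct?
@coolhand For each ID, it returns the row with the lowest Age. If you want the oldest one, use which.max instead.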
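
To make the correction in the comment above concrete, here is a minimal sketch of the which.max variant. The toy data frame and the name `oldest` are made up for illustration and do not come from the question; the threshold of 5 is the one used in the answer.

```r
library(dplyr)

# Hypothetical toy data, not from the question
df <- tibble(
  ID    = c(1, 1, 1, 2, 2),
  Age   = c(3, 8, 5, 4, 9),
  Score = c(2, 4, 9, 1, 3)
)

oldest <- df %>%
  filter(Score < 5) %>%       # keep rows below the threshold
  group_by(ID) %>%
  slice(which.max(Age)) %>%   # one row per ID: the first row with the largest Age
  ungroup()

oldest
```

Because which.max returns a single index, this keeps exactly one row per ID even when several rows share the largest Age.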

[Solution 2]:

Sample data

library(dplyr)
set.seed(2021)
df <- tibble(ID=rep(1:2, each=5), Age=sample(10), Score=c(1:5, 3:7))
df
# # A tibble: 10 x 3
#       ID   Age Score
#    <int> <int> <int>
#  1     1     7     1
#  2     1     6     2
#  3     1     9     3
#  4     1     2     4
#  5     1     4     5
#  6     2     8     3
#  7     2    10     4
#  8     2     5     5
#  9     2     1     6
# 10     2     3     7

Answer:

df %>%
  filter(Score < 5) %>%
  group_by(ID) %>%
  slice_min(Age) %>%
  ungroup()
# # A tibble: 2 x 3
#      ID   Age Score
#   <int> <int> <int>
# 1     1     2     4
# 2     2     8     3

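Here the score threshold is 5: only rows with Score below 5 are kept. Note that slice_min(Age) returns, for each ID, the row with the lowest Age among those rows; to keep the oldest record, as the question asks, use slice_max(Age) instead.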
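
As a sketch of the "oldest record" variant, the block below rebuilds the sample data by hand from the printed table (rather than relying on set.seed) and swaps slice_min for slice_max. The name `oldest` is introduced here for illustration only.

```r
library(dplyr)

# Sample data copied from the printed tibble above
df <- tibble(
  ID    = rep(1:2, each = 5),
  Age   = c(7L, 6L, 9L, 2L, 4L, 8L, 10L, 5L, 1L, 3L),
  Score = c(1:5, 3:7)
)

oldest <- df %>%
  filter(Score < 5) %>%        # rows below the threshold
  group_by(ID) %>%
  slice_max(Age, n = 1) %>%    # row(s) with the largest Age in each ID
  ungroup()

oldest
# ID 1 keeps Age 9 (Score 3); ID 2 keeps Age 10 (Score 4)
```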

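[Comments]:

Thanks. slice_min seems to be what I was missing. When I run it, length(unique(df$ID)) gives a different result from nrow(df). If I keep only one instance (the oldest) of each unique ID, shouldn't they be the same?
I don't know what data you are using, nor what your expected output is given that data. In my interpretation (and evidently in akrun's nearly identical answer), nothing seems to be wrong.
akrun's answer doesn't cause the same discrepancy, which is why I chose it. I'll have to dig deeper into my data to find out why, but thanks for your answer.
Next time, please remove the ambiguity and make it easier by including sample data and the expected output given that data. If I had known that my random data was not very similar to your real data, I wouldn't have tried.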
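
The thread never establishes why the row counts differed. One possible cause, offered here only as an assumption about the asker's data, is tied Age values: slice_min and slice_max keep all tied rows by default (with_ties = TRUE), whereas slice(which.max(Age)) always keeps a single row. A small sketch with made-up data:

```r
library(dplyr)

# Hypothetical data: ID 1 has two rows tied for the largest Age
df <- tibble(
  ID    = c(1, 1, 2),
  Age   = c(8, 8, 5),
  Score = c(1, 2, 3)
)

below <- df %>% filter(Score < 5) %>% group_by(ID)

with_ties    <- below %>% slice_max(Age, n = 1) %>% ungroup()
without_ties <- below %>% slice_max(Age, n = 1, with_ties = FALSE) %>% ungroup()
via_which    <- below %>% slice(which.max(Age)) %>% ungroup()

nrow(with_ties)     # 3: both tied rows for ID 1 are kept
nrow(without_ties)  # 2: one row per ID
nrow(via_which)     # 2: one row per ID
```

If ties are the cause, passing with_ties = FALSE makes the slice_min / slice_max version return one row per ID, matching the which.min / which.max version.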