R: applying 3-step logic for duplicate removal in DPLYR

【Question title】: R: applying 3-step logic for duplicate removal in DPLYR
【Posted】: 2018-06-18 15:48:46
【Question】:

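I am completely at a loss as to how to filter duplicates based on the values of several string variables. Sadly, my dataset is private, but I can give a glimpse of it with fake data: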

id = c(1, 1, 2, 2, 5, 6, 6)
car = c(0, 1, 1, 1, 1, 1, 1) 
insurance = c("no", "yes", "yes", "yes", "no", "yes", "yes")
ins_type = c("", "liab", "liab", "full", "", "full", "liab")
df = data.frame(id, car, insurance, ins_type)

This builds the following data.frame:

id car insurance ins_type
 1   0        no
 1   1       yes     liab
 2   1       yes     liab
 2   1       yes     full
 5   1        no 
 6   1       yes     full
 6   1       yes     liab

Where:

a. id = person
b. car = 0 is NO and 1 is YES
c. insurance = whether or not that person has one, and
d. ins_type = liability or full

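I need to remove all duplicated individuals. The dataset I want keeps people who:

appear once in the dataset, whether or not they own a car;
own a car, when a duplicated person has rows both with and without one;
have insurance, when there is a choice;
have full insurance, when there is a choice.

That is: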

id car insurance ins_type
 1   1       yes     liab
 2   1       yes     full
 5   1        no 
 6   1       yes     full

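Note that id 5 must be kept because it appears only once. All duplicates were removed. Person #1 has two entries, but only one of them has a car, so that one was kept.

I have the following dplyr code: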

df = df %>%
    group_by(id) %>%
    filter(car == 1) %>%
    filter(insurance == "yes") %>%
    filter(ins_type == "full")

But this results in:

id   car insurance ins_type
 2      1       yes     full
 6      1       yes     full

I also tried

df %>% group_by(id, car) %>% distinct(insurance)

but this results in

id   car insurance
 1     0        no
 1     1       yes
 2     1       yes
 5     1        no
 6     1       yes

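The first row should not be there.

I have searched extensively on this topic and found many answers to the question "how to conditionally filter duplicated rows". Most of them (for example this and this) deal with keeping the one row that has the highest or lowest value. Others deal with arbitrary or random filtering. I need to follow the logic above.

Any insight is very welcome.

EDIT

All of the answers below are very satisfying and solve the problem in their own way. I gave my vote to @storaged's because the core of the solution to my problem is using factor levels to create a hierarchy. Thank you for the help and the lessons; I hope one day I can be of help to you or to the community.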

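【Comments】:

Can you add your desired output?
I think the fourth box is the desired output
The one with IDs 1, 2, 5, 6?
@suchait: Yes, it is the fourth box. It keeps: a. id=5, who appears once; b. the second entry for id=1, the one with a car; c. the entries for id=2 and id=6 whose cars are fully insured. Thanks.
Do you need a solution in dplyr only, or would data.table be fine too?

Tags: r duplicates dplyr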

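【Solution 1】:

I propose the following solution. First, take care of the importance of each field by giving its levels an appropriate order. In your example, we do this with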

df$ins_type <- factor(df$ins_type, levels=c("", "liab", "full"))

The level order of the other factors is already fine. Next we can sort on all the fields and pick the last entry in each group

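df %>%
   group_by(id) %>%                           # %>% must end a line, not start one
   arrange(car, insurance, ins_type) %>%      # columns passed bare; sort() would reorder values apart from their rows
   do(tail(., n = 1))                         # keep the last (highest-ranked) row per id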

But I have a feeling that a more elegant solution may exist
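The idea above can be sketched end to end on the fake data from the question. This is only an illustration under a few assumptions: `stringsAsFactors = FALSE` and the explicit `factor()` calls are added so the behaviour does not depend on the R version, `slice_tail()` assumes dplyr 1.0.0 or later, and the name `deduped` is made up for the example.

```r
library(dplyr)

# Fake data from the question. stringsAsFactors = FALSE is an assumption added
# here so the columns start as plain character vectors in any R version.
df <- data.frame(
  id        = c(1, 1, 2, 2, 5, 6, 6),
  car       = c(0, 1, 1, 1, 1, 1, 1),
  insurance = c("no", "yes", "yes", "yes", "no", "yes", "yes"),
  ins_type  = c("", "liab", "liab", "full", "", "full", "liab"),
  stringsAsFactors = FALSE
)

# Encode the preference order in the factor levels: a later level is preferred.
df$insurance <- factor(df$insurance, levels = c("no", "yes"))
df$ins_type  <- factor(df$ins_type,  levels = c("", "liab", "full"))

deduped <- df %>%
  arrange(id, car, insurance, ins_type) %>%  # least preferred row first within each id
  group_by(id) %>%
  slice_tail(n = 1) %>%                      # keep the most preferred row per id
  ungroup()

print(deduped)
```

On the fake data this keeps exactly one row for each of ids 1, 2, 5 and 6, matching the desired output in the question.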

EDIT

If there are more columns, you can do the following instead of writing their names out by hand

df %>%
   group_by(id) %>%
   arrange_(.dots = names(df)[-1]) %>%   # every column except id; arrange_() is deprecated in newer dplyr
   do(tail(., n = 1))
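Because `arrange_()` is deprecated, the same "sort on every non-id column, keep the last row per id" idea can also be sketched in base R, which avoids depending on a particular dplyr version. This is a swapped-in technique, not the answer's own code; the names `key_col`, `rank_cols`, `sorted` and `result` are hypothetical, and the sketch assumes every ranking column already sorts from least to most preferred.

```r
# Fake data from the question (stringsAsFactors = FALSE is an assumption added
# for consistent behaviour across R versions).
df <- data.frame(
  id        = c(1, 1, 2, 2, 5, 6, 6),
  car       = c(0, 1, 1, 1, 1, 1, 1),
  insurance = c("no", "yes", "yes", "yes", "no", "yes", "yes"),
  ins_type  = c("", "liab", "liab", "full", "", "full", "liab"),
  stringsAsFactors = FALSE
)
df$ins_type <- factor(df$ins_type, levels = c("", "liab", "full"))

key_col   <- "id"
rank_cols <- setdiff(names(df), key_col)   # every column except the id

# order() accepts any number of sort keys, so hand them over with do.call().
# unname() stops column names from being matched to order()'s own arguments.
ord    <- do.call(order, unname(as.list(df[c(key_col, rank_cols)])))
sorted <- df[ord, ]

# Keep the last row of each id, i.e. the most preferred one after sorting.
result <- sorted[!duplicated(sorted[[key_col]], fromLast = TRUE), ]
print(result)
```

With more than three ranking columns nothing changes except the contents of `rank_cols`, as long as the columns are listed in priority order.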

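【Discussion】:

Thanks, @storaged. I will test it and post back.
Hi, your code works well, @storaged. Now I need to test it with the actual dataset, which has more variables and more levels. Thanks also for your edit, which will help with that. Please give me some time to post back.
Hi, I have applied your approach to my actual data, which in fact has more than 10 categories, and it works great. Thanks a million!

【Solution 2】:

Using data.table:-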

library(data.table)
setDT(df)
df[, idx := .N, by = id]
df <- df[!(idx == 2 & car == 0), ]
df[, idx := .N, by = id]
df <- df[!(idx == 2 & ins_type == "liab"), ]
df[, idx := NULL]
df

You will get your desired output:-

   id car insurance ins_type
1:  1   1       yes     liab
2:  2   1       yes     full
3:  5   1        no         
4:  6   1       yes     full

Here is something I tried in dplyr:-

df <- df %>%
  group_by(id) %>%
  mutate(idx = n()) %>%
  filter((idx == 2 | idx == 1) & car == 1) %>%
  mutate(idx1 = n())


df %>%
  filter(!(idx1 == 2 & ins_type == "liab")) %>%
  select(-one_of(c("idx", "idx1")))

It gives the same output:-

# A tibble: 4 x 4
# Groups:   id [4]
     id   car insurance ins_type
  <dbl> <dbl>    <fctr>   <fctr>
1     1     1       yes     liab
2     2     1       yes     full
3     5     1        no         
4     6     1       yes     full
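The filters above test for `idx == 2`, so they target ids that appear exactly twice. A possible generalisation, sketched here under assumptions rather than taken from the answer, is to sort within data.table and keep the last row of every group. The name `best` and the explicit factor levels are illustrative choices.

```r
library(data.table)

# Fake data from the question, built directly as a data.table.
dt <- data.table(
  id        = c(1, 1, 2, 2, 5, 6, 6),
  car       = c(0, 1, 1, 1, 1, 1, 1),
  insurance = c("no", "yes", "yes", "yes", "no", "yes", "yes"),
  ins_type  = c("", "liab", "liab", "full", "", "full", "liab")
)

# Assumption: rank the insurance types explicitly, least preferred first.
dt[, ins_type := factor(ins_type, levels = c("", "liab", "full"))]

# Sort by id, then by each criterion in priority order (ascending).
setorder(dt, id, car, insurance, ins_type)

# .SD[.N] is the last row of each group, i.e. the most preferred row per id.
best <- dt[, .SD[.N], by = id]
print(best)
```

This version does not depend on how many times an id is repeated, because it never inspects the group size.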

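【Discussion】:

Thanks a lot, @suchait. I will test it and post back.
OK, @suchait, your code works well. If I understand correctly (sorry, DT beginner here...), you have: 1. created an idx variable holding the number of occurrences of each ID; 2. removed the no-car rows of people who appear twice; 3. created idx again to catch the remaining duplicates; 4. removed the "liab" row of people who still appear twice; 5. removed the idx variable. Is that right?
Well, thank you very much. I will test it for the case where a person has no car but appears only once in the dataset. Please give me a little time (for the other answers too) before I choose a solution. But I appreciate your help and you teaching me some DT. :)
Sure. Please also check out the dplyr solution.
Thank you very much for your help, and for the data.table code and lesson. I have decided to go with storaged's solution and explained why above. But your work is good, many thanks.

【Solution 3】:

This is an extension of @storaged's answer, but all in one dplyr chain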

df %>% 
   mutate(ins_type = relevel(ins_type, "liab")) %>% 
   group_by(id) %>% 
   arrange(car, insurance, ins_type) %>%      # sort and arrange are redundant
   slice(n())    # equivalent to do(tail(., 1))

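【Discussion】:

Thanks, @Cpak. I will test it and post back. Can relevel be used to order all of the levels, or more than one level at a time? For example: relevel(ins_type, "liab", "full"), so that I would get "1 liab", "2 full" and "3 NA"? I do not see it in the documentation, and using ifelse does not work either.
It does not look like it. From the documentation, relevel(x, ref, …), **ref** the reference level, typically a string. (But I have not tested it.)
Dear @Cpak, thanks for your help and the concise code. I voted for storaged's solution because I had to adopt his approach to set the levels. But your solution is very elegant and solves the question as posed perfectly. Thank you very much for your help.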
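To illustrate the relevel question raised in this discussion, here is a small base R sketch comparing `relevel()`, which moves a single level to the front, with `factor(x, levels = ...)`, which sets the full order in one call. The variable names `moved_one` and `reordered` are made up for the example, and the empty string stands in for the "no insurance type" case.

```r
# ins_type values from the question, as a factor with default (sorted) levels.
ins_type <- factor(c("", "liab", "liab", "full", "", "full", "liab"))
levels(ins_type)       # "" "full" "liab"

# relevel() only moves the reference level to the front; the rest keep their order.
moved_one <- relevel(ins_type, ref = "liab")
levels(moved_one)      # "liab" "" "full"

# factor() with an explicit levels vector sets the complete order at once.
reordered <- factor(ins_type, levels = c("liab", "full", ""))
levels(reordered)      # "liab" "full" ""
```

Sorting on `reordered` then follows the order given in `levels`, which is how Solution 1 builds its hierarchy.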