Remove duplicate rows with certain value in specific column: answers

[Question title]: Remove duplicate rows with certain value in specific column
[Posted]: 2020-07-08 09:04:58
[Question description]:

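I have a data frame and I want to remove rows that are duplicated in all columns except one, keeping the rows whose value in that remaining column is not a certain value.

In the example, rows 3 and 4 are duplicates in every column except col3, so I want to keep only one of them. The tricky part is that I want to keep row 4 rather than row 3, because row 3 has "excluded" in col3. In general, among duplicated rows I only want to keep the ones that are not "excluded".

My real data frame has many duplicated rows, and in each pair of duplicated rows one of the two is always "excluded".

Here is a reproducible example: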

a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d <- data.frame(a,b,c)

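Thank you very much!

Update: Alternatively, when removing duplicates, keep only the second observation (row 4).

[Question comments]:

Tags: r dataframe

[Solution 1]:

dplyr with a bit of base R should do the trick: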

 library(dplyr) 
 a <- c(1,2,3,3,3,7)
 b <- c(4,5,6,6,6,8)
 c <- c("red","green","brown","excluded","orange","excluded")
 d <- data.frame(a,b,c)

 d <- filter(d, !duplicated(d[,1:2]) | c!="excluded")

Result: 
  a b        c
1 1 4      red
2 2 5    green
3 3 6    brown
4 3 6   orange
5 7 8 excluded

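The filter removes any row that is both "excluded" and flagged as a duplicate. I also added an extra row ('brown') to the duplicated group in your example for testing.

[Comments]:

OK, I updated the solution. The duplicated function creates a logical vector marking the duplicates, and we then keep anything that is either not duplicated or not "excluded". I tested it on your data and got the expected result.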
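
A note on the filter in this answer: duplicated() does not flag the first occurrence of a repeated (a, b) pair, so when the "excluded" row comes first in its group, as in the question's original data, that row is not removed. One way around this is to flag duplicates from both directions. The sketch below assumes the goal is to drop "excluded" rows from any group with more than one member. The names key, in_dup_group and res are illustrative only.

```r
# Data from the question: the "excluded" row comes BEFORE its duplicate.
d <- data.frame(
  a = c(1, 2, 3, 3, 7),
  b = c(4, 5, 6, 6, 8),
  c = c("red", "green", "excluded", "orange", "excluded")
)

key <- d[, c("a", "b")]

# TRUE for every row whose (a, b) pair occurs more than once,
# including the first occurrence.
in_dup_group <- duplicated(key) | duplicated(key, fromLast = TRUE)

# Keep rows that are unique on (a, b), or that are not "excluded".
res <- d[!in_dup_group | d$c != "excluded", ]
print(res)
```

Row 3 is dropped here, while the lone "excluded" row (7, 8) is kept because it has no duplicate.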

[Solution 2]:

Here is an example with a loop:

a <- c(1,2,3,3,7)
b <- c(4,5,6,6,8)
c <- c("red","green","excluded","orange","excluded")
d<- data.frame(a,b,c)

# Row indices of duplicated rows (only the second and later occurrences are returned)
duplicated_rows <- which(duplicated(d[c("a", "b")]))

to_remove <- c()
# Loop over the duplicated rows
for (i in duplicated_rows) {
  # Find all rows sharing the same (a, b) values
  selection <- which(d$a == d$a[i] & d$b == d$b[i])
  # Store the indices of the rows in this duplicate group that are "excluded"
  to_remove <- c(to_remove, selection[which(d$c[selection] == "excluded")])
}

# Remove rows (skipped when nothing was flagged, since d[-c(), ] would return zero rows)
if (length(to_remove) > 0) d <- d[-to_remove, ]

print(d)

  a b        c
1 1 4      red
2 2 5    green
4 3 6   orange
5 7 8 excluded

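[Comments]:

The code works for the example. However, I have a large data frame (2.6 million rows by 35 columns), so I would like to avoid a for loop. Could you suggest another way? Thank you very much!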
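
One loop-free way to express the same rule as the loop (drop a row only when it is "excluded" and shares its (a, b) values with at least one other row) is to compute group sizes with vectorized calls. This is a sketch rather than a benchmarked solution. The names key, id, grp_size and d_clean are illustrative, and building the key with paste() assumes the "|" separator does not occur in the key columns.

```r
d <- data.frame(
  a = c(1, 2, 3, 3, 7),
  b = c(4, 5, 6, 6, 8),
  c = c("red", "green", "excluded", "orange", "excluded")
)

# One string key per row (assumption: "|" never appears in a or b).
key <- paste(d$a, d$b, sep = "|")

# Integer group id per row, then the size of each row's group.
id <- match(key, unique(key))
grp_size <- tabulate(id)[id]

# Drop rows that are "excluded" AND belong to a group with more than one row.
d_clean <- d[!(grp_size > 1 & d$c == "excluded"), ]
print(d_clean)
```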

[Solution 3]:

Here is one possibility... I hope it helps :)

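library(dplyr)

# Row numbers of "excluded" rows that belong to a duplicated (a, b) group
nquit <- (d %>%
  mutate(code = 1:nrow(d)) %>%
  group_by(a, b) %>%
  mutate(nDuplicate = n()) %>%
  filter(nDuplicate > 1) %>%
  filter(c == "excluded"))$code

# Drop those rows; the comma matters, d[-nquit] would drop columns instead.
# Assumes nquit is non-empty, as d[-integer(0), ] would return zero rows.
e <- d[-nquit, ]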

[Comments]:

[Solution 4]:

Another dplyr solution, shortening @Klone's approach:

d %>% mutate(c = factor(c, ordered = TRUE, 
                        levels = c("red", "green", "orange", "excluded"))) %>% # Order the factor variable
  arrange(c) %>% # Sort the data frame so that excluded comes last
  group_by(a, b) %>% # Group by the two columns that determine duplicates
  mutate(id = 1:n()) %>% # Assign IDs in each group
  filter(id == 1) # Only keep the first row in each group

Result:

# A tibble: 4 x 4
# Groups:   a, b [4]
      a     b c           id
  <dbl> <dbl> <ord>    <int>
1     1     4 red          1
2     2     5 green        1
3     3     6 orange       1
4     7     8 excluded     1
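
The same "sort so that excluded comes last, then keep the first row of each group" idea can be sketched in base R without hard-coding the factor levels. This assumes "excluded" is the only value that needs to sort last. The names ord, sorted and kept are illustrative.

```r
d <- data.frame(
  a = c(1, 2, 3, 3, 7),
  b = c(4, 5, 6, 6, 8),
  c = c("red", "green", "excluded", "orange", "excluded")
)

# Sort on a logical flag: FALSE (not excluded) sorts before TRUE (excluded).
# Ties keep their original relative order.
ord <- order(d$c == "excluded")
sorted <- d[ord, ]

# Keep the first row of each (a, b) group, which is a non-"excluded"
# row whenever the group contains one.
kept <- sorted[!duplicated(sorted[, c("a", "b")]), ]

# Restore the original row order using the preserved row names.
kept <- kept[order(as.integer(rownames(kept))), ]
print(kept)
```

Like the dplyr version, this keeps exactly one row per (a, b) group, so if a group held several non-"excluded" rows only the first of them would survive.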

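[Comments]:

[Solution 5]:

Regarding the edit at the end of your question:

"Update: Alternatively, when removing duplicates, keep only the second observation (row 4)."

Note that if the rows are ordered such that the row to keep is always the last one among the duplicated records, you can simply set fromLast=TRUE in the duplicated() function. This requests that rows be flagged as duplicates counting from the last one found in each duplicate group.

Using a slightly modified version of your data (I added more duplicated groups to better show that the procedure works in a more general case):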

a <- c(1,1,2,3,3,3,7)
b <- c(4,4,5,6,6,6,8)
c <- c("excluded", "red","green","excluded", "excluded","orange","excluded")
d <- data.frame(a,b,c)

  a b        c
1 1 4 excluded
2 1 4      red
3 2 5    green
4 3 6 excluded
5 3 6 excluded
6 3 6   orange
7 7 8 excluded

Using:

ind2remove = duplicated(d[,c("a", "b")], fromLast=TRUE)
(d_noduplicates = d[!ind2remove,])

we get:

  a b        c
2 1 4      red
3 2 5    green
6 3 6   orange
7 7 8 excluded

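Note that this does not require the rows of each duplicate group to be next to each other in the original data. The only thing that matters is that the record you want to keep appears last in the data within each duplicate group.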
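
As a small illustration of that last point, the sketch below uses made-up data (d2 is hypothetical and not from the question) in which the members of each duplicate group are interleaved rather than adjacent.

```r
# Hypothetical data: the (3, 6) and (1, 4) groups are interleaved.
d2 <- data.frame(
  a = c(3, 1, 3, 2, 1),
  b = c(6, 4, 6, 5, 4),
  c = c("excluded", "excluded", "orange", "green", "red")
)

# Flag every row that has a later row with the same (a, b) values.
ind2remove <- duplicated(d2[, c("a", "b")], fromLast = TRUE)

# Keep only the last occurrence of each (a, b) pair.
d2_noduplicates <- d2[!ind2remove, ]
print(d2_noduplicates)
```

Rows 1 and 2 are removed because a later row carries the same (a, b) values, even though other rows sit in between.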

[Comments]: