How can I remove the duplicate rows in R [duplicate]

【Question title】: How can I remove the duplicate rows in R [duplicate]
【Posted】: 2021-05-22 04:25:24
【Question description】:

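In my df, I treat c('apple', 'banana') and c('banana', 'apple') as the same, because the fruits are the same and only their order differs.

So how can I remove the first and second rows and keep only the last row (wanted_df)?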

df = data.frame(fruit1 = c('apple', 'banana', 'fig'),
                fruit2 = c('banana', 'apple', 'cherry'))
df

wanted_df = df[3,]

Any help would be greatly appreciated!

=============================

I ran into a problem with my real data.

frames2 has lost the rows with lag = 2. I want a data frame like wanted_frames.

library(dplyr)  # provides %>%, mutate(), filter(), as_tibble(), bind_rows()

pollution1 = c('pm2.5', 'pm10', 'so2', 'no2', 'o3', 'co')
pollution2 = c('pm2.5', 'pm10', 'so2', 'no2', 'o3', 'co')
dis = 'n'
lag = 1:2

frames = expand.grid(pollution1 = pollution1, 
                     pollution2 = pollution2,
                     dis = dis, 
                     lag = lag) %>% 
  mutate(pollution1 = as.character(pollution1),
         pollution2 = as.character(pollution2), 
         dis = as.character(dis)) %>% 
  as_tibble() %>% 
  filter(pollution1 != pollution2)

vec <- with(frames, paste(pmin(pollution1, pollution2), pmax(pollution1, pollution2)))

frames2 = frames[!duplicated(vec), ]

wanted_frames = frames2 %>% mutate(lag = 2) %>% bind_rows(frames2)

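【Question discussion】:

Can you show the expected output? Even a hand-made example of how you want frames2 to look would help.
@cmirian, hi, the last piece of code, wanted_frames, is my expected output.
pollution1 and pollution2 are identical. So if you apply a filter that drops duplicates, you end up with zero rows. I'm not entirely sure what you are trying to achieve.

Tags: r dplyr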

【Solution 1】:

Try this.

library(dplyr)
d <- filter(df, !(fruit1 %in% fruit2) | !(fruit2 %in% fruit1))

This gives

> d
  fruit1 fruit2
1    fig cherry

Update

As @JonSpring and @Phil commented, the updated code should be

df %>% rowwise() %>% filter(!(fruit1 %in% fruit2) | !(fruit2 %in% fruit1)) %>% ungroup()

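【Discussion】:

Such a simple idea. Shouldn't it be filter(df, !(fruit1 %in% fruit2) | !(fruit2 %in% fruit1))?
Of course, thanks @Phil - updated accordingly. Have a nice weekend.
I don't believe this works in all cases, e.g. df = data.frame(fruit1 = c('apple', 'cherry', 'banana', 'fig'), fruit2 = c('banana', 'apple', 'apple', 'cherry')). In that case row 2 is a unique combination but gets filtered out, because one of its elements is found in the other column of another row.
@JonSpring is right - it should be fixable with df %>% rowwise() %>% filter(...) %>% ungroup(), but that may make it slower.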
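
The counterexample from the comments can be checked directly. The sketch below is written in base R so that it runs without dplyr; the names df2, keep_in, keep_row and keep_key are made up for illustration. The mapply() call is an assumption about how a per-row evaluation behaves: each %in% then sees a single value on each side, which makes it an equality test. Under that assumption the per-row form seems to keep every row whose two columns differ, whereas an order-independent key keeps only the unique pairs.

```r
# Counterexample from the discussion: rows 1 and 3 are the same unordered
# pair, rows 2 and 4 are unique pairs.
df2 <- data.frame(fruit1 = c("apple", "cherry", "banana", "fig"),
                  fruit2 = c("banana", "apple", "apple", "cherry"),
                  stringsAsFactors = FALSE)

# Column-wide membership test, as in the original filter() call.
keep_in <- !(df2$fruit1 %in% df2$fruit2) | !(df2$fruit2 %in% df2$fruit1)

# Per-row membership test (assumed stand-in for rowwise()): with one value
# on each side, %in% is just ==.
keep_row <- mapply(function(a, b) !(a %in% b) | !(b %in% a),
                   df2$fruit1, df2$fruit2, USE.NAMES = FALSE)

# Order-independent key: keep a row only if its key occurs exactly once.
key <- paste(pmin(df2$fruit1, df2$fruit2), pmax(df2$fruit1, df2$fruit2))
keep_key <- !(duplicated(key) | duplicated(key, fromLast = TRUE))

which(keep_in)   # 4
which(keep_row)  # 1 2 3 4
which(keep_key)  # 2 4
```

Only the key-based test returns rows 2 and 4, the two combinations that occur once.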

【Solution 2】:

Base R way:

vec <- with(df, paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2)))
df[!(duplicated(vec) | duplicated(vec, fromLast = TRUE)), ]

#   fruit1 fruit2
#3    fig cherry
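
To see why the two duplicated() calls are combined, it may help to look at the intermediate values. The following is a minimal walk-through on the question's toy data; dup_forward, dup_backward and result are names introduced here for illustration, and stringsAsFactors = FALSE is set explicitly on the assumption that the columns must be character vectors for pmin() and pmax() to compare them.

```r
# Same toy data as in the question, kept as character columns.
df <- data.frame(fruit1 = c("apple", "banana", "fig"),
                 fruit2 = c("banana", "apple", "cherry"),
                 stringsAsFactors = FALSE)

# Order-independent key: alphabetically smaller name first, larger second.
vec <- with(df, paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2)))
vec
# "apple banana" "apple banana" "cherry fig"

# duplicated() flags every occurrence after the first one ...
dup_forward  <- duplicated(vec)                   # FALSE  TRUE FALSE
# ... and with fromLast = TRUE every occurrence before the last one.
dup_backward <- duplicated(vec, fromLast = TRUE)  #  TRUE FALSE FALSE

# A row is dropped if its key occurs more than once, in either direction.
result <- df[!(dup_forward | dup_backward), ]
result
```

Using duplicated(vec) alone would keep the first row of each repeated pair; combining both directions removes every member of a repeated pair.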

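【Discussion】:

@Ronak Shah, thanks for your reply, but a problem came up when I used your method on my real data; I have updated my question.
@zhiweili 1) You did not use the complete code from my answer. 2) In the data frame you shared, every value is duplicated, so everything is removed from the data.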
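
One possible reading of the asker's goal is to keep one row per unordered pollutant pair for each lag value. Under that assumption, a sketch is to make lag part of the key, so that the same pair at lag 1 and at lag 2 no longer counts as a duplicate. This keeps the first occurrence of each key, as the question's frames2 does, instead of dropping all repeated rows. The names pollutants and frames_dedup are hypothetical, and the code uses base R in place of the dplyr pipeline.

```r
pollutants <- c("pm2.5", "pm10", "so2", "no2", "o3", "co")

# All ordered combinations for both lags, as character columns.
frames <- expand.grid(pollution1 = pollutants,
                      pollution2 = pollutants,
                      dis = "n",
                      lag = 1:2,
                      stringsAsFactors = FALSE)

# Drop rows that pair a pollutant with itself.
frames <- frames[frames$pollution1 != frames$pollution2, ]

# Assumption: a pair only counts as a duplicate within the same lag,
# so lag is appended to the order-independent key.
key <- with(frames, paste(pmin(pollution1, pollution2),
                          pmax(pollution1, pollution2),
                          lag))

# Keep the first row seen for each (pair, lag) key.
frames_dedup <- frames[!duplicated(key), ]

nrow(frames_dedup)        # 30: 15 unordered pairs for each of 2 lags
table(frames_dedup$lag)   # 15 rows per lag
```

The row order differs from wanted_frames in the question, which lists the lag = 2 rows first; the set of pair and lag combinations should be the same.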

【Solution 3】:

Here is a low-tech dplyr approach. Build a sorted key, then keep only the rows whose key is unique.

library(dplyr)
df %>%
    mutate(key = paste(pmin(fruit1, fruit2), pmax(fruit1, fruit2))) %>%
    add_count(key) %>%
    filter(n == 1)

【Discussion】: