Remove duplicates by multiple conditions: answers

【Question title】: Remove duplicates by multiple conditions
【Posted】: 2018-04-26 08:03:25
【Question description】:

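I have data in which individuals (Name) appear multiple times within an Eggphase category. I want only one sample per individual, but I don't just want to keep the first one that R finds. I want to keep the one whose Group appears most often across all the other categories. Hopefully my example helps to illustrate this.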

library(tidyverse)
myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

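I want to keep only one row per Name within each combination of Tissue, Food and Eggphase, but I want to choose the row whose Group appears in most (if not all) of the different egg phases that share the same Tissue and Food combination.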

   #results I want
  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     c
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     c
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

I tried:

one_bird <- myDF %>% 
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)

but it only keeps the first entry:

  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     a
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     b
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

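Any ideas on how to tell it to choose the row whose Group appears in most (if not all) of the egg phases within a Tissue and Food combination? In my example, the groups that appear most often in the Tissue and Food combination of wb and fl are c and b, but Kia does not appear in Group b, so c is the better choice. As in this example, my data has duplicates that come from groups that are not the most common one. How do I make it choose the next most common Group for that row?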

I hope I have explained this clearly enough.

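【Comments】:

Tags: r dplyr tidyr tidyverse

【Solution 1】:

One option is to create a frequency column grouped by 'Tissue', 'Food' and 'Group', then arrange in descending order of 'n' and use distinct, which keeps the first row of each combination.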

library(dplyr)
myDF %>%
     group_by(Tissue, Food, Group) %>%
     mutate(n = n()) %>%
     arrange(Tissue, Food, Eggphase, Name, desc(n)) %>%
     ungroup() %>%
     distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
     select(-n)

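【Discussion】:

Thanks, this is on the right track, but the problem in my real data is that each Group has 20 entries, because I sampled the same birds at every sampling. If I include Eggphase I get closer to what I want, except that I don't know what coverage of Eggphase each group has. This code from Nate: eggphaseCount <- with(myDF,aggregate(x=list(Group_phaseCt=Eggphase),by=list(Group=Group),FUN=function(x) length(unique(x)))) is the idea I am after. There may be multiple groups with the max of 4, so it doesn't matter which of them I pick.
@mckisa That's fine. You should go with whichever works best for you.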
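The idea in the comment above (rank groups by how many distinct egg phases they cover, rather than by how many rows they have) could be sketched in dplyr roughly as follows. This is only a sketch under assumptions: the column name phase_ct is made up for illustration, and coverage is counted within each Tissue and Food combination, which may or may not be what the real data needs.

```r
library(dplyr)

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

one_bird <- myDF %>%
  group_by(Tissue, Food, Group) %>%
  # phase_ct (hypothetical name): number of distinct egg phases the group covers,
  # so repeated sampling of the same birds does not inflate the score
  mutate(phase_ct = n_distinct(Eggphase)) %>%
  ungroup() %>%
  # put the best-covered group first within each Tissue/Food/Eggphase/Name combination
  arrange(Tissue, Food, Eggphase, Name, desc(phase_ct)) %>%
  # keep only the first row of each combination
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(-phase_ct)
```

On the sample data this picks Group c for both Kia and Lucy in the after phase. It does not resolve ties: where two groups cover the same number of phases, the choice between them is not driven by the data.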

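【Solution 2】:

I suppose this post and the other answer should give me a reason to learn dplyr and the tidyverse, but since I have already put in the effort to produce a working answer, here it is: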

myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

# I usually have the following setting active: options(stringsAsFactors=F)
# The following might error without such a setting

# Create a var that indicates a duplicate or a record with a duplicate
myDF$duplicate <- duplicated(myDF[,c('Name','Eggphase','Tissue','Food')])
myDF$duplicate <- ifelse(duplicated(myDF[,c('Name','Eggphase','Tissue','Food')],fromLast=T),yes=T, no=myDF$duplicate)

# Count eggphases by group 
eggphaseCount <- with(myDF,aggregate(x=list(Group_phaseCt=Eggphase),by=list(Group=Group),FUN=function(x) length(unique(x))))
# Merge to DF
myDF <- merge(myDF,eggphaseCount,by='Group',all=T)

# Get the max # of eggphases by name
scale <- with(myDF,aggregate(x=list(PhaseMax=Group_phaseCt),by=list(Name=Name),FUN=max))
# Add to DF
myDF <- merge(myDF,scale,by='Name',all=T)

# Take the ratio
myDF$bestRatio <- with(myDF,Group_phaseCt/PhaseMax)
# Keep only those that aren't a duplicate, or are a duplicate and have the highest ratio
myDF2 <- myDF[with(myDF,which(duplicate==FALSE | (duplicate==TRUE & bestRatio==1))),]

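【Discussion】:

This is really cool. Unfortunately, my knowledge of this syntax is non-existent. I understand what your code does, but I am not sure I can edit it. I tried running it on my data, but it includes all rows with a ratio of 1 rather than just one row. In that case I would also like all the selected rows to belong to the same group. I tried myDF2$dup2 <- duplicated(myDF2[,c('Name','Eggphase','Tissue','Food')]) followed by myDF2 %>% filter(dup2 == FALSE), but that gives one row fewer than one_bird <- myDF %>% distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE), which makes me question its accuracy. Thoughts?
Yes, on checking again I found that, on the sample data, my code does not break the ties between groups that you mention. If I were to fix it, I would go back and edit the code to add logic that picks the row whose group matches the group in the later phases, or something along those lines. That would make my clunky code even longer, and since you seem to have found a solution, I'll move on. Congratulations!

【Solution 3】:

Hey, thanks for your help!! A combination of your suggestions seems to work: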
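One way the tie-break mentioned in the last comment might look, written in dplyr rather than base R. This is a sketch under assumptions, not the answerer's code: the tie-break rule (prefer the Group in which the same Name shows up in the most egg phases) and the column names group_phase_ct and name_phase_ct are made up for illustration.

```r
library(dplyr)

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

myDF2 <- myDF %>%
  group_by(Group) %>%
  # distinct egg phases covered by each Group (same idea as eggphaseCount)
  mutate(group_phase_ct = n_distinct(Eggphase)) %>%
  group_by(Name, Group) %>%
  # assumed tie-break: distinct egg phases in which this Name has this Group
  mutate(name_phase_ct = n_distinct(Eggphase)) %>%
  ungroup() %>%
  # best-covered Group first, ties broken by the per-Name count
  arrange(Tissue, Food, Eggphase, Name,
          desc(group_phase_ct), desc(name_phase_ct)) %>%
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(Tissue, Food, Eggphase, Name, Group)
```

On the sample data this returns the same six rows as the "results I want" table in the question, in a different row order. Groups that are still tied on both counts remain unresolved, so it is worth checking on the real data.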

# Create a var that indicates a duplicate or a record with a duplicate
myDF$duplicate <- duplicated(myDF[,c('Name','Eggphase','Tissue','Food')])
# this alone won't flag the first entry of a duplicated combination,
# so also check in the other direction (fromLast) to catch it
myDF$duplicate <- ifelse(duplicated(myDF[,c('Name','Eggphase','Tissue','Food')],fromLast=T),yes=T, no=myDF$duplicate)

# Count eggphases by group 
eggphaseCount <- with(myDF,aggregate(x=list(Group_phaseCt=Eggphase),by=list(Group=Group),FUN=function(x) length(unique(x))))
# Merge to DF
myDF <- merge(myDF,eggphaseCount,by='Group',all=T)

# Get the max # of eggphases by name
scale <- with(myDF,aggregate(x=list(PhaseMax=Group_phaseCt),by=list(Name=Name),FUN=max))
# Add to DF
myDF <- merge(myDF,scale,by='Name',all=T)

# Take the ratio
myDF$bestRatio <- with(myDF,Group_phaseCt/PhaseMax)

# make new df without duplicates
library(dplyr)
myDF2 <- myDF %>% 
#arrange in a way that the first duplicate is from the group with the most eggphases
#and the name appears in the most egg phases 
#(Group_phaseCt has to come before Group, otherwise the alphabetical order of Group
# decides which duplicate comes first)
  arrange(Tissue, Food, Eggphase, Name, desc(Group_phaseCt), Group, desc(PhaseMax)) %>% 
#select only distinct rows according to specified var keep all others
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)
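
Since the row counts of the different attempts did not agree earlier in the thread, a quick sanity check may be useful: the deduplicated result should have exactly one row per Tissue, Food, Eggphase and Name combination found in the original data. A minimal sketch, where keys, deduped and n_keys are names chosen here for illustration:

```r
library(dplyr)

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

keys <- c("Tissue", "Food", "Eggphase", "Name")

# any deduplicated result can be checked this way; plain distinct() is used as a stand-in
deduped <- myDF %>%
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)

# number of unique key combinations in the original data
n_keys <- nrow(unique(myDF[, keys]))

# one row per combination, and no combination left duplicated
stopifnot(nrow(deduped) == n_keys)
stopifnot(anyDuplicated(deduped[, keys]) == 0)
```

If either check fails for a given result, rows were dropped or duplicates were left in.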

【Discussion】: