【Question title】: Remove duplicates by multiple conditions
【Posted】: 2018-04-26 08:03:25
【Question】:

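I have data in which individuals (Name) appear more than once within an Eggphase category. I want only one sample per individual, but I don't just want to keep the first one R finds. I want to keep the one whose Group occurs most often across all the other categories. Hopefully my example helps illustrate this.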

library(tidyverse)
myDF <- read.table(text="Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

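I want to keep only one row per Name within each combination of Tissue, Food and Eggphase, but I want to pick the row whose Group appears in most (if not all) of the different egg phases that share the same Tissue and Food combination.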

   #results I want
  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     c
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     c
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

I tried

one_bird <- myDF %>% 
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)

but it only keeps the first entry

  Tissue Food Eggphase  Name Group
1     wb   fl    after   Kia     a
2     wb   wf   before   Kia     b
3     wb   fl   before  Lucy     c
4     wb   fl    after  Lucy     b
5     wb   fl  yolkdep  Jess     c
6     wb   fl  yolkdep Betty     b

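Any ideas on how to tell it to pick the row whose Group appears in most (if not all) of the egg phases within a Tissue/Food combination? In my example, the Groups that appear most within the wb/fl Tissue/Food combination are c and b, but Kia does not appear in Group b, so c is the better choice. As in this example, my data has duplicates that come from Groups that are not the most common one; how do I make it pick the next most common Group for that row?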

I hope I've made myself clear enough.
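To make the selection criterion concrete, the coverage of each Group can be tabulated as the number of distinct egg phases it appears in within a Tissue/Food combination. The following is only a sketch under that reading of the question; the names coverage and n_phases are made up for illustration and are not part of the original post.

```r
library(dplyr)

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

# Hypothetical helper table: one row per Tissue/Food/Group with the number
# of distinct egg phases that Group covers in that combination
coverage <- myDF %>%
  distinct(Tissue, Food, Group, Eggphase) %>%      # drop repeated phases
  count(Tissue, Food, Group, name = "n_phases") %>% # count phases per Group
  arrange(Tissue, Food, desc(n_phases))

print(coverage)
```

On the sample data, Group c covers three egg phases in the wb/fl combination while a and b cover two each, which is consistent with c being the preferred Group there.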

【Comments】:

    Tags: r dplyr tidyr tidyverse


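    【Answer 1】:

    One option is to create a frequency column grouped by 'Tissue', 'Food' and 'Group', then arrange in descending order of 'n' and use distinct, which keeps the first row of each combination: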

    library(dplyr)
    myDF %>%
         group_by(Tissue, Food, Group) %>%
         mutate(n = n()) %>%
         arrange(Tissue, Food, Eggphase, Name, desc(n)) %>%
         ungroup() %>%
         distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
         select(-n)
    

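    【Comments】:

    • Thanks, this is on the right track, but the problem with my real data is that each Group has 20 entries, because I sampled the same birds at every sampling. If I include Eggphase I get closer to what I want, except that I don't know what Eggphase coverage each Group has. This code from Nate: eggphaseCount <- with(myDF,aggregate(x=list(Group_phaseCt=Eggphase),by=list(Group=Group),FUN=function(x) length(unique(x)))) is the idea I'm after. There may be several Groups with the maximum of 4, in which case it doesn't matter which one I pick.
    • @mckisa That's fine. You should pick whichever one works best for you.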
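The point raised in the comment above (ranking Groups by how many egg phases they cover rather than by how many rows they have) could be folded into the dplyr pipeline roughly as follows. This is a sketch, not the answerer's code: n_phases is an illustrative column name, coverage is counted within each Tissue/Food combination, and ties are broken alphabetically by Group on the assumption that any tied Group is acceptable.

```r
library(dplyr)

myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE)

one_bird <- myDF %>%
  group_by(Tissue, Food, Group) %>%
  # n_phases (illustrative name): distinct egg phases covered by this Group
  # within the Tissue/Food combination, independent of how many rows it has
  mutate(n_phases = n_distinct(Eggphase)) %>%
  ungroup() %>%
  # widest coverage first within each key; Group is an arbitrary tie-breaker
  arrange(Tissue, Food, Eggphase, Name, desc(n_phases), Group) %>%
  # distinct() keeps the first row of each key, i.e. the best-covered Group
  distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) %>%
  select(-n_phases)

print(one_bird)
```

On the sample data this keeps Group c for both Kia and Lucy in the after phase. For Betty, Groups a and b cover the same number of phases, so the alphabetical tie-break keeps a, whereas the desired output in the question shows b.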
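    【Answer 2】:

    I suppose this post and its answers should give me a reason to learn dplyr and the tidyverse, but since I've already made the effort to work out a valid answer, here it is: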

    myDF <- read.table(text="Tissue Food Eggphase Name Group
      wb fl after Kia a
      wb fl after Kia c
      wb wf before Kia b
      wb fl before Lucy c
      wb fl after Lucy b
      wb fl after Lucy c
      wb fl yolkdep Jess c
      wb fl yolkdep Betty a
      wb fl yolkdep Betty b", header = TRUE)
    
    # I usually have the following setting active: options(stringsAsFactors=F)
    # The following might error without such a setting
    
    # Create a var that indicates a duplicate or a record with a duplicate
    myDF$duplicate <- duplicated(myDF[,c('Name','Eggphase','Tissue','Food')])
    myDF$duplicate <- ifelse(duplicated(myDF[,c('Name','Eggphase','Tissue','Food')],fromLast=T),yes=T, no=myDF$duplicate)
    
    # Count eggphases by group 
    eggphaseCount <- with(myDF,aggregate(x=list(Group_phaseCt=Eggphase),by=list(Group=Group),FUN=function(x) length(unique(x))))
    # Merge to DF
    myDF <- merge(myDF,eggphaseCount,by='Group',all=T)
    
    # Get the max # of eggphases by name
    scale <- with(myDF,aggregate(x=list(PhaseMax=Group_phaseCt),by=list(Name=Name),FUN=max))
    # Add to DF
    myDF <- merge(myDF,scale,by='Name',all=T)
    
    # Take the ratio
    myDF$bestRatio <- with(myDF,Group_phaseCt/PhaseMax)
    # Keep only those that aren't a duplicate, or are a duplicate and have the highest ratio
    myDF2 <- myDF[with(myDF,which(duplicate==FALSE | (duplicate==TRUE & bestRatio==1))),]
    

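    【Comments】:

    • This is really cool. Unfortunately, my knowledge of this syntax is non-existent. I understand what your code does, but I'm not sure I can edit it. I tried running it on my data, but it keeps all rows with a ratio of 1 rather than just one row. In that case I would also like all the selected rows to belong to the same Group. I tried myDF2$dup2 <- duplicated(myDF2[,c('Name','Eggphase','Tissue','Food')]) followed by myDF2 %>% filter(dup2 == FALSE), but that gives one row fewer than one_bird <- myDF %>% distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE), which makes me question its accuracy. Thoughts?
    • Yes, on another check I found that, on the sample data, my code does not break the ties between Groups that you mentioned. If I were to fix it, I would go back and edit the code to add logic that picks the row whose Group matches the Group of the later phases, or something along those lines. That would make my clunky code even longer, and since you seem to have found a solution, I'll move on. Congratulations!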
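One way the ties discussed above might be resolved in base R is to sort the rows so that the preferred row of each Name/Eggphase/Tissue/Food combination comes first, and then drop everything flagged by duplicated(). This is a sketch rather than the answerer's intended fix: it reuses the Group_phaseCt idea from the answer, while phase_ct, ord, sorted and one_per_key are illustrative names, and the final alphabetical tie-break on Group is an arbitrary assumption.

```r
myDF <- read.table(text = "Tissue Food Eggphase Name Group
  wb fl after Kia a
  wb fl after Kia c
  wb wf before Kia b
  wb fl before Lucy c
  wb fl after Lucy b
  wb fl after Lucy c
  wb fl yolkdep Jess c
  wb fl yolkdep Betty a
  wb fl yolkdep Betty b", header = TRUE, stringsAsFactors = FALSE)

# Number of distinct egg phases per Group (same idea as eggphaseCount above)
phase_ct <- tapply(myDF$Eggphase, myDF$Group, function(x) length(unique(x)))
myDF$Group_phaseCt <- as.integer(phase_ct[myDF$Group])

key <- c("Tissue", "Food", "Eggphase", "Name")

# Within each key, put the Group with the widest coverage first;
# Group itself is the last, arbitrary tie-breaker
ord <- order(myDF$Tissue, myDF$Food, myDF$Eggphase, myDF$Name,
             -myDF$Group_phaseCt, myDF$Group)
sorted <- myDF[ord, ]

# duplicated() flags every row after the first of each key, so negating it
# keeps exactly one row per key
one_per_key <- sorted[!duplicated(sorted[, key]), ]

print(one_per_key)
```

Because exactly one row survives per key, the row count matches what distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE) returns; only the choice of which row survives differs.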
    【Answer 3】:

    Hey, thanks for your help!! A combination of your suggestions seems to work:

    library(dplyr)  # needed for %>%, arrange() and distinct() below

    # Create a var that indicates a duplicate or a record with a duplicate
    myDF$duplicate <- duplicated(myDF[,c('Name','Eggphase','Tissue','Food')])
    # duplicated() alone does not flag the first entry of a duplicated combination,
    # so also check from the last row backwards
    myDF$duplicate <- ifelse(duplicated(myDF[,c('Name','Eggphase','Tissue','Food')],fromLast=T),yes=T, no=myDF$duplicate)
    
    # Count eggphases by group 
    eggphaseCount <- with(myDF,aggregate(x=list(Group_phaseCt=Eggphase),by=list(Group=Group),FUN=function(x) length(unique(x))))
    # Merge to DF
    myDF <- merge(myDF,eggphaseCount,by='Group',all=T)
    
    # Get the max # of eggphases by name
    scale <- with(myDF,aggregate(x=list(PhaseMax=Group_phaseCt),by=list(Name=Name),FUN=max))
    # Add to DF
    myDF <- merge(myDF,scale,by='Name',all=T)
    
    # Take the ratio
    myDF$bestRatio <- with(myDF,Group_phaseCt/PhaseMax)
    
    # make new df without duplicates
    myDF2 <- myDF %>% 
    # arrange so that, within each Tissue/Food/Eggphase/Name combination, the row from
    # the Group covering the most egg phases comes first; Group only breaks ties
    # (sorting by Group before desc(Group_phaseCt) would always keep the alphabetically first Group)
      arrange(Tissue, Food, Eggphase, Name, desc(Group_phaseCt), desc(PhaseMax), Group) %>% 
    # keep the first row of each combination, retaining all other columns
      distinct(Tissue, Food, Eggphase, Name, .keep_all = TRUE)
    
