【Question title】: R: select one row from duplicated rows after comparing multiple conditions
【Posted】: 2015-09-18 00:36:02
【Question】:

I extracted these duplicated records from a large data set. Now I need to select one row from each set of duplicated rows.

ID <- c("6820","6820","17413","17413","38553","38553","52760","52760","717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12","2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01","2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <-c("firsttime","new","new","active","active","active","firsttime","new","active","new","active","new","active","active")
data <- data.frame(ID,date,type,level)

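The data frame looks like this:

I want to compare as follows: for each ID, if the dates differ, keep all of the rows in df.right; if the dates are the same, compare type and choose in the order LC > MC > YA > ST (for example, choose MC over YA) and put the chosen row in df.right; if the types are also the same, compare level and choose in the order active > new > firsttime (for example, choose new over firsttime), then put the chosen row in df.right.

I tried using foreach. This is only the first step, and it does not work for IDs that have 3 duplicated rows.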

library(foreach)
foreach(i = unique(data$ID), .combine = 'rbind') %do% {
  if (data[data$ID==i, "date"][1] == data[data$ID==i, "date"][2]) b <- data[data$ID==i,]
}

The result should look like this: Does anyone know how to do this? I would really appreciate it. Thanks.

【Comments】:

    Tags: r


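    【Solution 1】:

    The dplyr package is well suited to this kind of task.

    With factors you can specify how the categories should be ordered. Then, for each unique ID/date pair, you can pick the first type and, within that type, the first level.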
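
As a small illustration of the factor-ordering idea, here is a base R sketch. The vector `x` is made up for the example and is not the question's data. It shows that `sort()` on a factor follows the order in which the levels were listed, not alphabetical order.

```r
# Hypothetical toy vector (an assumption for illustration only)
x <- c("firsttime", "new", "firsttime")

# Levels listed from highest to lowest priority
f <- factor(x, levels = c("active", "new", "firsttime"))

sorted <- sort(f)                   # ordered by level position
best <- as.character(sorted[1])     # highest-priority value present
alphabetical_first <- sort(x)[1]    # plain character sort, for contrast

print(best)
print(alphabetical_first)
```

Here `best` is "new", while the plain character sort would put "firsttime" first, which is why the levels are declared explicitly.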

    library(dplyr)
    
    ID <- c("6820","6820","17413","17413","38553","38553","52760","52760","717841","717841","717841","747187","747187","747187")
    date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12","2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01","2014-12-02","2014-03-01","2014-05-12","2014-05-12")
    type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
    level <-c("firsttime","new","new","active","active","active","firsttime","new","active","new","active","new","active","active")
    
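    # List the levels from highest to lowest priority so that sort() puts the preferred value first
    type <- factor(type, levels=c("LC", "MC", "YA", "ST"))
    
    level <- factor(level, levels=c("active", "new", "firsttime"))
    
    data <- data.frame(ID,date,type,level)
    
    # Within each ID/date pair, keep the best type, then the best level among the remaining rows
    df.right <- data %>%
      group_by(ID, date) %>%
      filter(type == sort(type)[1]) %>%
      filter(level == sort(level)[1])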
    

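    【Comments】:

    • I don't think this answer produces the correct output.
    • @pcantalupo It does not exactly match the OP's sample output, but I think the OP's sample output is incorrect: between rows 13 and 14, row 13 should be kept (not 14), because LC takes priority over MC.
    • Ah, I was wondering why things didn't look quite right.
    • This is elegant; the improvement I would suggest is to use arrange to sort by type and level, then top_n to pull out the top element; so after group_by, it would be arrange(type, level) %>% top_n(1)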
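
The sort-then-take-first idea from the last comment can be sketched as follows. This is a variant under stated assumptions, not the commenter's exact code: it uses `slice(1)` rather than `top_n(1)`, because `top_n()` keeps the rows with the largest value of a weighting column rather than the first row in the current order. The factor levels follow the priority stated in the question.

```r
library(dplyr)

# Sample data from the question
ID <- c("6820","6820","17413","17413","38553","38553","52760","52760",
        "717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12",
          "2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01",
          "2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <- c("firsttime","new","new","active","active","active","firsttime",
           "new","active","new","active","new","active","active")

# Levels listed from highest to lowest priority
type <- factor(type, levels = c("LC", "MC", "YA", "ST"))
level <- factor(level, levels = c("active", "new", "firsttime"))
data <- data.frame(ID, date, type, level)

df.right <- data %>%
  arrange(type, level) %>%   # preferred type first, then preferred level
  group_by(ID, date) %>%
  slice(1) %>%               # first (preferred) row of each ID/date pair
  ungroup()

print(df.right)
```

Unlike the two `filter()` calls, this always returns exactly one row per ID/date pair, even when two rows are identical in both type and level.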
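    【Solution 2】:

    The trick here is to order the levels of type and level as needed. Two rounds of de-duplication are then required: first, remove duplicate rows based on the ID, date, type columns; second, remove duplicate rows based on the first two columns: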

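    # Levels sort in the order listed, so ST ranks first here; this reproduces the output
    # shown for this answer, whereas the priority stated in the question is LC > MC > YA > ST
    type = factor(type, levels=c("ST","YA","MC","LC"))
    level = factor(level, levels=c("active","new","firsttime"))
    data <- data.frame(ID,date,type,level)
    
    d = with(data, data[order(ID, date, type, level),])
    # !duplicated() replaces -which(duplicated()), which would drop every row
    # if there happened to be no duplicates
    e = d[!duplicated(d[,1:3]),]          # one row per ID/date/type
    df.right = e[!duplicated(e[,1:2]),]   # one row per ID/date
    df.right = df.right[order(as.numeric(as.character(df.right$ID)), df.right$date),]
    df.right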
    

    Output:

           ID       date type     level
    1    6820 2014-06-12   ST firsttime
    2    6820 2015-06-11   ST       new
    4   17413 2014-05-01   MC    active
    5   38553 2014-06-12   LC    active
    6   38553 2015-06-11   LC    active
    8   52760 2014-10-24   YA       new
    9  717841 2014-05-01   YA    active
    11 717841 2014-12-02   MC    active
    12 747187 2014-03-01   LC       new
    14 747187 2014-05-12   MC    active
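
If the priority stated in the question (LC > MC > YA > ST) is what is wanted, a base R sketch along the same lines could look like this. It is an assumption-laden variant, not the answer author's code: the levels are listed from highest to lowest priority, and a single de-duplication on ID and date is used, since the ordering already puts the preferred row first within each pair.

```r
# Sample data from the question
ID <- c("6820","6820","17413","17413","38553","38553","52760","52760",
        "717841","717841","717841","747187","747187","747187")
date <- c("2014-06-12","2015-06-11","2014-05-01","2014-05-01","2014-06-12",
          "2015-06-11","2014-10-24","2014-10-24","2014-05-01","2014-05-01",
          "2014-12-02","2014-03-01","2014-05-12","2014-05-12")
type <- c("ST","ST","MC","MC","LC","LC","YA","YA","YA","YA","MC","LC","LC","MC")
level <- c("firsttime","new","new","active","active","active","firsttime",
           "new","active","new","active","new","active","active")

# Levels listed from highest to lowest priority, as stated in the question
type <- factor(type, levels = c("LC", "MC", "YA", "ST"))
level <- factor(level, levels = c("active", "new", "firsttime"))
data <- data.frame(ID, date, type, level, stringsAsFactors = FALSE)

# Sort so the preferred row comes first within each ID/date pair
d <- data[order(data$ID, data$date, data$type, data$level), ]

# Keep the first row of each ID/date pair
df.right <- d[!duplicated(d[, c("ID", "date")]), ]

# Restore numeric ID order for display
df.right <- df.right[order(as.numeric(df.right$ID), df.right$date), ]
print(df.right)
```

With this level order, row 13 (LC) is kept for ID 747187 on 2014-05-12 instead of row 14 (MC).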
    

    【Comments】:
