【Question title】: R: How to delete rows that differ from another row in just one (two, three) column?
【Posted】: 2018-07-05 21:05:09
【Question description】:

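I have a data frame like the example below. Sometimes a row contains the same information about an object as another row, except that one (or more) columns contain "NA". I only want the rows that carry as much information as possible, so I want to delete every row that contains "NA" but otherwise has the same information as another row. "NA" may appear in column C or D or both (never in A or B). If there is no "more accurate" row, the row containing "NA" must be kept.

I have tried a for loop (see the example) and it works: rows 1 and 6 are deleted. However, I would have to adapt it to also check column C, and my real data has more columns and therefore more possible combinations, which makes this solution impractical and error-prone.

Is there another way to solve this easily? Thanks!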

df <- rbind(data.frame(A = "obj1", B = "1", C = "2", D = "NA"), 
            data.frame(A = "obj1", B = "1", C = "2", D = "3"),
            data.frame(A = "obj2", B = "1", C = "NA", D = "3"),
            data.frame(A = "obj2", B = "1", C = "2", D = "3"),
            data.frame(A = "obj2", B = "3", C = "2", D = "3"),
            data.frame(A = "obj2", B = "3", C = "2", D = "NA"),
            data.frame(A = "obj3", B = "2", C = "4", D = "6"),
            data.frame(A = "obj4", B = "2", C = "NA", D = "NA"))

toBeDeleted <- c(55)  # dummy out-of-range index, so that -toBeDeleted is never an empty index

for (i in 1:nrow(df)){
  thisRow <- df[i,]

  if (thisRow$D == "NA"){
    for (j in i:nrow(subset(df, A == thisRow$A))){
      anotherRow <- df[j,]
      if (anotherRow$A == thisRow$A & anotherRow$B == thisRow$B 
          & anotherRow$C == thisRow$C & anotherRow$D != thisRow$D){
        toBeDeleted <- c(toBeDeleted,i)
      }
    }
  }
}

df2 <- df[-toBeDeleted,]

【Question comments】:

    Tags: r


    【Solution 1】:

    We can combine duplicated(df[1:2]), duplicated(df[1:2], fromLast = TRUE) and rowSums(is.na(df)) > 0 to exclude every row that contains an NA and is duplicated on the first two columns:

    df <- rbind(data.frame(A = "obj1", B = "1", C = "2", D = NA), 
                data.frame(A = "obj1", B = "1", C = "2", D = "3"),
                data.frame(A = "obj2", B = "1", C = NA, D = "3"),
                data.frame(A = "obj2", B = "1", C = "2", D = "3"),
                data.frame(A = "obj2", B = "3", C = "2", D = "3"),
                data.frame(A = "obj2", B = "3", C = "2", D = NA),
                data.frame(A = "obj3", B = "2", C = "4", D = "6"),
                data.frame(A = "obj4", B = "2", C = NA, D = NA))
    
    df[!((duplicated(df[1:2]) | duplicated(df[1:2], fromLast = TRUE)) & rowSums(is.na(df)) > 0),]
    
         A B    C    D
    2 obj1 1    2    3
    4 obj2 1    2    3
    5 obj2 3    2    3
    7 obj3 2    4    6
    8 obj4 2 <NA> <NA>
    

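    It is a plain subset, so no loop is needed, and it is very fast even on large data. It works like this:

    We subset the data with df[] and use !() to exclude all rows that are duplicated on the first two columns, df[1:2], and that have at least one NA, rowSums(is.na(df)) > 0. For this to work you need real NA values in your data, not the character string "NA" used in the question's sample data. If you only have "NA", use rowSums(df == "NA") > 0 instead.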
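To make the pieces of that one-liner visible, here is a minimal sketch that builds each logical vector separately on the sample data. The names has_twin, has_na and df_chr are made up for illustration, and the data frame is built with stringsAsFactors = FALSE so that all columns are plain character vectors.

```r
# Sample data with real NA values; character columns, no factors
df <- data.frame(
  A = c("obj1", "obj1", "obj2", "obj2", "obj2", "obj2", "obj3", "obj4"),
  B = c("1", "1", "1", "1", "3", "3", "2", "2"),
  C = c("2", "2", NA, "2", "2", "2", "4", NA),
  D = c(NA, "3", "3", "3", "3", NA, "6", NA),
  stringsAsFactors = FALSE
)

# TRUE for every row whose (A, B) pair occurs more than once.
# duplicated() alone flags only the later occurrences, so the
# fromLast = TRUE pass is needed to flag the first occurrence too.
key <- df[1:2]
has_twin <- duplicated(key) | duplicated(key, fromLast = TRUE)

# TRUE for every row that contains at least one NA
has_na <- rowSums(is.na(df)) > 0

# Keep everything except rows that have both properties
result <- df[!(has_twin & has_na), ]
print(result)

# If the data holds the string "NA" instead of real NA values,
# one option is to convert the strings before subsetting.
df_chr <- df
df_chr[is.na(df_chr)] <- "NA"   # simulate the question's data
df_chr[df_chr == "NA"] <- NA    # turn the strings back into real NA
```

Splitting the mask this way also makes it easy to inspect has_twin and has_na on your own data before dropping anything.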

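    【Comments】:

    • Thanks for your answer. I don't yet understand what it does, and I will have to see how to adapt it to my actual data frame.
    • I will add a more thorough explanation.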
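On adapting this to a data frame with more columns: the one-liner above only compares the first two columns, so it removes any row with an NA that shares A and B with another row, even if their remaining columns disagree. The following is a rough sketch of a stricter check, not the answerer's method. It drops a row only when some other row matches it on every column where it is not NA and also fills in at least one of its NA columns. The function name drop_covered_rows is hypothetical, and the sketch assumes all columns are character and contain real NA values.

```r
# Hypothetical helper: remove rows that are "covered" by a more complete row.
# Assumes character columns with real NA values. Cost is quadratic in nrow(df).
drop_covered_rows <- function(df) {
  m <- as.matrix(df)        # compare all cells as character strings
  na <- is.na(m)
  n <- nrow(m)
  redundant <- logical(n)

  # Only rows that contain an NA can be redundant
  for (i in which(rowSums(na) > 0)) {
    known <- !na[i, ]       # columns where row i has a value
    for (j in seq_len(n)[-i]) {
      # Row j must have the same value in every column row i knows ...
      agrees <- all(!na[j, known] & m[j, known] == m[i, known])
      # ... and must supply a value in at least one column where row i is NA
      adds_info <- any(!na[j, !known])
      if (agrees && adds_info) {
        redundant[i] <- TRUE
        break
      }
    }
  }
  df[!redundant, , drop = FALSE]
}

df <- data.frame(
  A = c("obj1", "obj1", "obj2", "obj2", "obj2", "obj2", "obj3", "obj4"),
  B = c("1", "1", "1", "1", "3", "3", "2", "2"),
  C = c("2", "2", NA, "2", "2", "2", "4", NA),
  D = c(NA, "3", "3", "3", "3", NA, "6", NA),
  stringsAsFactors = FALSE
)
print(drop_covered_rows(df))

# Two rows share A and B but disagree on C, so neither covers the other
extra <- data.frame(A = c("obj5", "obj5"), B = c("1", "1"),
                    C = c("2", "9"), D = c(NA, "3"),
                    stringsAsFactors = FALSE)
print(drop_covered_rows(extra))
```

On the sample data this keeps the same rows as the one-liner. The nested loop is much slower than the vectorized subset, so it is only worth considering when the key-column shortcut would delete rows that actually carry distinct information.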