【Question title】: Removing columns from a data frame with repeated values
【Posted】: 2019-12-30 16:56:23
【Question】:

I have the following data frame, which contains character strings, numbers, and NAs:

df <- data.frame(a=c("notfound","NOT FOUND","NOT FOUND"), b=c(NA,"NOT FOUND","NOT FOUND"), c=c("not found",2,3), d=c("not   found","NOT FOUND","NOT FOUND"), e=c("234","NOT FOUND",NA))
          a         b         c           d         e
1  notfound      <NA> not found not   found       234
2 NOT FOUND NOT FOUND         2   NOT FOUND NOT FOUND
3 NOT FOUND NOT FOUND         3   NOT FOUND      <NA>

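I want to remove every column whose entries are all some variant of "not found", such as "notfound", "not found", "not   found", or "NOT FOUND". Basically, wherever tolower(gsub(" ","",df))=="notfound" holds. This operation does not seem to work on a data frame. Is there an alternative?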

The desired output is:

          c         e
1 not found       234
2         2 NOT FOUND
3         3      <NA>

【Question discussion】:

    Tags: r string dataframe


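    【Solution 1】:

    You can use grepl with a regular expression to find the strings that match, and keep only the columns in which at least one element does not match (grepl returns FALSE for it), that is, the columns whose match count is less than nrow(df). The pattern matches strings that start with "not" and end with "found", and ignore.case = TRUE makes grepl case-insensitive. Note that grepl returns FALSE for NA, so column b is kept at this stage.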

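    # Logical matrix: TRUE where a cell starts with "not" and ends with "found"
    is_nf <- 
      sapply(df, grepl, pattern = '(?=^not).*found$', 
             perl = TRUE, ignore.case = TRUE)
    
    # Keep the columns that contain at least one non-matching cell
    df[colSums(is_nf) < nrow(df)]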
    #           b         c         e
    # 1      <NA> not found       234
    # 2 NOT FOUND         2 NOT FOUND
    # 3 NOT FOUND         3      <NA>
    

    I'm guessing you also want to drop the columns whose only values other than "not found" are NA.

    is_na <- is.na(df)
    
    df[colSums(is_nf | is_na) < nrow(df)]
    #           c         e
    # 1 not found       234
    # 2         2 NOT FOUND
    # 3         3      <NA>
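
    The normalization idea from the question (strip the spaces, lower-case, compare with "notfound") fails when applied to the whole data frame, but it can be applied column by column. The following is only a sketch of that alternative, not part of the original answer: the helper all_not_found and its na_counts argument are hypothetical names introduced here, and stringsAsFactors = FALSE is an assumption so that the columns are plain character vectors.

```r
# Sample data from the question; stringsAsFactors = FALSE is an assumption
# made here so that every column is a plain character vector.
df <- data.frame(
  a = c("notfound", "NOT FOUND", "NOT FOUND"),
  b = c(NA, "NOT FOUND", "NOT FOUND"),
  c = c("not found", 2, 3),
  d = c("not   found", "NOT FOUND", "NOT FOUND"),
  e = c("234", "NOT FOUND", NA),
  stringsAsFactors = FALSE
)

# Hypothetical helper: TRUE when every entry of x normalizes to "notfound".
# With na_counts = TRUE, NA entries are treated as "not found" as well.
all_not_found <- function(x, na_counts = TRUE) {
  # Remove literal spaces and lower-case; NA stays NA
  norm <- tolower(gsub(" ", "", as.character(x), fixed = TRUE))
  hit <- !is.na(norm) & norm == "notfound"
  if (na_counts) hit <- hit | is.na(norm)
  all(hit)
}

# One logical per column: should this column be dropped?
to_drop <- vapply(df, all_not_found, logical(1))
result <- df[!to_drop]
```

    With the sample data this keeps columns c and e. Passing na_counts = FALSE through vapply keeps column b as well, which corresponds to the first result above. Only literal spaces are stripped here, so other whitespace such as tabs would need a different pattern.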
    

    【Discussion】:
