【Question title】: Remove duplicate rows which have the same values as the column header
【Posted】: 2019-07-17 18:44:30
【Question】:

My data looks something like this:

+--------+--------+--------+
| region |  name  | salary |
+--------+--------+--------+
| west   | raj    | 100    |
| north  | simran | 150    |
| region | name   | salary |
| east   | prem   | 250    |
| region | name   | salary |
| south  | preeti | 200    |
+--------+--------+--------+

The column header names are repeated in rows 3 and 5. How can I use R to remove rows 3 and 5 while keeping the column header, so that my output looks like this:

+--------+--------+--------+
| region |  name  | salary |
+--------+--------+--------+
| west   | raj    |    100 |
| north  | simran |    150 |
| east   | prem   |    250 |
| south  | preeti |    200 |
+--------+--------+--------+

Assume my original data has too many rows, so I don't want to simply pick out the row numbers and delete them with Data[-c(3, 5), ].

【Comments】:

  • Please provide a minimal working example. That said, your problem is simple: use grep or a similar function to identify any rows that match the colnames.

Tags: r duplicates rows columnheader


【Solution 1】:

Here is a simple solution:

x <- data.frame(x = c("a", "b", "c", "x"), z = c("a", "b", "c", "z"))
## Identify rows whose values equal the column names, position by position
## (works for any number of columns, not just the first two)
matched <- apply(x, 1, function(row) all(row == colnames(x)))

## Keep only the rows that did not match
x[!matched, ]
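
The same idea can be tried on the data from the question. The following is only a sketch: `drop_header_rows` is a hypothetical helper name, and the data frame is assumed to have been read with every column as character, which is what usually happens when header lines are mixed into the data.

```r
# Hypothetical helper: drop rows in which every cell repeats its own column name.
drop_header_rows <- function(data) {
  # One logical vector per column: TRUE where the cell equals the column name.
  # %in% is used instead of == so that NA cells count as "no match".
  hits <- Map(function(col, nm) as.character(col) %in% nm, data, names(data))
  # A row is a repeated header only if all of its columns match
  is_header <- Reduce(`&`, hits)
  data[!is_header, , drop = FALSE]
}

# The question's data, assumed to be all character
df <- data.frame(
  region = c("west", "north", "region", "east", "region", "south"),
  name   = c("raj", "simran", "name", "prem", "name", "preeti"),
  salary = c("100", "150", "salary", "250", "salary", "200"),
  stringsAsFactors = FALSE
)

clean <- drop_header_rows(df)
clean
```

Because the comparison is done column by column rather than with apply(), the data frame is never converted to a matrix, which should matter mostly for wide or mixed-type data.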

【Comments】:

    【Solution 2】:

    Use str_detect() inside filter() to remove these rows.

    library(tidyverse)
    df <- tibble(
        region = c("west", "north", "region", "east","region","south"),
        name = c("raj", "simran","name","prem", "name","preeti"),
        salary = c("100","150","salary","250","salary","200")
    )
    
    # Drop every row whose salary contains a letter, i.e. the repeated header rows
    df_2 <- df %>%
        filter(!str_detect(salary, "[A-Za-z]"))
    
    df_2
    

    Or you can use base R:

    # grepl() returns a logical vector, so this is safe even when nothing matches
    df_2 <- df[!grepl("[A-Za-z]", df$salary), ]
    df_2
    

    【Comments】:

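      【Solution 3】:

      Assuming salary is a numeric field, you can simply do this:

      # assuming df is your data frame
      # as.numeric() turns non-numeric text such as "salary" into NA;
      # the coercion warning is expected, so it is suppressed here
      clean_df <- df[!is.na(suppressWarnings(as.numeric(df$salary))), ]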
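
After this filter the salary column is still character, whereas the desired output in the question shows numbers. A minimal sketch of doing both steps together, assuming the question's data was read with all columns as character (`salary_num` is a name introduced here for illustration):

```r
# Assumed input: the question's data, every column character
df <- data.frame(
  region = c("west", "north", "region", "east", "region", "south"),
  name   = c("raj", "simran", "name", "prem", "name", "preeti"),
  salary = c("100", "150", "salary", "250", "salary", "200"),
  stringsAsFactors = FALSE
)

# Coerce once; the repeated header rows become NA (warning suppressed on purpose)
salary_num <- suppressWarnings(as.numeric(df$salary))
keep <- !is.na(salary_num)

clean_df <- df[keep, ]
# Store the numeric values so later arithmetic works on the column
clean_df$salary <- salary_num[keep]
# Optional: renumber the rows after dropping some of them
rownames(clean_df) <- NULL

str(clean_df)
```

Note that rows with a genuinely missing or malformed salary would be dropped as well, so this approach fits only when every real record has a valid number.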
      

      【Comments】:
