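【Question Title】:How can I delete a row containing a specific string in R?
【Posted】:2019-04-25 22:15:42
【Question】:

I am new to R. I am working with a dataset in which the missing values were replaced with "?" before I received it. I am looking for a way to delete the rows that contain it. The "?" is not confined to a single row or column; it appears throughout the data.

I have already tried the approach from "Delete rows containing specific strings in R", but it did not work for me. My code so far is included below.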

library(randomForest)
heart <- read.csv(url('http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data'))
names <- names(heart)
nrow(heart)
ncol(heart)
names(heart)

colnames(heart)[colnames(heart)=="X11"] <- "survival"
colnames(heart)[colnames(heart)=="X0"] <- "alive"
colnames(heart)[colnames(heart)=="X71"] <- "attackAge"
colnames(heart)[colnames(heart)=="X0.1"] <- "pericardialEffusion"
colnames(heart)[colnames(heart)=="X0.260"] <- "fractionalShortening"
colnames(heart)[colnames(heart)=="X9"] <- "epss"
colnames(heart)[colnames(heart)=="X4.600"] <- "lvdd"
colnames(heart)[colnames(heart)=="X14"] <- "wallMotionScore"
colnames(heart)[colnames(heart)=="X1"] <- "wallMotionIndex"
colnames(heart)[colnames(heart)=="X1.1"] <- "mult"
colnames(heart)[colnames(heart)=="name"] <- "patientName"
colnames(heart)[colnames(heart)=="X1.2"] <- "group"
colnames(heart)[colnames(heart)=="X0.2"] <- "aliveAfterYear"
names(heart)

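【Comments】:

  • heart[rowSums(heart == "?") == 0, ]
  • Have a look at ?grepl
  • Are these values missing at random or missing for a reason? The OP should perhaps consider whether these rows should be kept or omitted. na.omit, as suggested in the answers below, is a good option but is not always appropriate for ML.

Tags: r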
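The first two comments can be tried on a small example. The sketch below uses a made-up data frame (`df` and its values are hypothetical, not the echocardiogram data) and assumes every column was read as character. If the data already contains NA, `rowSums()` would need `na.rm = TRUE`.

```r
# Toy data frame standing in for the real data (values are made up)
df <- data.frame(
  a = c("1", "?", "3", "4"),
  b = c("x", "y", "?", "z"),
  stringsAsFactors = FALSE
)

# df == "?" gives a logical matrix; rowSums() counts the "?" cells in each row
has_q <- rowSums(df == "?") > 0

# Keep only the rows with no "?" anywhere
clean <- df[!has_q, ]

# grepl() alternative for a single column; fixed = TRUE matches "?" literally
# instead of treating it as a regular-expression quantifier
clean_a <- df[!grepl("?", df$a, fixed = TRUE), ]
```

The `rowSums()` form checks all columns at once, while `grepl()` works on one character vector at a time.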


【Solution 1】:
library(randomForest)
heart <- read.csv(url('http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data'),na.strings = "?")
names <- names(heart)
nrow(heart)
ncol(heart)
names(heart)

colnames(heart)[colnames(heart)=="X11"] <- "survival"
colnames(heart)[colnames(heart)=="X0"] <- "alive"
colnames(heart)[colnames(heart)=="X71"] <- "attackAge"
colnames(heart)[colnames(heart)=="X0.1"] <- "pericardialEffusion"
colnames(heart)[colnames(heart)=="X0.260"] <- "fractionalShortening"
colnames(heart)[colnames(heart)=="X9"] <- "epss"
colnames(heart)[colnames(heart)=="X4.600"] <- "lvdd"
colnames(heart)[colnames(heart)=="X14"] <- "wallMotionScore"
colnames(heart)[colnames(heart)=="X1"] <- "wallMotionIndex"
colnames(heart)[colnames(heart)=="X1.1"] <- "mult"
colnames(heart)[colnames(heart)=="name"] <- "patientName"
colnames(heart)[colnames(heart)=="X1.2"] <- "group"
colnames(heart)[colnames(heart)=="X0.2"] <- "aliveAfterYear"
names(heart)


heart1 <- na.omit(heart)

When importing the file, you can set na.strings to "?" so that every "?" is read as NA. Calling na.omit afterwards removes all rows that contain an NA.

【Comments】:
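A minimal sketch of this idea on inline text, so it runs without a download. The CSV content is made up, and `header = FALSE` is an assumption added here: column names such as X11 in the code above suggest the file has no header line, so its first record was consumed as names.

```r
# Made-up CSV text standing in for the downloaded file (no header line)
csv_text <- "11,0,71,?\n19,0,72,0.38\n?,?,55,0.26\n57,0,60,0.253"

# header = FALSE keeps the first record as data;
# na.strings = "?" turns every "?" field into NA while parsing
df <- read.csv(text = csv_text, header = FALSE, na.strings = "?")

# Columns come out numeric because "?" is never stored as text
str(df)

# Drop every row that has at least one NA
clean <- na.omit(df)
nrow(clean)
```

Handling "?" at read time also avoids a later type conversion, since the columns are never stored as character.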

    【Solution 2】:

    I think this does what you want.

    # Set stringsAsFactors = FALSE in read.csv so that text columns are read
    # as character vectors rather than factors (the default before R 4.0.0),
    # which makes the comparison with "?" straightforward
    heart <- read.csv(url('http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data'),stringsAsFactors = FALSE)
    
    # Simpler way to assign column names to the dataframe
    colnames(heart) <- c("survival", "alive", "attackAge", "pericardialEffusion", 
                         "fractionalShortening", "epss", "lvdd", "wallMotionScore", 
                         "wallMotionIndex", "mult", "patientName", 
                         "group", "aliveAfterYear")
    
    
    # You can traverse a dataframe as a matrix using the row and column index 
    # as coordinates 
    
    for(r in seq_len(nrow(heart))){
       for(c in seq_len(ncol(heart))){
          # For this particular cell you do a comparison 
          # substituting the ? with NA which is the default missing value
          # in R 
          heart[r,c] <- ifelse(heart[r,c]=="?",NA,heart[r,c])
       }
    }
    
    # omit the NA rows 
    heart <- na.omit(heart)
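
The cell-by-cell loop above can usually be replaced by one vectorized assignment. The sketch below is an alternative, not the answerer's code; `df` and its values are made up, and `type.convert()` is added on the assumption that the columns should become numeric again once the "?" strings are gone.

```r
# Toy data frame with "?" placeholders (values are made up)
df <- data.frame(
  age = c("71", "?", "55"),
  epss = c("9", "6", "?"),
  stringsAsFactors = FALSE
)

# Replace every "?" cell with NA in one step, without an explicit loop
df[df == "?"] <- NA

# The columns are still character; type.convert() re-guesses their types
df <- type.convert(df, as.is = TRUE)

# Drop the rows that contain an NA
clean <- na.omit(df)
```

On a large data frame this is typically much faster than the nested loop, because the comparison and the assignment each run once over the whole table.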
    

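    【Comments】:

      【Solution 3】:

      Some libraries let you read a csv file and specify which strings should be read as missing values. I use the readr library most often. After that you can use na.omit and similar functions.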

      library(readr)
      library(dplyr)
      
      heart  <- read_csv(
        'http://archive.ics.uci.edu/ml/machine-learning-databases/echocardiogram/echocardiogram.data',
        na=c("", "?")
      )
      
      
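      # Note: the "X..." keys below are the names that base read.csv() generates.
      # read_csv() does not add the "X" prefix, so check names(heart) and
      # adjust the keys if nothing gets renamed.
      colnames(heart) <- recode(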
        colnames(heart),
        "X11" = "survival",
        "X0" = "alive",
        "X71" = "attackAge",
        "X0.1" = "pericardialEffusion",
        "X0.260" = "fractionalShortening",
        "X9" = "epss",
        "X4.600" = "lvdd",
        "X14" = "wallMotionScore",
        "X1" = "wallMotionIndex",
        "X1.1" = "mult",
        "name" = "patientName",
        "X1.2" = "group",
        "X0.2" = "aliveAfterYear"
        )
      
      heart
      
      heart <- na.omit(heart)
      

      (You can also save some typing by using the recode function from the dplyr package, but your approach to renaming the columns works fine.)

      【Comments】:
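
One of the "similar functions" in base R is `complete.cases()`, which is useful when only some columns must be non-missing. The sketch below uses a made-up data frame. Restricting the check to `survival` is an assumption for illustration only, not a recommendation for this dataset.

```r
# Toy data frame with NA already in place (values are made up)
df <- data.frame(
  survival = c(11, NA, 19, 57),
  patientName = c("name", "name", NA, "name"),
  stringsAsFactors = FALSE
)

# complete.cases() returns TRUE for rows with no NA in the given columns
keep <- complete.cases(df[, "survival", drop = FALSE])

# Only rows missing 'survival' are dropped; a missing patientName is tolerated
clean <- df[keep, ]
```

Unlike `na.omit(df)`, which drops a row if any column is NA, this keeps rows whose only gaps are in columns that do not matter for the analysis.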
