R regex to replace string [duplicate]: answers

【Question title】: R regex to replace string [duplicate]
【Posted】: 2023-04-09 04:05:01
【Question description】:

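I have a dataset of constructive comments and, in the early stages of analysis, would like to remove a list of common positive comments that is stored in a csv.

The original dataset looks similar to this: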

  df <-
  data.frame(
    "SuveyID" = 1:10,
    "NI" = c(
      "too many quizs",
      "very vague and conflicting instructions sometimes",
      "way too many emails hard to keep up",
      "technology issue",
      "all is good",
      "all perfect",
      "no improvements",
      "sometimes goes off topic",
      "connection issues of internet",
      "all is well"
    )
  )

The list I need to remove looks similar to this. Importantly, this list comes from a csv:

remove <-
  data.frame(
    "Strings.to.replace.with.NA" = c(
      "all is good", 
      "all is well", 
      "all perfect")
    )
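For illustration, a list like this could be loaded from a csv roughly as follows. This is only a sketch: the header text "Strings to replace with NA" and the inline `csv_text` are assumptions standing in for the real file, and `to_remove` is a hypothetical name.

```r
# Hypothetical csv contents; a real file would be read with read.csv("<your file>.csv")
csv_text <- "Strings to replace with NA\nall is good\nall is well\nall perfect"

# read.csv makes the header a syntactic name, so spaces become dots
remove <- read.csv(text = csv_text, stringsAsFactors = FALSE)
print(names(remove))

# Pull the single column out as a plain character vector for later matching
to_remove <- remove[[1]]
print(to_remove)
```

With `stringsAsFactors = FALSE` the column stays a character vector, which keeps later comparisons and `paste()` calls straightforward.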

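The strings in the remove dataset appear in the NI column of df, and I would like to replace them with NA.

The problem I seem to have is collapsing the csv records into one pattern with "|". I can't seem to get it to work. I have tried multiple versions of str_replace_all, str_replace and stri_detect_regex, but I never get the pattern collapsed with "|" correctly.

As always, your help is greatly appreciated.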

【Comments】:

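Do you just need df$NI[df$NI %in% remove$Strings.to.replace.with.NA] <- NA? See the R demo. See this answer.
Yes, this looks like it works great!!! Gosh, that was fast, thank you very much @WiktorStribiżew
My solution and akrun's cannot both work for you; only one of them actually does. Please let me know the answer to the question above.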
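The suggestion in the first comment can be sketched as below. It assumes that an entry in the removal list has to equal the whole comment, so no regex is involved. The data is a shortened copy of the question's example.

```r
# Shortened copy of the question's data
# (assumption: only exact, whole-string matches should become NA)
df <- data.frame(
  SuveyID = 1:4,
  NI = c("too many quizs", "all is good", "all perfect", "all is well"),
  stringsAsFactors = FALSE
)
remove <- data.frame(
  Strings.to.replace.with.NA = c("all is good", "all is well", "all perfect"),
  stringsAsFactors = FALSE
)

# %in% compares whole strings, so only exact matches are overwritten with NA
df$NI[df$NI %in% remove$Strings.to.replace.with.NA] <- NA
print(df$NI)
```

Because `%in%` does no pattern matching, a comment that merely contains one of the phrases is left untouched.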

Tags: r regex string replace stringr

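【Solution 1】:

We can use paste with collapse="|" to join the elements of 'remove' into a single string and use that as the pattern in gsub (base R). Note that this replaces the matched phrases with an empty string rather than NA, as the output shows.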

df$NI <- gsub(paste0("\\b(", paste(remove[[1]], collapse="|"), ")\\b"), "", df$NI)
df$NI
#[1] "too many quizs"                                    "very vague and conflicting instructions sometimes"
#[3] "way too many emails hard to keep up"               "technology issue"                                 
#[5] ""                                                  ""                                                 
#[7] "no improvements"                                   "sometimes goes off topic"                         
#[9] "connection issues of internet"                     ""
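A minimal sketch of what the collapsed pattern looks like, and of two possible ways to end up with NA instead of an empty string. The vectors `NI` and `to_remove` are hypothetical stand-ins for `df$NI` and `remove[[1]]`, and the comment "it is not all perfect" is made up to show how the two options differ.

```r
# Hypothetical stand-ins for df$NI and remove[[1]]
NI <- c("too many quizs", "all is good", "it is not all perfect", "all is well")
to_remove <- c("all is good", "all is well", "all perfect")

# Join the alternatives with "|" and wrap them in word boundaries, as in the answer
pattern <- paste0("\\b(", paste(to_remove, collapse = "|"), ")\\b")
print(pattern)

# Option 1: strip the phrases, then turn strings that are now blank into NA
stripped <- gsub(pattern, "", NI)
stripped[!nzchar(trimws(stripped))] <- NA
print(stripped)

# Option 2: anchor with ^ and $ so only whole-comment matches become NA
anchored <- paste0("^(", paste(to_remove, collapse = "|"), ")$")
whole <- NI
whole[grepl(anchored, whole)] <- NA
print(whole)
```

Option 1 also trims the phrase out of longer comments, while option 2 leaves partially matching comments as they are. Which behaviour is wanted depends on the data.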

Or use str_remove_all and str_c from stringr:

library(stringr)
str_remove_all(df$NI, str_c("\\b(", str_c(remove[[1]], collapse="|"), ")\\b"))

【Discussion】:

Thank you very much @akrun, I really appreciate these helpful solutions.