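[Question title]: Select second word in between known phrases - R regex
[Posted]: 2021-02-16 16:55:21
[Question description]:

I want to select the text between known phrases, but exclude the first word, using R and regular expressions. The format is as follows: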

"known phrase + unknown_word + target phrase + known_word + bla bla"

For example:

Tesco Plc sells coffee beans today in stores over the uk

Known phrase = "Tesco Plc"
Unknown word = "sells"
Target phrase = "coffee beans"
known word = "today"
bla bla (unrelated text) = "in stores over the uk"

Initial attempt

library(stringr)

text = "Tesco Plc sells coffee beans today in stores over the uk"
known_phrase = "Tesco Plc"
known_word = "today"

# code
str_extract(text, paste0("(?<=", known_phrase, ").*(?=", known_word, ")"))

This selects both the unknown_word and the target phrase, but I only want the target phrase.

[Comments]:

  • stringr::str_match(x, "Tesco\\s+Plc\\s+\\w+\\s+(.*?)\\s+today")[,2]? See regex101.com/r/oztc5i/1. str_extract is less flexible when your context is not static.
  • It works better combined with str_remove, thanks a lot!!
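
The second comment does not show how str_remove was combined with the original attempt. One possible reading, offered only as a sketch: keep the lookaround-based str_extract from the question, trim the surrounding whitespace, then remove the leading word. The variable names `between` and `target` are illustrative, and the pipeline is an assumption about what the commenter meant, not their actual code.

```r
library(stringr)

text <- "Tesco Plc sells coffee beans today in stores over the uk"
known_phrase <- "Tesco Plc"
known_word <- "today"

# Step 1: the original attempt, which captures everything between the two anchors
between <- str_extract(text, paste0("(?<=", known_phrase, ").*(?=", known_word, ")"))

# Step 2: trim surrounding whitespace, then drop the first word and the whitespace after it
target <- str_remove(str_trim(between), "^\\w+\\s+")
target
```

For the sample sentence, `target` is "coffee beans". This approach assumes the unknown part is exactly one run of word characters; a hyphenated or punctuated first word would need a different pattern in step 2.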

Tags: r regex stringr


[Solution 1]:

You can use

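# x is the input string (named text in the question)
x = "Tesco Plc sells coffee beans today in stores over the uk"
stringr::str_match(x, "Tesco\\s+Plc\\s+\\w+\\s+(.*?)\\s+today")[,2]
## OR
Known_phrase = "Tesco Plc"
known_word = "today"
stringr::str_match(x, paste0(Known_phrase, "\\s+\\w+\\s+(.*?)\\s+", known_word))[,2]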
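
To make the `[,2]` indexing less opaque: str_match returns a character matrix with one row per input string, where the first column holds the full match and the following columns hold the capture groups. The sketch below runs the same pattern over a small vector; the second input is a made-up variation without "today", included only to show the NA behaviour.

```r
library(stringr)

# Hypothetical inputs: the first is the question's sentence, the second lacks "today"
x <- c(
  "Tesco Plc sells coffee beans today in stores over the uk",
  "Tesco Plc sells coffee beans in stores over the uk"
)

pattern <- "Tesco\\s+Plc\\s+\\w+\\s+(.*?)\\s+today"

m <- str_match(x, pattern)
m[, 1]  # full match, or NA when the pattern does not match
m[, 2]  # first capture group (the target phrase), or NA
```

Because the result is vectorised, the same call can be applied to a whole column of sentences, and rows that do not fit the format come back as NA rather than raising an error.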

Since your variables are dynamic, you may need an escaping function:

regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}
Known_phrase = "Tesco Plc"
known_word = "today"
stringr::str_match(x, paste0(regex.escape(Known_phrase), "\\s+\\w+\\s+(.*?)\\s+", regex.escape(known_word)))[,2]
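
A short sketch of why the escaping step can matter. The company name "Coffee+ Ltd" is a hypothetical input chosen because it contains the regex metacharacter "+"; it does not come from the question. Without escaping, "e+" is read as "one or more e", so the literal plus sign in the text prevents a match.

```r
library(stringr)

# Escaping helper from the answer, repeated so this sketch runs on its own
regex.escape <- function(string) {
  gsub("([][{}()+*^$|\\\\?.])", "\\\\\\1", string)
}

# Hypothetical input whose known phrase contains "+"
x <- "Coffee+ Ltd sells coffee beans today in stores over the uk"
known_phrase <- "Coffee+ Ltd"
known_word <- "today"

# Unescaped: the "+" acts as a quantifier, so the pattern fails to match
unescaped <- str_match(x, paste0(known_phrase, "\\s+\\w+\\s+(.*?)\\s+", known_word))[, 2]

# Escaped: the "+" is matched literally
escaped <- str_match(
  x,
  paste0(regex.escape(known_phrase), "\\s+\\w+\\s+(.*?)\\s+", regex.escape(known_word))
)[, 2]

unescaped
escaped
```

Here `unescaped` is NA while `escaped` is "coffee beans".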

