【Question title】: Using regular expressions to combine words in R
【Posted】: 2016-10-31 14:07:13
【Question description】:

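I have unstructured text, and I want to combine certain words so that concepts are preserved for my text mining task. For example, in the strings below I want to change "High pressure" to "High_pressure", "not working" to "not_working", and "No air" to "No_air".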

Sample text

c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")

Word list

c('low', 'high', 'no', 'not')

Desired output

# [1] " High_pressure was the main problem in the machine"
# [2] "the system is not_working right now"               
# [3] "No_air in the system"    

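【Comments】:

  • Do you have an exhaustive list of all the prefixes (high, no, not, etc.)?
  • (low, high, no, not)
  • You should use bigrams and trigrams instead of combining words.
  • @vagabond I'm a big fan of tri-gram analysis, but it may not fit the use case here. Specifically, the OP seems to be trying to capture concept pairings that matter for their particular analysis, not every word combination.

Tags: r regex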


【Solution 1】:

First, save the text input and the list of modifier words to be joined:

textIn <- 
  c(" High pressure was the main problem in the machine","the system is not working right now","No air in the system")

prefix <- c("high", "low", "no", "not")

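Then build a regular expression that captures any of these words followed by a space. Note that I use \b (a word boundary) so that we don't accidentally match them as the ending of a longer word, e.g. the "low" in "slow":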

gsub(
  paste0("\\b(", paste(prefix, collapse = "|"),") ")
  , "\\1_", textIn, ignore.case = TRUE
)

Returns

[1] " High_pressure was the main problem in the machine"
[2] "the system is not_working right now"          
[3] "No_air in the system" 
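
To make the role of \b more concrete, here is a minimal sketch that wraps the same gsub() call in a function and compares it with a pattern that lacks the word boundary. The helper name join_prefix and the test sentence are illustrative assumptions, not part of the original answer.

```r
# Hypothetical helper (not from the original answer): joins each prefix word
# to the word that follows it with an underscore.
join_prefix <- function(text, prefix) {
  # Builds "\\b(high|low|no|not) ": word boundary, one prefix, then a space
  pattern <- paste0("\\b(", paste(prefix, collapse = "|"), ") ")
  gsub(pattern, "\\1_", text, ignore.case = TRUE)
}

prefix <- c("high", "low", "no", "not")

# "slow" ends in "low", but \b prevents a match inside the word
join_prefix("a slow leak with low flow", prefix)
# [1] "a slow leak with low_flow"

# Without \b, the tail of "slow" is matched as well
no_boundary <- paste0("(", paste(prefix, collapse = "|"), ") ")
gsub(no_boundary, "\\1_", "a slow leak with low flow", ignore.case = TRUE)
# [1] "a slow_leak with low_flow"
```

Because the pattern is assembled with paste(), the prefix list can be extended without touching the regular expression itself. This assumes the prefixes are plain words; entries containing regex metacharacters would need escaping first.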
