Find two words before match: answers

【Question title】: Find two words before match
【Posted】: 2020-03-23 14:17:04
【Question】:

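I am trying to split a string using a regular expression. My regex should match the two words before a colon, with the end goal of splitting something like this: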

"Joe Biden: We need to reform healthcare. It is important. Bernie Sanders: I agree. It is important."

into a character vector like this:

"Joe Biden" "We need to reform healthcare. It is important." "Bernie Sanders" "I agree. It is important"

The closest I have gotten is:

foo <- strsplit(my_string, split="(\\S+)\\s*(\\S+)\\s*:",perl=TRUE)

But the result drops the text matched by the regex, so the speaker names are lost. I tried using a lookbehind like this:

foo <- strsplit(my_string, split="(?<=.)(?=(\\S+)\\s*(\\S+)\\s*:)",perl=TRUE)

But it throws an error:

  PCRE pattern compilation error
    'lookbehind assertion is not fixed length'
    at ')'

Is there an alternative regex that accomplishes this, or should I be using a different function?

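【Comments on the question】:

How is what you are trying to achieve different from your last question?
The last answer was very helpful, but only when a statement ends with punctuation. In some statements the moderator interrupts the speaker mid-sentence, so there is no punctuation. Matching the two words before the colon (the speaker's name) would catch all cases.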

Tags: r regex stringr

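【Solution 1】:

This splits on either of two things, separated by the or operator |: 1) a space that is followed by two words separated by a space and then a colon; 2) a colon followed by a space. The name in the first alternative sits inside a lookahead, (?=...), so it is checked but not consumed, and only the space in front of it is removed by the split.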

my_string <- "Joe Biden: We need to reform healthcare. It is important. Bernie Sanders: I agree. It is important."
strsplit(my_string, split="( (?=\\w+ \\w+:)|: )",perl=TRUE)
[[1]]
[1] "Joe Biden"            "We need to reform healthcare. It is important."
[3] "Bernie Sanders"       "I agree. It is important."
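
Since the pieces come back alternating between speaker and statement, one possible follow-up is to pair them up. This is only a sketch, assuming every speaker label has exactly two words and the vector therefore has an even length; the names `pieces` and `transcript` are illustrative and not part of the original answer.

```r
my_string <- "Joe Biden: We need to reform healthcare. It is important. Bernie Sanders: I agree. It is important."

# Same split as above; [[1]] extracts the character vector from the list
pieces <- strsplit(my_string, split = "( (?=\\w+ \\w+:)|: )", perl = TRUE)[[1]]

# Assumption: odd positions hold speakers, even positions hold statements
transcript <- data.frame(
  speaker   = pieces[c(TRUE, FALSE)],
  statement = pieces[c(FALSE, TRUE)],
  stringsAsFactors = FALSE
)
print(transcript)
```

The logical index `c(TRUE, FALSE)` is recycled along the vector, which is a compact way to pick every other element.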

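If a speaker's name is only one word, you will run into trouble with this pattern. That is why my answer to your previous question looked for punctuation instead.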
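
To make the caveat concrete, here is a small sketch using a hypothetical one-word speaker label, `Moderator`, which does not appear in the original example. The lookahead needs two words before the colon, so the space before `Moderator:` is not treated as a split point and the label stays attached to the previous statement.

```r
# Hypothetical input with a one-word speaker label (not from the original post)
one_word <- "Joe Biden: I agree. Moderator: Thank you."

pieces_one_word <- strsplit(one_word, split = "( (?=\\w+ \\w+:)|: )", perl = TRUE)[[1]]
print(pieces_one_word)
# [1] "Joe Biden"          "I agree. Moderator" "Thank you."
```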

【Comments】: