【Posted】: 2021-04-10 11:14:22
【Problem description】:
I want to compute the collocates of the lemma GO, including all of its forms, e.g. go, goes, gone, etc.:
go <- c("go after it", "here we go", "he went bust", "go get it go", "i 'm gon na go", "she 's going berserk")
The forms of the lemma are stored in this vector:
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
This vector is then collapsed into a single alternation pattern:
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")
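As a quick sanity check, the assembled pattern can be printed and tried on a couple of strings. This is only an illustrative sketch: the string "gopher" is not part of the original data and is an assumption added here to show the effect of the `\\b` word boundaries.

```r
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

# Print the regex as the regex engine will see it (single backslashes)
writeLines(pattern_GO)
# \b(go|goes|going|gone|went|gon na)\b

# "went" is a form of GO; "gopher" (hypothetical example) only starts with "go"
hits <- grepl(pattern_GO, c("he went bust", "gopher"), perl = TRUE)
hits
# [1]  TRUE FALSE
```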
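However, when I use this pattern with str_extract_all to extract the collocate immediately to the left of GO, the extraction returns nothing for occurrences of GO that have no word to their left. This affects strings where GO is the first word, including those where GO occurs again later in the string: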
library(stringr)
str_extract_all(go, paste0("'?\\b[a-z']+\\b(?=\\s?", pattern_GO, ")"))
[[1]]
character(0)
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
The expected result is this:
[[1]]
[1] NA
[[2]]
[1] "we"
[[3]]
[1] "he"
[[4]]
[1] NA "it"
[[5]]
[1] "'m" "na"
[[6]]
[1] "'s"
How can the extraction be modified so that it also returns NA where there is no left-hand collocate?
【Discussion】:
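One possible approach, sketched here rather than offered as the definitive answer: instead of extracting the collocates directly, first locate every occurrence of GO with `str_locate_all`, then look at the text in front of each occurrence and pull out its last word with `str_extract`, which returns NA when nothing matches. The name `left_of_GO` is a hypothetical helper introduced for illustration, and the sketch assumes every string contains at least one form of GO, as in the example data.

```r
library(stringr)

go <- c("go after it", "here we go", "he went bust", "go get it go",
        "i 'm gon na go", "she 's going berserk")
lemma_GO <- c("go", "goes", "going", "gone", "went", "gon na")
pattern_GO <- paste0("\\b(", paste0(lemma_GO, collapse = "|"), ")\\b")

# Hypothetical helper: one result per occurrence of GO in string s
left_of_GO <- function(s) {
  # Start position of every GO match in s
  starts <- str_locate_all(s, pattern_GO)[[1]][, "start"]
  # Text preceding each match ("" when GO is the first word)
  prefixes <- str_sub(s, 1, starts - 1)
  # Last word of each prefix; str_extract yields NA if there is none
  str_extract(prefixes, "'?\\b[a-z']+\\b(?=\\s?$)")
}

left_collocates <- lapply(go, left_of_GO)
left_collocates[[1]]
# [1] NA
left_collocates[[4]]
# [1] NA   "it"
```

Because `str_extract` (unlike `str_extract_all`) returns exactly one value per input, each occurrence of GO contributes either its left-hand collocate or NA, so the result lines up with the occurrences in each string.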