【Posted】:2018-07-11 10:53:34
【Question】:
I have a data frame df:
df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L),
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6",
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor")), .Names = c("page","text"), row.names = c(NA, -4L), class = "data.frame")
I also have a list of words:
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement", "Volkswagen", "company", "Toyota","exchange", "brand")
To find which words from wordlist occur in the text column, I run grepl for each word, unlist the results, and paste the matching words together:
library(data.table)
setDT(df)[, match := paste(wordlist[unlist(lapply(wordlist, function(x) grepl(x, text, ignore.case = T)))], collapse = ","), by = 1:nrow(df)]
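The problem is that I only want exact, whole-word matches between wordlist and the text column. grepl also reports partial matches: for example, AudiA6 in the text matches the word Audi from wordlist. In addition, my data frame is very large, and the grepl approach takes a long time to run. If possible, please suggest another way to do this. I want a result like this: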
df <- structure(list(page = c(12, 6, 9, 65),
text = structure(c(4L,2L, 1L, 3L),
.Label = c("I just bought a brand new AudiA6", "Get 2 years engine replacement warranty on BMW X6",
"Volkswagen is the parent company of BMW", "ToyotaCorolla is offering new car exchange offers"),
class = "factor"), match = c("exchange", "BMW,engine,replacement",
"brand", "BMW,Volkswagen,company")), row.names = c(NA, -4L),
class = c("data.table", "data.frame"))
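One way the desired output above could be produced is sketched below. This is only an illustration, not a tested solution for large data: it assumes that "exact word" means a whole token delimited by non-alphanumeric characters, that matching stays case-insensitive as in the grepl call, and that the text column can be treated as character. The helper name match_whole_words is hypothetical. Each text is tokenized once and compared against wordlist with %in%, which avoids one regex scan per word.

```r
# Sample data from the question (text stored as character rather than factor).
df <- data.frame(
  page = c(12, 6, 9, 65),
  text = c("ToyotaCorolla is offering new car exchange offers",
           "Get 2 years engine replacement warranty on BMW X6",
           "I just bought a brand new AudiA6",
           "Volkswagen is the parent company of BMW"),
  stringsAsFactors = FALSE
)
wordlist <- c("Audi", "BMW", "extended", "engine", "replacement",
              "Volkswagen", "company", "Toyota", "exchange", "brand")

# Hypothetical helper: for each element of `text`, return the entries of
# `words` that occur as whole tokens, joined by commas (case-insensitive).
match_whole_words <- function(text, words) {
  # Split each text into lower-case tokens on runs of non-alphanumeric
  # characters, so "AudiA6" stays one token and cannot match "Audi".
  tokens <- strsplit(tolower(as.character(text)), "[^[:alnum:]]+")
  lower_words <- tolower(words)
  vapply(tokens, function(tok) {
    # Keep the original spelling and order of `words` in the result.
    paste(words[lower_words %in% tok], collapse = ",")
  }, character(1))
}

df$match <- match_whole_words(df$text, wordlist)
print(df$match)
```

With data.table, the same helper could be applied by reference, e.g. setDT(df)[, match := match_whole_words(text, wordlist)]. If regular expressions are preferred, wrapping each word in \\b word boundaries in the grepl pattern is another option for excluding partial matches such as AudiA6.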
【Discussion】:
Tags: r data.table string-matching grepl