Remove words per year in a corpus: answers

[Question title]: Remove words per year in a corpus
[Posted]: 2020-06-18 08:33:19
[Question]:

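I am working with a corpus of speeches that spans several years (aggregated to the person-year level). I want to remove words that occur fewer than 4 times within a year. The removal should not apply to the whole corpus, only to the year in which the word falls below the threshold.

I have tried the following: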

DT$text <- ifelse(grepl("1998", DT$session), mgsub(DT$text, words_remove_1998, ""), DT$text)

and 

DT$text <- ifelse(grepl("1998", DT$session), str_remove_all(DT$text, words_remove_1998), DT$text)

and 

DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, words_remove_1998), DT$text)

and

DT$text <- ifelse(grepl("1998", DT$session), drop_element(DT$text, words_remove_1998), DT$text)

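However, none of them seems to work. mgsub simply replaces the entire 1998 speech with "", while the other options throw error messages. removeWords fails because my words_remove_1998 vector is too large. I tried splitting the word vector into chunks and looping over them (see the code below), but R does not seem to like that (it runs forever).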

group <- 100
n <- length(words_remove_1998)
r <- rep(1:ceiling(n/group),each=group)[1:n]
d <- split(words_remove_1998,r)

for (i in 1:length(d)) {
  DT$text <- ifelse(grepl("1998", DT$session), removeWords(DT$text, c(paste(d[[i]]))), DT$text)
}

Any suggestions on how to solve this?

Thanks for your help!

Reproducible example:

text <- rbind(c("i like ice cream"), c("banana ice cream is my favourite"), c("ice cream is not my thing"))
name <- rbind(c("Arnold Ford"), c("Arnold Ford"), c("Leslie King"))
session <- rbind("1998", "1999", "1998")

DT <- cbind(name, session, text)

words_remove_1998 <- c("like", "ice", "cream")

newtext <- rbind(c("i"), c("banana ice cream is my favourite"), c("is not my thing"))
DT <- cbind(DT, newtext)

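The real vector of words I want to remove contains 30k elements.

[Comments]:

What are your example strings? What is your expected result?
I have now updated the question with a reproducible example. I hope this helps.
Aha, so if Col2 is 1998, all words in words_remove_1998 should be removed from Col3 and the result stored in Col4. Is that right?
Isn't your input more like this?
Yes! Or it could also overwrite Col3.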

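Tags: r regex if-statement text nlp

[Solution 1]:

I ended up not using any packages for the removal itself, because none of them could handle the size of the data. Instead I did it the simple, old-fashioned way: split the text into rows (one word per row), count the occurrences of each word within each session (year), and then drop the rows whose count falls below the threshold (the same cut-off I used to build the vector of words to remove). Finally, I aggregated the data back to the original level (person-year).

This only works because I am removing words based on a threshold. If I had an arbitrary list of words to remove that could not be expressed this way, I would have more trouble.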

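library(tidyr)   # separate_rows()
library(dplyr)   # group_by(), mutate(), n(), left_join()

# One row per word; DT must be a data frame that has a person-year id column x
DT_separate <- separate_rows(DT, text)

# Count how often each word occurs within each session (year)
df <- DT_separate %>%
  dplyr::group_by(session, text) %>%
  dplyr::mutate(count = dplyr::n()) %>%
  dplyr::ungroup()

# Keep only words above the threshold; set this to the same cut-off
# that was used to build the vector of words to remove
df <- df[df$count > 5, ]

# Paste the remaining words back together per person-year
df <- aggregate(
  text ~ x,      # where x is a person-year id
  data = df,
  FUN = paste, collapse = ' '
)

names(df)[names(df) == 'text'] <- 'text2'

# Join the filtered text back onto the original data and overwrite text
DT <- left_join(DT, df, by = "x")

DT$text <- DT$text2
DT <- DT[, !(colnames(DT) %in% c("text2"))]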
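
To make the steps easier to try out, here is a minimal base R sketch of the same idea on toy data. It is not the exact code used on the real corpus: the `id` column, the `min_count` value of 2 and the whitespace-only tokenisation are assumptions chosen so that the small example produces a visible result.

```r
# Toy person-year data (hypothetical; the column names are assumptions).
DT <- data.frame(
  id      = c("Ford_1998", "Ford_1999", "King_1998"),
  session = c("1998", "1999", "1998"),
  text    = c("i like ice cream",
              "banana ice cream is my favourite",
              "ice cream is not my thing"),
  stringsAsFactors = FALSE
)
min_count <- 2  # assumed threshold: keep words seen at least twice in a year

# One row per word, remembering which person-year each word came from.
tokens <- strsplit(DT$text, "\\s+")
long <- data.frame(
  id      = rep(DT$id, lengths(tokens)),
  session = rep(DT$session, lengths(tokens)),
  word    = unlist(tokens),
  stringsAsFactors = FALSE
)

# Occurrences of each word within its year; ave() returns one count per row.
long$count <- ave(seq_along(long$word), long$session, long$word, FUN = length)

# Drop rare words, then paste the survivors back together per person-year.
kept <- long[long$count >= min_count, ]
rebuilt <- vapply(split(kept$word, kept$id), paste, character(1), collapse = " ")

# Person-years whose words were all dropped get an empty string.
DT$text <- ifelse(DT$id %in% names(rebuilt), rebuilt[DT$id], "")
```

With this toy input only "ice" and "cream" reach the threshold in 1998, so both 1998 rows become "ice cream" and the 1999 row becomes empty. Person-years that lose every word are kept as empty strings here rather than dropped, which is a design choice you may want to change.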

[Comments]:
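
For the other case mentioned in the answer, a fixed list of words to remove for one year, one possible sketch is to filter tokens with `%in%` instead of building a large regular expression. The helper `drop_words` is hypothetical, the question's matrix is rebuilt as a data frame, and words are assumed to be separated by whitespace with no attached punctuation. This has not been timed on a 30k-word vector.

```r
# Toy data from the question, as a data frame (an assumption; the question
# builds a matrix with cbind()).
DT <- data.frame(
  name    = c("Arnold Ford", "Arnold Ford", "Leslie King"),
  session = c("1998", "1999", "1998"),
  text    = c("i like ice cream",
              "banana ice cream is my favourite",
              "ice cream is not my thing"),
  stringsAsFactors = FALSE
)
words_remove_1998 <- c("like", "ice", "cream")

# Hypothetical helper: drop every whitespace-separated token found in `words`.
drop_words <- function(text, words) {
  tokens <- strsplit(text, "\\s+")
  vapply(tokens,
         function(tok) paste(tok[!tok %in% words], collapse = " "),
         character(1))
}

# Only touch the rows that belong to the 1998 session.
is_1998 <- grepl("1998", DT$session)
DT$text[is_1998] <- drop_words(DT$text[is_1998], words_remove_1998)
```

On the toy data this reproduces the expected `newtext` column from the question: "i", the unchanged 1999 row, and "is not my thing". Assigning into `DT$text[is_1998]` avoids calling the removal function on rows from other years, which the `ifelse()` attempts in the question would do.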