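[Posted]: 2021-11-27 23:48:30
[Problem description]:
I have a quanteda tokens object that was created using the "window" option (see the code below). I would like to do this for a range of words, to inform the creation of a custom dictionary. How can I "de-tokenize" the result, that is, concatenate or recombine each tokenized "window" of text into a single string? Each string could be an item in a list or a row in a data.frame. I just need to be able to read each instance of the word or phrase (in this case "future") in its context.
Is there a command or some code that would let me "de-tokenize" this object?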
library(quanteda)
library(dplyr)
# Example data
d <- c("Thank you Mr. Speaker. Mr. Speaker I’m not sure how, but to the department of PWTTS, regarding the question I’d asked previously about the future of our water reservoir. I wonder if that was looked at since I ask that question to Ms. Thompson. Thank you", "Thank you Mr. Speaker. Now if that doctor would be located in that community how is the logistics or air travel going to be, moving between the communities in the future. Thank you")
# Corpus
c <- corpus(d)
# My tokens object consisting of 3-word window around instances of "future".
ttt <- tokens(c, remove_punct = TRUE, remove_numbers = FALSE) %>%
  tokens_keep(pattern = "future", window = 3)
[Discussion]:
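One possible approach, offered as a minimal sketch rather than a definitive answer: a tokens object can be converted to a plain list of character vectors with `as.list()`, and each vector can then be collapsed with `paste()`. The example texts below are shortened versions of the ones in the question, and the names `toks`, `windows`, and `windows_df` are made up for illustration. Note that if a document contains more than one match, all of its kept windows end up in the same string.

```r
library(quanteda)

# Shortened versions of the example texts from the question (assumption:
# trimmed for brevity; the full texts should behave the same way).
d <- c("regarding the question I asked previously about the future of our water reservoir.",
       "moving between the communities in the future. Thank you")

# Tokenize, then keep a 3-token window around each instance of "future".
toks <- tokens(corpus(d), remove_punct = TRUE)
toks <- tokens_keep(toks, pattern = "future", window = 3)

# Collapse each document's remaining tokens into one string.
# The result is a named character vector, one element per document.
windows <- vapply(as.list(toks), paste, character(1), collapse = " ")

# The same strings as rows of a data.frame.
windows_df <- data.frame(doc = names(windows), text = unname(windows))
```

If one row per match (rather than one string per document) is what is needed, quanteda's `kwic()` applied to the un-windowed tokens, e.g. `kwic(tokens(corpus(d), remove_punct = TRUE), pattern = "future", window = 3)`, returns the keyword with its left and right context and can be converted with `as.data.frame()`.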
Tags: r text tokenize corpus quanteda