【Posted】: 2017-11-17 17:51:22
【Question】:
I have a data frame D that contains document titles and text, as in the following example:
document content
Doc 1 "This is an example of a document"
Doc 2 "And another one"
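I need to tokenize each document with the tokenize function from the quanteda package, and then return the tokens listed against their original document title, as in this example: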
document content
Doc 1 "This"
Doc 1 "This is"
Doc 1 "This is an"
Doc 1 "This is an example"
This is my current process for getting a data frame of tokens from a list of documents:
require(textreadr)
D<-textreadr::read_dir("myDir")
D<-paste(D$content,collapse=" ")
strlist<-paste0(c(":","\\)",":","'",";","!","+","&","<",">","\\(","\\[","\\]","-","#",","),collapse = "|")
D<-gsub(strlist, "", D)
library(quanteda)
require(quanteda)
t<-tokenize(D, what = c("word","sentence", "character","fastestword", "fasterword"),
remove_numbers = FALSE, remove_punct = FALSE,
remove_symbols = FALSE, remove_separators = TRUE,
remove_twitter = FALSE, remove_hyphens = FALSE, remove_url = FALSE,
ngrams = 1:10, concatenator = " ", hash = TRUE,
verbose = quanteda_options("verbose"))
t<-unlist(t, use.names=FALSE)
t1<-data.frame(t)
However, I cannot find a simple way to keep the document names after tokenization and list the tokens accordingly. Can anyone help?
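One likely reason the names are lost is the paste(D$content, collapse = " ") step, which merges every document into a single string before tokenization. The following is only a minimal sketch of keeping one string per document, not a definitive fix. It assumes a newer quanteda release, where tokens() and tokens_ngrams() replace tokenize(), and it uses toy data and n = 1:3 instead of 1:10 to keep the result small.

```r
library(quanteda)

# Toy data mirroring the example in the question (assumed, not read from "myDir").
D <- data.frame(
  document = c("Doc 1", "Doc 2"),
  content  = c("This is an example of a document", "And another one"),
  stringsAsFactors = FALSE
)

# Keep one string per document (no paste(..., collapse = " ")) and carry
# the titles along as names, so quanteda uses them as document names.
txt <- setNames(D$content, D$document)

# tokens() is the successor of tokenize() in newer quanteda releases.
toks  <- tokens(txt, what = "word")
grams <- tokens_ngrams(toks, n = 1:3, concatenator = " ")

# Convert to a plain named list, then repeat each title once per token.
lst <- as.list(grams)
t1 <- data.frame(
  document = rep(names(lst), lengths(lst)),
  content  = unlist(lst, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Each row of t1 then pairs an n-gram with the title of the document it came from. Note that this yields every n-gram in a document, not only those starting at the first word.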
【Discussion】:
- You can do this with nested data.frames, using dplyr, tidyr, and purrr.
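The comment gives no code, so the following is one possible reading of it, offered as a sketch under assumptions. It tokenizes each row into a list-column with purrr::map() and then flattens it with tidyr::unnest(). The helper ngrams_of is a hypothetical name introduced here, the data are toy values, and tokens() / tokens_ngrams() assume a newer quanteda release than the one in the question.

```r
library(dplyr)
library(tidyr)
library(purrr)
library(quanteda)

# Toy data mirroring the example in the question (assumed).
D <- tibble(
  document = c("Doc 1", "Doc 2"),
  content  = c("This is an example of a document", "And another one")
)

# Hypothetical helper: n-grams of a single string as a character vector.
ngrams_of <- function(text, n = 1:3) {
  grams <- tokens_ngrams(tokens(text), n = n, concatenator = " ")
  as.list(grams)[[1]]
}

# Replace each text with a list-column of its n-grams (a nested data frame).
nested <- D %>%
  mutate(content = map(content, ngrams_of))

# Flatten: one row per n-gram, with the document title repeated alongside.
t1 <- nested %>%
  unnest(content)
```

Because the tokenization happens row by row, the document column is never separated from its text, so no names need to be reattached afterwards.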