【Question title】: Loop through a tm corpus without losing corpus structure
【Posted】: 2017-04-25 07:58:45
【Question】:

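I have a tm corpus of documents and a list of words. I want to run a for loop over the corpus so that the loop removes each word in the list from the corpus, one after another.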

Some reproducible data:

library(tm)
m <- cbind(c("Apple blue two","Pear yellow five","Banana yellow two"),
             c(1, 2, 3))
tm_corpus <- Corpus(VectorSource(m[,1]))
words <- as.list(c("Apple", "yellow", "two"))

tm_corpus is now a corpus object containing 3 documents:

<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 3

words is a list of 3 words:

[[1]]
[1] "Apple"

[[2]]
[1] "yellow"

[[3]]
[1] "two"

I tried three different loops. The first is:

tm_corpusClean <- tm_corpus
for (i in seq_along(tm_corpusClean)) {
  for (u in seq_along(words)) {
    tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords, words[[u]])
  }
}

This returns the following error, together with 7 warnings (numbered 1-7):

Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions
In addition: Warning messages:
1: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,                 
words[[u]]) :
  number of items to replace is not a multiple of replacement length
2: In tm_corpusClean[i] <- tm_map(tm_corpusClean[i], removeWords,         
words[[u]]) :
  number of items to replace is not a multiple of replacement length
[...]

The second is:

tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  for (u in seq_along(tm_corpusClean)) {
    tm_corpusClean[u] <- tm_map(tm_corpusClean[u], removeWords, words[[i]])
  }
}

This returns the error:

Error in x$dmeta[i, , drop = FALSE] : incorrect number of dimensions

The last loop is:

tm_corpusClean <- tm_corpus
for (i in seq_along(words)) {
  tm_corpusClean <- tm_map(tm_corpusClean, removeWords, words[[i]])
}

This does return an object named tm_corpusClean, but that object only returns the first document instead of the original three:

inspect(tm_corpusClean[[1]])

<<PlainTextDocument>>
Metadata:  7
Content:  chars: 6

 blue 

Where am I going wrong?

【Question discussion】:

    Tags: r for-loop tm


    【Solution 1】:

    Before moving on to sequential removal, test whether tm_map works on your example:

    obj1 <- tm_map(tm_corpus, removeWords, unlist(words))
    sapply(obj1, `[`, "content")
    
    $`1.content`
    [1] " blue "
    
    $`2.content`
    [1] "Pear  five"
    
    $`3.content`
    [1] "Banana  "
    

    Next, use lapply to remove one word at a time, i.e. "Apple", "yellow", "two":

    obj2 <- lapply(words, function(word) tm_map(tm_corpus, removeWords, word))
    sapply(obj2, function(x) sapply(x, `[`, "content"))
    
              [,1]                [,2]             [,3]              
    1.content " blue two"         "Apple blue two" "Apple blue "     
    2.content "Pear yellow five"  "Pear  five"     "Pear yellow five"
    3.content "Banana yellow two" "Banana  two"    "Banana yellow "  
    

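    Note that the resulting corpora are held in a nested list, one corpus per word (which is why two sapply calls are needed to view the contents).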
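
If the goal is a single cleaned corpus rather than one corpus per word, the sequential removal can be folded over the word list. The following is a minimal sketch using the asker's sample data. Using `Reduce` is an illustration and not part of the original answer, and it should behave like the asker's third loop. The names `clean_corpus` and `cleaned_text` are hypothetical.

```r
library(tm)

# Sample data from the question
docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
tm_corpus <- Corpus(VectorSource(docs))
words <- as.list(c("Apple", "yellow", "two"))

# Fold over the word list: each step removes one word from the whole corpus
# and hands the resulting corpus to the next step (tm_corpus is the start value).
clean_corpus <- Reduce(
  function(corpus, word) tm_map(corpus, removeWords, word),
  words,
  tm_corpus
)

# The number of documents should be unchanged
length(clean_corpus)

# View the text of every document, not only the first one
cleaned_text <- unname(sapply(clean_corpus, as.character))
print(cleaned_text)
```

Because every step calls tm_map on the whole corpus, nothing is assigned into a single-document slice such as `tm_corpusClean[i]`, which is the step that raised the errors in the first two loops. Calling `inspect(clean_corpus[[1]])` would still show only the first document, since `[[1]]` selects one document.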

    【Discussion】:

    • Hi Adam, thanks for your answer. Your code runs, but it gives me NA instead of the output you show here: obj1 <- tm_map(tm_corpus, removeWords, unlist(words)) sapply(obj1, `[`, "content") [1] NA NA NA obj2 <- lapply(words, function(word) tm_map(tm_corpus, removeWords, word)) sapply(obj2, function(x) sapply(x, `[`, "content")) [,1] [,2] [,3] [1,] NA NA NA [2,] NA NA NA [3,] NA NA NA Sorry, I couldn't figure out how to add line breaks.
    • For obj1 <- tm_map(tm_corpus, removeWords, unlist(words)), what do you get if you check obj1[[1]]$content?
    • obj1[[1]]$content does return [1] " blue ", so the NA only appears after running sapply(obj1, `[`, "content"), which gives [1] NA NA NA. But it seems to have worked on the corpus itself. :)
    • That's strange. `[` should be equivalent to $
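
One possible explanation for the NA values, offered as an assumption and not as a confirmed diagnosis: depending on the installed tm version, `Corpus()` may return a SimpleCorpus whose elements reach `sapply` as bare character strings instead of PlainTextDocument objects, and looking up the name "content" on an unnamed string gives NA. The sketch below shows that base R behaviour and reads the text with `as.character`, which does not depend on a "content" component. The names `string_lookup` and `cleaned_text` are hypothetical.

```r
library(tm)

docs <- c("Apple blue two", "Pear yellow five", "Banana yellow two")
words <- c("Apple", "yellow", "two")
obj1 <- tm_map(Corpus(VectorSource(docs)), removeWords, words)

# Base R behaviour: asking an unnamed character vector for the element
# named "content" gives NA. This would produce NA NA NA if each element
# passed to sapply() is a bare string (an assumption about the corpus class).
string_lookup <- " blue "["content"]
is.na(string_lookup)

# Check which corpus class Corpus() returned in this installation
class(obj1)

# as.character() does not rely on each element having a "content" component
cleaned_text <- unname(sapply(obj1, as.character))
print(cleaned_text)
```

If `class(obj1)` reports a VCorpus, the elements are PlainTextDocument objects and the `[` lookup from the answer should behave as shown there.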