【发布时间】:2014-08-03 05:03:41
【问题描述】:
我有以下代码:
# returns string w/o leading or trailing whitespace
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
news_corpus <- Corpus(VectorSource(news_raw$text)) # a column of strings.
corpus_clean <- tm_map(news_corpus, tolower)
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords('english'))
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
corpus_clean <- tm_map(corpus_clean, trim)
news_dtm <- DocumentTermMatrix(corpus_clean) # errors here
当我运行DocumentTermMatrix() 方法时,它给了我这个错误:
错误:inherits(doc, "TextDocument") 不是 TRUE
为什么会出现此错误?我的行不是文本文档吗?
这是检查corpus_clean时的输出:
[[153]]
[1] obama holds technical school model us
[[154]]
[1] oil boom produces jobs bonanza archaeologists
[[155]]
[1] islamic terrorist group expands territory captures tikrit
[[156]]
[1] republicans democrats feel eric cantors loss
[[157]]
[1] tea party candidates try build cantor loss
[[158]]
[1] vehicles materials stored delaware bridges
[[159]]
[1] hill testimony hagel defends bergdahl trade
[[160]]
[1] tweet selfpropagates tweetdeck
[[161]]
[1] blackwater guards face trial iraq shootings
[[162]]
[1] calif man among soldiers killed afghanistan
[[163]]
[1] stocks fall back world bank cuts growth outlook
[[164]]
[1] jabhat alnusra longer useful turkey
[[165]]
[1] catholic bishops keep focus abortion marriage
[[166]]
[1] barbra streisand visits hill heart disease
[[167]]
[1] rand paul cantors loss reason stop talking immigration
[[168]]
[1] israeli airstrike kills northern gaza
编辑:这是我的数据:
type,text
neutral,The week in 32 photos
neutral,Look at me! 22 selfies of the week
neutral,Inside rebel tunnels in Homs
neutral,Voices from Ukraine
neutral,Water dries up ahead of World Cup
positive,Who's your hero? Nominate them
neutral,Anderson Cooper: Here's how
positive,"At fire scene, she rescues the pet"
neutral,Hunger in the land of plenty
positive,Helping women escape 'the life'
neutral,A tour of the sex underworld
neutral,Miss Universe Thailand steps down
neutral,China's 'naked officials' crackdown
negative,More held over Pakistan stoning
neutral,Watch landmark Cold War series
neutral,In photos: History of the Cold War
neutral,Turtle predicts World Cup winner
neutral,What devoured great white?
positive,Nun wins Italy's 'The Voice'
neutral,Bride Price app sparks debate
neutral,China to deport 'pork' artist
negative,Lightning hits moving car
neutral,Singer won't be silenced
neutral,Poland's mini desert
neutral,When monarchs retire
negative,Murder on Street View?
positive,Meet armless table tennis champ
neutral,Incredible 400 year-old globes
positive,Man saves falling baby
neutral,World's most controversial foods
我像这样检索:
news_raw <- read.csv('news_csv.csv', stringsAsFactors = F)
编辑:这是回溯():
> news_dtm <- DocumentTermMatrix(corpus_clean)
Error: inherits(doc, "TextDocument") is not TRUE
> traceback()
9: stop(sprintf(ngettext(length(r), "%s is not TRUE", "%s are not all TRUE"),
ch), call. = FALSE, domain = NA)
8: stopifnot(inherits(doc, "TextDocument"), is.list(control))
7: FUN(X[[1L]], ...)
6: lapply(X, FUN, ...)
5: mclapply(unname(content(x)), termFreq, control)
4: TermDocumentMatrix.VCorpus(x, control)
3: TermDocumentMatrix(x, control)
2: t(TermDocumentMatrix(x, control))
1: DocumentTermMatrix(corpus_clean)
当我评估 inherits(corpus_clean, "TextDocument") 时,它是 FALSE。
【问题讨论】:
-
如果我使用
data(crude); news_corpus <- crude;然后运行所有转换,我不会收到错误消息。news_raw$text到底是什么样的?这是什么课? -
是一个字符类。这听起来不对 - 我该如何更改?
-
其实“字”是对的。这就是 R 所说的。其他语言可能称它们为字符串。但就目前而言,我无法在没有数据的情况下重现您的问题。您能否提供一个最小的、可重现的示例,我可以运行该示例以得到相同的错误?
-
您仍然收到错误消息吗?我想添加
traceback()的结果应该有望识别发生错误的(子)函数。收到错误后只需运行该命令即可。 -
这是个问题。你真的像上面一样运行代码吗?
Corpus(VectorSource(news_raw$text))应该将所有内容都转换为纯文本文档。当我运行sapply( ,class)时,我得到character, PlainTextDocument, TextDocument。