Tf-idf of strings from csv file: answers

[Question title]: Tf-idf of strings from csv file
[Posted]: 2014-07-29 20:25:04
[Question description]:

My test.csv file is (no header):

very good, very bad, you are great
very bad, good restaurent, nice place to visit

I want my corpus to be split on commas (,), so that my final DocumentTermMatrix becomes:

      terms
 docs       very good      very bad        you are great   good restaurent   nice place to visit
  doc1       tf-idf          tf-idf         tf-idf          0                    0
  doc2       0                tf-idf         0                tf-idf             tf-idf

If I do not load the documents from the csv file, I can generate the above DTM correctly, as follows:

library(tm)
docs <- c(D1 = "very good, very bad, you are great", 
    D2 = "very bad, good restaurent, nice place to visit")

dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
    PlainTextDocument(
       gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
       id=ID(x)
     )
})
inspect(dd)

# A corpus with 2 text documents
# 
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
#   create_date creator 
# Available variables in the data frame are:
#   MetaID 

# $D1
# very~good
# very~bad
# you~are~great
# 
# $D2
# very~bad
# good~restaurent
# nice~place~to~visit

dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)

This produces:

# Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
#   D1       0.0000000           0.0000000        0 0.3333333     0.3333333
#   D2       0.3333333           0.3333333        0 0.0000000     0.0000000
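The zero in the very~bad column is expected: that term occurs in both documents, so its inverse document frequency is log2(2/2) = 0. A minimal base R sketch of the arithmetic, assuming the weighting is term frequency normalised by document length multiplied by log2(N/df) (my understanding of weightTfIdf with its default settings, not a copy of its implementation); the variable names are illustrative only:

```r
# Terms per document after splitting on commas (taken from the example above)
docs <- list(
  D1 = c("very~good", "very~bad", "you~are~great"),
  D2 = c("very~bad", "good~restaurent", "nice~place~to~visit")
)

terms <- sort(unique(unlist(docs)))

# Raw term counts: one row per document, one column per term
counts <- t(sapply(docs, function(d) table(factor(d, levels = terms))))

# Term frequency normalised by the number of terms in each document
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of docs / docs containing the term)
idf <- log2(nrow(counts) / colSums(counts > 0))

# Multiply each column of tf by the matching idf value
tfidf <- sweep(tf, 2, idf, `*`)
round(tfidf, 7)
```

Under that assumption, every term that occurs in only one document gets 1/3 * log2(2/1) = 0.3333333, which matches the matrix above.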

If I load the documents from the csv file, only the first term of each document is included, as shown below:

> file_loc <- "testdata.csv"
> require(tm)
  Loading required package: tm
> x <- read.csv(file_loc, header = FALSE)
> x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)
> dd <- Corpus(DataframeSource(x))
> dd <- tm_map(dd, stripWhitespace)
> dd <- tm_map(dd, tolower)
>  dd <- tm_map(dd, function(x) {
            PlainTextDocument(
            gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
            id=ID(x)
            )
          })
> inspect(dd)

Only the first term is included, like this:

# $D1
# very~good

# 
# $D2
# very~bad

How can I include all the terms and create the DocumentTermMatrix as above?

[Question discussion]:

Tags: r csv machine-learning information-retrieval tf-idf

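[Solution 1]:

You are reading the data incorrectly. read.csv splits every line on the commas into separate columns, so the strsplit(...)[[1]] step most likely sees only the first field of each row. I read the file with scan instead, one line per document. The following works: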

# Read each line of the file as one string (one document per line)
docs <- scan("testdata.csv", what = "character", sep = "\n")

dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
  PlainTextDocument(
    gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
    id=ID(x)
  )
})
inspect(dd)

dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)
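To see the difference between the two ways of reading the file without tm, here is a small base R sketch. It writes the two example lines to a temporary file (a hypothetical stand-in for testdata.csv) and uses readLines, which, like scan with sep = "\n", keeps each line as a single string. How tm hands a data frame row to the mapping function depends on the tm version, so the last step is only an illustration of the likely cause:

```r
# Write the two example lines to a temporary file (stand-in for testdata.csv)
tmp <- tempfile(fileext = ".csv")
writeLines(c("very good, very bad, you are great",
             "very bad, good restaurent, nice place to visit"), tmp)

# read.csv treats the commas as field separators: 2 rows, 3 columns
as_table <- read.csv(tmp, header = FALSE, stringsAsFactors = FALSE)
dim(as_table)

# readLines keeps each line whole: a character vector of length 2
as_lines <- readLines(tmp)
length(as_lines)

# Split each line on commas, then join the words of each phrase with "~"
split_terms <- lapply(strsplit(as_lines, ",\\s*"),
                      function(p) gsub("\\s+", "~", p))
split_terms[[1]]

# A row of as_table is already three separate strings, so taking [[1]]
# after strsplit returns only the first field
row1 <- unlist(as_table[1, ])
first_only <- strsplit(row1, ",\\s*")[[1]]
first_only

unlink(tmp)
```

With whole lines as input, the split yields all three phrases per document, which is what the tm_map step in the answer relies on.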

[Discussion]: