【Question title】: Tf-idf of strings from csv file
【Posted】: 2014-07-29 20:25:04
【Question】:

My test.csv file is (no header):

very good, very bad, you are great
very bad, good restaurent, nice place to visit

I want my corpus to be split on "," so that each comma-separated phrase becomes one term and my final DocumentTermMatrix looks like this:

                                   terms
 docs    very good   very bad   you are great   good restaurent   nice place to visit
 doc1    tf-idf      tf-idf     tf-idf          0                 0
 doc2    0           tf-idf     0               tf-idf            tf-idf

If I do not load the documents from a csv file, I can generate the DTM above correctly, as follows:

library(tm)
docs <- c(D1 = "very good, very bad, you are great", 
    D2 = "very bad, good restaurent, nice place to visit")

dd <- Corpus(VectorSource(docs))
dd <- tm_map(dd, function(x) {
    PlainTextDocument(
       gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
       id=ID(x)
     )
})
inspect(dd)

# A corpus with 2 text documents
# 
# The metadata consists of 2 tag-value pairs and a data frame
# Available tags are:
#   create_date creator 
# Available variables in the data frame are:
#   MetaID 

# $D1
# very~good
# very~bad
# you~are~great
# 
# $D2
# very~bad
# good~restaurent
# nice~place~to~visit

dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
as.matrix(dtm)

This produces

# Docs good~restaurent nice~place~to~visit very~bad very~good you~are~great
#   D1       0.0000000           0.0000000        0 0.3333333     0.3333333
#   D2       0.3333333           0.3333333        0 0.0000000     0.0000000
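The numbers above can be checked by hand. The following is a minimal base-R sketch, assuming the weighting is term frequency normalised by document length multiplied by log2(N / document frequency), which is consistent with the output shown. The variable names (counts, tf, idf, tfidf) are illustrative only and not part of tm.

```r
# Terms per document after splitting on commas and joining words with "~"
docs <- list(
  D1 = c("very~good", "very~bad", "you~are~great"),
  D2 = c("very~bad", "good~restaurent", "nice~place~to~visit")
)

terms <- sort(unique(unlist(docs)))

# Raw counts: one row per document, one column per term
counts <- sapply(terms, function(term) sapply(docs, function(d) sum(d == term)))

tf    <- counts / rowSums(counts)                  # share of the document taken by each term
idf   <- log2(nrow(counts) / colSums(counts > 0))  # log2(N / number of docs containing the term)
tfidf <- sweep(tf, 2, idf, `*`)                    # multiply each column by its idf

round(tfidf, 7)
```

Under this assumption very~bad scores 0 in both documents because it occurs in both, so its idf is log2(2/2) = 0, while every term unique to one document gets (1/3) * log2(2/1) = 0.3333333.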

If I load the documents from the csv file, only the first term of each document ends up in the corpus, as shown below:

> file_loc <- "testdata.csv"
> require(tm)
  Loading required package: tm
> x <- read.csv(file_loc, header = FALSE)
> x <- data.frame(lapply(x, as.character), stringsAsFactors=FALSE)
> dd <- Corpus(DataframeSource(x))
> dd <- tm_map(dd, stripWhitespace)
> dd <- tm_map(dd, tolower)
>  dd <- tm_map(dd, function(x) {
            PlainTextDocument(
            gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
            id=ID(x)
            )
          })
> inspect(dd)

Only the first term is included, like this:

# $D1
# very~good

# 
# $D2
# very~bad

How can I include all the terms and create the DocumentTermMatrix as above?

【Comments】:

    Tags: r csv machine-learning information-retrieval tf-idf


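    【Answer 1】:

    You are not reading the data correctly. read.csv splits each line on the commas into separate columns, so a document no longer arrives as one comma-separated string. I read the file with scan instead, one line per document. The following works: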

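    library(tm)

    # Read each line of the file as one string (one document per line)
    docs <- scan("testdata.csv", what = "character", sep = "\n")
    
    dd <- Corpus(VectorSource(docs))
    # Split each document on commas and join the words of a phrase with "~"
    dd <- tm_map(dd, function(x) {
      PlainTextDocument(
        gsub("\\s+","~",strsplit(x,",\\s*")[[1]]), 
        id=ID(x)
      )
    })
    inspect(dd)
    
    # Build the tf-idf weighted document-term matrix
    dtm <- DocumentTermMatrix(dd, control = list(weighting = weightTfIdf))
    as.matrix(dtm)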
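
To see why the way the file is read matters, here is a small base-R sketch that does not need tm. It writes the two sample lines to a temporary file (a hypothetical stand-in for testdata.csv) and compares read.csv with scan. The last part is a plausible explanation of the symptom, assuming the split cells of a row reach the function as a vector of strings; first_cell and phrases are illustrative names.

```r
# Write the two sample lines to a temporary file (stand-in for testdata.csv)
path <- tempfile(fileext = ".csv")
writeLines(c("very good, very bad, you are great",
             "very bad, good restaurent, nice place to visit"), path)

# read.csv splits each line on commas: 2 rows x 3 columns
df <- read.csv(path, header = FALSE, stringsAsFactors = FALSE)
dim(df)

# scan with sep = "\n" keeps each line whole: a character vector of length 2
docs <- scan(path, what = "character", sep = "\n", quiet = TRUE)
length(docs)

# strsplit returns one list element per input string, so [[1]] applied to
# the already-split cells of a row keeps only the first cell
first_cell <- strsplit(unlist(df[1, ]), ",\\s*")[[1]]
first_cell

# Applied to a whole line, [[1]] holds every phrase of that document
phrases <- gsub("\\s+", "~", strsplit(docs[1], ",\\s*")[[1]])
phrases
```

With the whole line kept as one string, the same strsplit and gsub step yields all three phrases of the first document instead of just the first one.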
    

    【Comments】:
