【Question title】: Subscript out of bounds error in document-term matrix
【Posted】: 2021-10-04 12:46:04
【Question】:

I am doing text mining on the data below, but at the end I get the following error:

Error in `[.simple_triplet_matrix`(dtm, 1:10, 1:10) : 
  subscript out of bounds

Can you help me resolve this error?

library(rvest)  
library(tm)
library(SnowballC)
wiki_url <- read_html("https://wiki.socr.umich.edu/index.php/SOCR_Data_2011_US_JobsRanking")    
html_nodes(wiki_url, "#content")    
job <- html_table(html_nodes(wiki_url, "table")[[1]])   
head(job)   

#'  
#' ## Step 1: make a VCorpus object 
#'  
#'  
jobCorpus<-VCorpus(VectorSource(job[, 10])) 
#'  
#'  
#' ## Step 2: clean the VCorpus object  
#'  
#'  
jobCorpus<-tm_map(jobCorpus, tolower)   
for(j in seq(jobCorpus)){   
  jobCorpus[[j]] <- gsub("_", " ", jobCorpus[[j]])  
}   
#   
#   
jobCorpus<-tm_map(jobCorpus, removeWords, stopwords("english")) 
jobCorpus<-tm_map(jobCorpus, removePunctuation) 
jobCorpus<-tm_map(jobCorpus, stripWhitespace)   
jobCorpus<-tm_map(jobCorpus, PlainTextDocument) 
jobCorpus<-tm_map(jobCorpus, stemDocument)  
#
#   
# build document-term matrix    
#   
# Term Document Matrix (TDM) objects (`tm::DocumentTermMatrix`) contain a sparse term-document matrix or document-term matrix and attribute weights of the matrix.  
#   
# First make sure that we got a clean VCorpus object    
#   
jobCorpus[[1]]$content  
#   
#   
# Then we can start to build the DTM and reassign labels to the `Docs`. 

    
dtm<-DocumentTermMatrix(jobCorpus)  
dtm 
dtm$dimnames$Docs<-as.character(1:200)  
inspect(dtm[1:10, 1:10]) ###<-- error happens from here 

#' Let's subset the `dtm` into top 30 jobs and bottom 100 jobs. 
    
    
dtm_top30<-dtm[1:30, ]  
dtm_bot100<-dtm[101:200, ]  

【Comments】:

    Tags: r list tm


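    【Solution 1】:

    There are two problems. First, calling `tolower` directly in `tm_map` strips too much information from the corpus; wrap it in `content_transformer()` so the documents keep their structure. Second, use `DataframeSource` instead of `VectorSource`. With `VectorSource(job[, 10])` you load a single document containing 200 lines rather than 200 documents of one line each, so the document-term matrix has only one row and `dtm[1:10, 1:10]` is out of bounds.

    The code below works. I start from the point where you create the `job` data.frame: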
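The second point can be checked on a small table. The following is a minimal sketch, not the original data: `toy` is a hypothetical three-row stand-in for the scraped `job` table, and `drop = FALSE` is used to mimic what `job[, 10]` returns when `job` is a tibble (an assumption about the asker's rvest version).

```r
library(tm)

# Hypothetical stand-in for the scraped `job` table (3 rows instead of 200)
toy <- data.frame(
  Index = 1:3,
  Description = c("design software systems",
                  "treat medical patients",
                  "drive delivery trucks"),
  stringsAsFactors = FALSE
)

# A one-column data frame is a list of length 1, so VectorSource sees a
# single element and the corpus ends up with a single document
corpus_df <- VCorpus(VectorSource(toy[, "Description", drop = FALSE]))

# A plain character vector yields one document per element
corpus_vec <- VCorpus(VectorSource(toy$Description))

n_df <- length(corpus_df)
n_vec <- length(corpus_vec)
print(c(one_column_df = n_df, character_vector = n_vec))

# The matrix built from the one-document corpus has a single row, so any
# row index beyond 1 falls outside its bounds
dtm_df <- DocumentTermMatrix(corpus_df)
print(dim(dtm_df))
```

Checking `length(jobCorpus)` or `dim(dtm)` right after building them is a cheap way to spot this kind of problem before indexing.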

    # DataframeSource needs the columns doc_id and text; you could also rename 2 columns in job.
    # Instead of building doc_id as doc_<n>, you could also use the job title column.
    job_for_corpus <- data.frame(doc_id = paste0("doc_", job$Index),
                                 text = job$Description, stringsAsFactors = FALSE)
    
    # no need for loop, just use gsub on data.frame column
    job_for_corpus$text <- gsub("_", " ", job_for_corpus$text)
    
    # create corpus
    jobCorpus <- VCorpus(DataframeSource(job_for_corpus))
    
    # clean text
    jobCorpus <- tm_map(jobCorpus, content_transformer(tolower))   
    jobCorpus <- tm_map(jobCorpus, removeWords, stopwords("english")) 
    jobCorpus <- tm_map(jobCorpus, removePunctuation) 
    jobCorpus <- tm_map(jobCorpus, stripWhitespace)   
    jobCorpus <- tm_map(jobCorpus, stemDocument)  
    
    
    jobCorpus[[1]]$content  
    [1] "research design develop maintain softwar system along hardwar develop medic scientif industri purpos"
    
    # create document term matrix
    dtm <- DocumentTermMatrix(jobCorpus)  
    
    inspect(dtm[1:10, 1:10]) 
    <<DocumentTermMatrix (documents: 10, terms: 10)>>
    Non-/sparse entries: 2/98
    Sparsity           : 98%
    Maximal term length: 7
    Weighting          : term frequency (tf)
    Sample             :
            Terms
    Docs     16wheel abnorm access accid accord account accur achiev act activ
      doc_1        0      0      0     0      0       0     0      0   0     0
      doc_10       0      0      0     0      0       0     0      0   0     0
      doc_2        0      0      0     0      0       0     0      0   0     0
      doc_3        0      0      0     1      0       0     0      0   0     0
      doc_4        0      0      0     0      0       0     0      0   0     0
      doc_5        0      0      0     0      0       0     0      0   0     0
      doc_6        0      0      0     0      0       0     0      0   0     0
      doc_7        0      0      0     0      0       0     0      0   0     0
      doc_8        0      0      0     0      1       0     0      0   0     0
      doc_9        0      0      0     0      0       0     0      0   0     0
    
    # rest of your code
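As a rough sketch of how the remaining steps from the question might look once the corpus holds one document per row: the data frame `toy` below is a hypothetical 200-row stand-in for `job_for_corpus`, and its text is made up for illustration. Because `DataframeSource` already carries the `doc_id` values into the matrix, reassigning `dtm$dimnames$Docs` should no longer be necessary.

```r
library(tm)

# Hypothetical 200-row stand-in for `job_for_corpus`
toy <- data.frame(
  doc_id = paste0("doc_", 1:200),
  text = paste("job description number", 1:200),
  stringsAsFactors = FALSE
)

dtm <- DocumentTermMatrix(VCorpus(DataframeSource(toy)))

# Guard against indexing past the last document before subsetting
stopifnot(nDocs(dtm) >= 200)

# Same subsets as in the question: top 30 and bottom 100 jobs
dtm_top30 <- dtm[1:30, ]
dtm_bot100 <- dtm[101:200, ]

print(dim(dtm_top30))
print(dim(dtm_bot100))
```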
    

    【Comments】:

      【Solution 2】:

      As an alternative to the answer provided by @phiver: after `head(job)`, convert `job` to a `list`......

      jobs

      ....

      【Comments】:
