Finding the most frequent term in each document of a corpus: answers

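【Question title】: Finding most frequent term in each document of a corpus
【Posted】: 2025-11-26 10:50:01
【Question description】:

I have been using the tm package for R with a lot of success on a classification problem. I know how to find the most frequent terms across the whole corpus (with findFreqTerms()), but I can't see anything in the documentation that finds the most frequent term in each individual document of the corpus (after I have stemmed and removed stopwords, but before I remove sparse terms). I have tried apply() with max, but that gives me the maximum number of times a term occurs in each document, not the name of the term itself.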

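library(tm)

data("crude")
# Clean the corpus: punctuation, whitespace, case, stopwords, then stemming
corpus <- tm_map(crude, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
# content_transformer() (tm 0.6 and later) keeps each document a text document
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("English"))
corpus <- tm_map(corpus, stemDocument)
dtm <- DocumentTermMatrix(corpus)
# Returns the highest count per document, not the term that has it
maxterms <- apply(dtm, 1, max)
maxterms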
127 144 191 194 211 236 237 242 246 248 273 349 352 
 5  13   2   3   3  10   8   3   7   9   9   4   5 
353 368 489 502 543 704 708 
 4   4   4   5   5   9   4

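Any ideas?

【Question discussion】:

Tags: r apply text-mining tm

【Solution 1】:

Ben's answer gives you what you asked for, but I'm not sure that what you asked for is wise: it does not account for ties. Here is one approach, and a second that uses the qdap package. Both give you lists of words (in the qdap case, a list of data frames holding the word and its frequency). You can get the rest of the way with unlist for the first option, and with lapply, indexing and unlist for qdap. The qdap approach works on the raw Corpus: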

Option #1:

apply(dtm, 1, function(x) unlist(dtm[["dimnames"]][2], 
    use.names = FALSE)[x == max(x)])
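
To make the tie handling concrete, here is a minimal base R sketch of the same `x == max(x)` idea. The count matrix is made-up toy data, not the crude corpus, and `lapply` over row indices is used here instead of `apply` only so that the result is always a list (`apply` simplifies to a vector or matrix when every row yields the same number of terms).

```r
# Toy document-term counts (hypothetical data, not the crude corpus).
counts <- matrix(
  c(3, 3, 1,
    0, 2, 5,
    4, 1, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2", "d3"),
                  Terms = c("oil", "price", "opec"))
)

# For each document (row), keep every term whose count equals the row maximum,
# so tied terms are all reported.
top_terms <- lapply(seq_len(nrow(counts)), function(i) {
  x <- counts[i, ]
  names(x)[x == max(x)]
})
names(top_terms) <- rownames(counts)
top_terms
```

In this toy data, d1 and d3 each have two terms tied for the maximum, so their list entries contain both terms.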

Option #2, with qdap:

library(qdap)
dat <- tm_corpus2df(crude)
tapply(stemmer(dat$text), dat$docs, freq_terms, top = 1, 
    stopwords = tm::stopwords("English"))

Wrapping the tapply call in lapply(WRAP_HERE, "[", 1) makes the two answers nearly identical in content and format.

EDIT: Added a leaner example that uses qdap:
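
As a rough illustration of what that wrapping does, the sketch below applies `lapply(..., "[", 1)` to a hand-built list of data frames. The list, its column names `WORD` and `FREQ`, and the values are assumptions made up for the example rather than real qdap output; the `"[["` variant is an extra not mentioned in the answer.

```r
# Hypothetical stand-in for a per-document list of word/frequency data frames.
freq_list <- list(
  d1 = data.frame(WORD = c("oil", "price"), FREQ = c(5, 5),
                  stringsAsFactors = FALSE),
  d2 = data.frame(WORD = "opec", FREQ = 13,
                  stringsAsFactors = FALSE)
)

# "[" with index 1 keeps the first column but stays a data frame.
words_only <- lapply(freq_list, "[", 1)

# "[[" with index 1 extracts the first column as a plain character vector.
word_vecs <- lapply(freq_list, "[[", 1)

words_only
word_vecs
```

Using `"[["` instead of `"["` is one way to get bare character vectors per document, which is the shape that Option #1 returns.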

FUN <- function(x) freq_terms(x, top = 1, stopwords = stopwords("English"))[, 1]
lapply(stemmer(crude), FUN)

## [[1]]
## [1] "oil"   "price"
## 
## [[2]]
## [1] "opec"
## 
## [[3]]
## [1] "canada"   "canadian" "crude"    "oil"      "post"     "price"    "texaco"  
## 
## [[4]]
## [1] "crude"
## 
## [[5]]
## [1] "estim"  "reserv" "said"   "trust" 
## 
## [[6]]
## [1] "kuwait" "said"  
## 
## [[7]]
## [1] "report" "say"   
## 
## [[8]]
## [1] "yesterday"
## 
## [[9]]
## [1] "billion"
## 
## [[10]]
## [1] "market" "price" 
## 
## [[11]]
## [1] "mln"
## 
## [[12]]
## [1] "oil"
## 
## [[13]]
## [1] "oil"   "price"
## 
## [[14]]
## [1] "oil"  "opec"
## 
## [[15]]
## [1] "power"
## 
## [[16]]
## [1] "oil"
## 
## [[17]]
## [1] "oil"
## 
## [[18]]
## [1] "dlrs"
## 
## [[19]]
## [1] "futur"
## 
## [[20]]
## [1] "januari"

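【Discussion】:

Agreed. Ben, if you don't mind, I'll move the accepted answer to this one.

【Solution 2】:

You're nearly there. Replace max with which.max to get, for each document (that is, each row), the column index of the term with the highest frequency. Then use that vector of column indices to subset the terms (in effect, the column names) of the document-term matrix. This returns the actual term with the maximum frequency in each document, rather than just the frequency value that max gave you. So, following your example: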

maxterms<-apply(dtm, 1, which.max)
dtm$dimnames$Terms[maxterms]
 [1] "oil"     "opec"    "canada"  "crude"   "said"    "said"    "report"  "oil"    
 [9] "billion" "oil"     "mln"     "oil"     "oil"     "oil"     "power"   "oil"    
[17] "oil"     "dlrs"    "futures" "january"
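
The same which.max pattern can be tried on a small base R matrix without tm. This is only a sketch on made-up counts; it also shows that `which.max` returns the first maximum in a row, so a tied term further along the row is silently dropped.

```r
# Toy document-term counts (hypothetical data, not the crude corpus).
counts <- matrix(
  c(3, 3, 1,
    0, 2, 5,
    4, 1, 4),
  nrow = 3, byrow = TRUE,
  dimnames = list(Docs = c("d1", "d2", "d3"),
                  Terms = c("oil", "price", "opec"))
)

# Column index of the first maximum in each row.
idx <- apply(counts, 1, which.max)

# Map the indices back to term names, labelled by document.
top <- colnames(counts)[idx]
names(top) <- rownames(counts)
top
```

Here d1 has "oil" and "price" tied at 3, but only "oil" is returned because it comes first in the row.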

【Discussion】: