【Question Title】: In R, how can I count specific words in a corpus?
【Posted】: 2021-03-16 07:32:59
【Question】:

I need to count the frequency of specific words, and there are a lot of them. I know how to do this by putting all the words into a single group (see below), but I would like to get a count for each individual word.

Here is what I have so far:

library(quanteda)

# helper to count pattern matches (not used in the dictionary approach below)
strcount <- function(x, pattern, split) {
  unlist(lapply(strsplit(x, split), function(z) na.omit(length(grep(pattern, z)))))
}

txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
df <- data.frame(txt)
mydict <- dictionary(list(all_terms = c("clouds", "storms")))
corp <- corpus(df, text_field = 'txt')
# count terms and save output to "overview"
overview <- dfm(corp, dictionary = mydict)
overview <- convert(overview, to = 'data.frame')
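
For what it's worth, the `strcount()` helper above can already produce per-term counts if you map it over a vector of patterns with `sapply()` — a quick sketch (note that `grep()` matches substrings, so "storms." with trailing punctuation still counts):

```r
# the strcount() helper from above, applied once per pattern
strcount <- function(x, pattern, split) {
  unlist(lapply(strsplit(x, split), function(z) na.omit(length(grep(pattern, z)))))
}

txt <- "amidst gathering clouds and raging storms"

# sapply() maps the helper over each term, giving a named vector of counts
counts <- sapply(c("clouds", "storms"), function(p) strcount(txt, p, " "))
counts
```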

As you can see, the counts for "clouds" and "storms" end up lumped together under the "all_terms" key in the resulting data.frame. Is there a simple way to get the count of every term in "mydict" in its own column, without writing code for each individual term?

E.g.
clouds, storms
1, 1

Rather than 
all_terms
2

【Discussion】:

    Tags: r nlp data-science quanteda


    【Solution 1】:

    You want to use the dictionary values as the pattern in tokens_select(), rather than using them in a lookup function, which is what dfm(x, dictionary = ...) does. Here's how:

    library("quanteda")
    ## Package version: 2.1.2
    
    txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
    
    mydict <- dictionary(list(all_terms = c("clouds", "storms")))
    

    This creates a dfm in which each column is a term, rather than a dictionary key:

    dfmat <- tokens(txt) %>%
      tokens_select(mydict) %>%
      dfm()
    
    dfmat
    ## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
    ##        features
    ## docs    clouds storms
    ##   text1      1      1
    

    You can convert this into a data.frame of counts in two ways:

    convert(dfmat, to = "data.frame")
    ##   doc_id clouds storms
    ## 1  text1      1      1
    
    textstat_frequency(dfmat)
    ##   feature frequency rank docfreq group
    ## 1  clouds         1    1       1   all
    ## 2  storms         1    1       1   all
    

    While a dictionary is valid input for pattern (see ?pattern), you can also supply a character vector of the values directly to tokens_select():

    # no need for dictionary
    tokens(txt) %>%
      tokens_select(c("clouds", "storms")) %>%
      dfm()
    ## Document-feature matrix of: 1 document, 2 features (0.0% sparse).
    ##        features
    ## docs    clouds storms
    ##   text1      1      1
    
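
    One caveat: tokens_select() drops terms that never occur, so a term with zero hits simply has no column in the dfm. If you need a column for every requested term regardless, quanteda's dfm_match() can pad the matrix with zero columns — a minimal sketch, where the extra term "sunshine" is an invented example that does not appear in the text:

```r
library(quanteda)

txt <- "amidst gathering clouds and raging storms"
terms <- c("clouds", "storms", "sunshine")  # "sunshine" never occurs in txt

# dfm_match() guarantees one column per requested term,
# filling in zero counts for terms absent from the text
dfmat <- tokens(txt) %>%
  tokens_select(terms) %>%
  dfm() %>%
  dfm_match(features = terms)

convert(dfmat, to = "data.frame")
```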

    【Discussion】:

      【Solution 2】:

      You can combine the unnest_tokens() function from tidytext with pivot_wider() from tidyr to get the count of each word in a separate column:

      library(dplyr)
      library(tidytext)
      library(tidyr)
      
      txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
      
      mydict <- c("clouds","storms")
      
      df <- data.frame(text = txt) %>% 
        unnest_tokens(word, text) %>%
        count(word) %>% 
        pivot_wider(names_from = word, values_from = n)
      
      df %>% select(mydict)
      
      # A tibble: 1 x 2
        clouds storms
         <int>  <int>
      1      1      1
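
      One caveat with select(mydict): if a dictionary term never occurs in the text, pivot_wider() produces no column for it, and selecting it by name throws an error. A hedged sketch using tidyselect's any_of(), with the invented term "sunshine" standing in for a word absent from the text:

```r
library(dplyr)
library(tidytext)
library(tidyr)

txt <- "amidst gathering clouds and raging storms"
mydict <- c("clouds", "storms", "sunshine")  # "sunshine" never occurs in txt

df <- data.frame(text = txt) %>%
  unnest_tokens(word, text) %>%   # one row per token
  count(word) %>%                 # count each distinct word
  pivot_wider(names_from = word, values_from = n)

# any_of() silently keeps only the terms that actually appear,
# so the missing "sunshine" does not raise an error
df %>% select(any_of(mydict))
```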
      

      【Discussion】:
