[Posted]: 2021-03-16 07:32:59
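[Question]:
I need to count how often particular words occur, and there are a lot of them. I know how to do this by putting all the words into a single group (see below), but I would like a separate count for each specific word.
This is what I have so far: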
library(quanteda)
#function to count
strcount <- function(x, pattern, split){unlist(lapply(strsplit(x, split),function(z) na.omit(length(grep(pattern, z)))))}
txt <- "Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents."
df<-data.frame(txt)
mydict<-dictionary(list(all_terms=c("clouds","storms")))
corp <- corpus(df, text_field = 'txt')
#count terms and save output to "overview"
overview<-dfm(corp,dictionary = mydict)
overview<-convert(overview, to ='data.frame')
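As you can see, the counts for "clouds" and "storms" are combined under the "all_terms" category in the resulting data.frame. Is there an easy way to get the count for every term in "mydict" in its own column, without writing code for each individual term?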
E.g.
clouds, storms
1, 1
Rather than
all_terms
2
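One possible way to get there, offered as a minimal sketch and not as the only approach: build a dictionary with one key per term, so that each term becomes its own column after the lookup. The names `terms`, `term_dict`, `toks` and `per_term` are made up for illustration, the text is shortened to one sentence of the original, and the sketch assumes a quanteda version in which a dfm is built from a `tokens` object.

```r
library(quanteda)

# Shortened sample text (one sentence from the original passage)
txt <- "Yet, every so often the oath is taken amidst gathering clouds and raging storms."

# Terms to count individually (hypothetical name; extend as needed)
terms <- c("clouds", "storms")

# One dictionary key per term, so each term gets its own column
term_dict <- dictionary(setNames(as.list(terms), terms))

# Tokenize, build the document-feature matrix, then apply the dictionary
toks <- tokens(corpus(txt))
per_term <- dfm_lookup(dfm(toks), dictionary = term_dict)

# Convert to a data.frame with one column per term
overview <- convert(per_term, to = "data.frame")
print(overview)
```

Because every term is its own dictionary key, a term that never occurs should still appear as a column with a count of 0, which may be convenient when the list of terms is long.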
[Discussion]:
Tags: r nlp data-science quanteda