R: Counting frequency of words from own dictionary - Answers

【Question title】: R: Counting frequency of words from own dictionary
【Posted】: 2022-07-16 04:18:20
【Question description】:

I have analyzed some Instagram posts and have already counted the number of words in each post (each row is one post), as shown here: Data

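What I want to do now is count all the green/sustainable words in each post and add that count as an extra column. I created a lexicon myself in which every green word has a polarity of 1 and every non-green word has a polarity of 0.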

Lexicon

How can I do this?

【Comments on the question】:

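Welcome to SO! Please post your data with dput() rather than as an image, as part of a reproducible example, so that people can help you.
It is easier to help you if you provide a reproducible example with sample input and the desired output, which can be used to test and verify possible solutions. Please do not post data or code as images, because we cannot easily copy/paste those values into R for testing.
The existing answers here may also help: stackoverflow.com/questions/7597559/…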
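As a rough illustration of what the comments above are asking for, the sketch below shows how dput() turns a data frame into text that can be pasted into a question. The `posts` data frame and its column names are hypothetical stand-ins, not the poster's actual data.

```r
# Hypothetical stand-in for the poster's data: one row per Instagram post
posts <- data.frame(
  Post = c("This is green!", "No target words here"),
  n_words = c(3L, 4L)
)

# dput() prints R code that recreates the object;
# paste that text into the question instead of a screenshot
dput(posts)

# Round trip: evaluating the dput() text rebuilds the data frame
posts_text <- paste(capture.output(dput(posts)), collapse = "\n")
posts_copy <- eval(parse(text = posts_text))
```

For a large data set, something like dput(head(posts, 10)) keeps the pasted text short while still giving answerers real values to work with.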

Tags: r dictionary frequency lexicon

【Solution 1】:

str_count() from stringr can help with this (and with many more string-based tasks; see this R4DS chapter).

library(stringr)

# Create a reproducible example
dat <- data.frame(Post = c(
      "This is a sample post without any target words",
      "Whilst this is green!",
      "And this is eco-friendly",
      "This is green AND eco-friendly!"))
lexicon <- data.frame(Word = c("green", "eco-friendly", "neutral"),
                      Polarity = c(1, 1, 0))

# Extract relevant words from lexicon
green_words <- lexicon$Word[lexicon$Polarity == 1]

# Create new variable
dat$n_green_words <- str_count(dat$Post, paste(green_words, collapse = "|"))

dat

Output:

#>                                             Post n_green_words
#> 1 This is a sample post without any target words             0
#> 2                          Whilst this is green!             1
#> 3                       And this is eco-friendly             1
#> 4                This is green AND eco-friendly!             2

Created on 2022-07-15 by the reprex package (v2.0.1)
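One caveat with the approach above: the pattern is a plain regular expression, so it is case-sensitive and also matches inside longer words (for example "green" inside "greenhouse"). A minimal sketch of one possible refinement, assuming whole-word, case-insensitive matching is what is wanted, is shown below. The example posts are made up for illustration, and the lexicon words are assumed to contain no regex metacharacters.

```r
library(stringr)

# Made-up posts chosen to show the edge cases
dat <- data.frame(Post = c(
  "Green is my favourite colour",
  "We built a greenhouse",
  "This is green AND eco-friendly!"))
green_words <- c("green", "eco-friendly")

# Wrap the alternatives in \b...\b so only whole words match,
# and ignore case so that "Green" is counted as well
pattern <- regex(
  paste0("\\b(", paste(green_words, collapse = "|"), ")\\b"),
  ignore_case = TRUE
)

dat$n_green_words <- str_count(dat$Post, pattern)
dat
```

With this pattern the first post counts one match, the "greenhouse" post counts none, and the last post counts two. Whether partial matches such as "greenhouse" should count is a judgment call that depends on the lexicon.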

【Comments】: