Count the frequency of n-grams in text using R: answers

【Question title】: Count frequency of n-gram in text using r
【Posted】: 2016-04-12 04:12:42
【Question description】:

I am using R to read in text. A document consists of 100 sentences, which I put into a list that looks like this:

[[1]]

[1] "WigWagCo: For #TBT here's a video of Travis McCollum (Co-Founder and COO of WigWag) at #SXSW2016

[[2]]

[1] "chrisreedfilm: RT @hammertonail: #SXSW2016 doc THE SEER: A PORTRAIT OF WENDELL BERRY gets reviewed by @chrisreedfilm 

[[3]]

[1] "iamscottrandell: RT @therevue: Take a jaunt down #MemoriesofSXSW &amp; read the stories of @JRNelsonMusic @thegillsmusic &amp; @TheBlancosMusic 
...
...

[[99]]

[1] "SunPowerTalent: SunPower #Clerical #Job: Supply Chain Intern (#Austin, TX) 

[[100]]

[1] "SunPowerTalent: #Finance #Job alert: General Ledger Accountant | SunPower

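Each object in the list is a "sentence" from the same text. How can I count the frequency of every 3-gram in this text and know which sentence each 3-gram came from?

Thank you very much.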

【Comments】:

Tags: r text n-gram

【Solution 1】:

You can use the R package textcat (https://CRAN.R-project.org/package=textcat) for this. If your list of 100 sentences is called x, you just need to do:

library("textcat")
n3gram <- textcat_profile_db(x, n = 3)

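The result is a list of 100 elements (one per original sentence), each containing that sentence's 3-grams sorted by frequency. For more details and examples, see ?textcat_profile_db.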
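If what you need is word-level 3-grams together with the index of the sentence each one came from, a base R approach may be easier to adapt. The following is only a minimal sketch and does not use textcat: the three-sentence list x is a made-up stand-in for the real data, word_ngrams is a hypothetical helper name, and splitting on whitespace after lowercasing is an assumed tokenization that may not suit tweets with punctuation, hashtags, or URLs.

```r
# Hypothetical stand-in for the 100-element list from the question.
x <- list(
  "the quick brown fox jumps",
  "the quick brown dog sleeps",
  "a quick brown fox"
)

# Hypothetical helper: all word n-grams of one sentence as a character vector.
# Assumes lowercasing and whitespace splitting are an acceptable tokenization.
word_ngrams <- function(sentence, n = 3) {
  words <- strsplit(tolower(sentence), "\\s+")[[1]]
  words <- words[nzchar(words)]              # drop empty tokens
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# One row per 3-gram occurrence, tagged with the index of its sentence.
grams <- lapply(x, word_ngrams, n = 3)
occurrences <- data.frame(
  sentence = rep(seq_along(grams), lengths(grams)),
  ngram    = unlist(grams),
  stringsAsFactors = FALSE
)

# Overall frequency of each 3-gram, most frequent first.
freq <- sort(table(occurrences$ngram), decreasing = TRUE)

# For each 3-gram, the indices of the sentences it occurs in.
sources <- split(occurrences$sentence, occurrences$ngram)
```

Here freq gives the counts over the whole text, and sources[["quick brown fox"]] lists the sentences that contain that 3-gram. Keeping one row per occurrence in occurrences means both summaries come from the same table, so they cannot drift apart.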

【Comments】: