【Posted】: 2017-07-13 02:02:19
【Question】:
I am trying to identify and aggregate synonyms in a given dataset. See the sample data below.
library(tm)
library(SnowballC)
dataset <- c("dad glad accept large admit large accept dad big large big accept big accept dad dad Happy dad accept glad papa dad Happy dad glad dad dad papa admit Happy big accept accept big accept dad Happy admit Happy Happy glad Happy dad accept accept large daddy large accept large large large big daddy accept admit dad admit daddy dad admit dad admit Happy accept accept Happy daddy accept admit")
docs <- Corpus(VectorSource(dataset))
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
sort(rowSums(m),decreasing=TRUE)
Result:
accept    dad  happy  admit  large    big  daddy   glad   papa
    15     14      9      8      8      6      4      4      2
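I want to look up synonyms for each of the words above using the wordnet package, which I have downloaded and installed. For example, to get the synonyms of "accept", I can do this: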
library(wordnet)
setDict("C:/Program Files (x86)/WordNet/2.1/dict")
filter <- getTermFilter("ExactMatchFilter", "accept", TRUE)
terms <- getIndexTerms("VERB", 1, filter)
getSynonyms(terms[[1]])
Result:
[1] "accept" "admit" "assume" "bear" "consent" "go for" "have" "live with"
[9] "swallow" "take" "take on" "take over"
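Now I want to combine the two result sets so that synonyms are grouped as follows: flag the most frequent word in each group (rank 1), then group by those words, something like:
id  word    word_count  syn_group  rank
1   accept  15          1          1
5   admit   8           1          2
2   dad     14          2          1
8   daddy   4           2          2
9   papa    2           2          3
3   happy   9           3          1
7   glad    4           3          2
4   large   8           4          1
6   big     6           4          2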
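The grouping and ranking step described above could be sketched in base R roughly as follows. This is only one possible approach (a greedy pass over the words in decreasing order of frequency), not a definitive method. The `syn_lookup` list and the `group_synonyms` helper are hypothetical names; the list is hand-written for illustration and stands in for real WordNet lookups. The `id` column here is simply the position in the sorted counts, so it will not match the ids in the table above exactly.

```r
# Word counts copied from the output shown in the question.
counts <- c(accept = 15, dad = 14, happy = 9, admit = 8, large = 8,
            big = 6, daddy = 4, glad = 4, papa = 2)

# Hypothetical, hand-written stand-in for WordNet results (NOT real output):
# one character vector of synonyms per word.
syn_lookup <- list(
  accept = c("admit"),
  dad    = c("daddy", "papa"),
  happy  = c("glad"),
  large  = c("big")
)

group_synonyms <- function(counts, syn_lookup) {
  # Most frequent words first; ties keep their original order.
  counts <- counts[order(-counts)]
  words <- names(counts)
  syn_group <- setNames(rep(NA_integer_, length(words)), words)
  next_group <- 1L
  for (w in words) {
    if (!is.na(syn_group[[w]])) next          # already absorbed by a group
    # The word itself plus any of its synonyms that occur in the data.
    members <- intersect(c(w, syn_lookup[[w]]), words)
    members <- members[is.na(syn_group[members])]
    syn_group[members] <- next_group
    next_group <- next_group + 1L
  }
  out <- data.frame(id = seq_along(words),
                    word = words,
                    word_count = as.numeric(counts),
                    syn_group = as.integer(syn_group),
                    stringsAsFactors = FALSE)
  # Rank within each group: 1 = most frequent word of the group.
  out$rank <- as.integer(ave(-out$word_count, out$syn_group,
                             FUN = function(x) rank(x, ties.method = "first")))
  out[order(out$syn_group, out$rank), ]
}

grouped <- group_synonyms(counts, syn_lookup)
print(grouped)
```

Because the pass is greedy, a word that is a synonym of two different group heads ends up in whichever group is formed first; whether that is acceptable depends on the data.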
which could then be aggregated like this:
id  word    word_count
1   accept  15+8
2   dad     14+4+2
3   happy   9+4
4   large   8+6
so the final result would be:
id  word    word_count
1   accept  23
2   dad     20
3   large   14
4   happy   13
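As a minimal sketch of the aggregation step, assuming the grouped table already exists as a data frame (typed in by hand here from the table in the question; the variable names `grouped`, `totals`, `heads` and `result` are illustrative), base R's `aggregate` and `merge` are enough to sum the counts per group and label each group with its rank-1 word:

```r
# Grouped table from the question, entered by hand for illustration.
grouped <- data.frame(
  word       = c("accept", "admit", "dad", "daddy", "papa",
                 "happy", "glad", "large", "big"),
  word_count = c(15, 8, 14, 4, 2, 9, 4, 8, 6),
  syn_group  = c(1, 1, 2, 2, 2, 3, 3, 4, 4),
  rank       = c(1, 2, 1, 2, 3, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Sum the counts within each synonym group.
totals <- aggregate(word_count ~ syn_group, data = grouped, FUN = sum)

# The rank-1 word of each group serves as the group's label.
heads <- grouped[grouped$rank == 1, c("syn_group", "word")]

# Attach the label to the totals and sort by total count, largest first.
result <- merge(heads, totals, by = "syn_group")
result <- result[order(-result$word_count), ]
result <- data.frame(id = seq_len(nrow(result)),
                     word = result$word,
                     word_count = result$word_count,
                     stringsAsFactors = FALSE)
print(result)
```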
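I have run into several problems, including getting getIndexTerms to iterate over the words regardless of whether they are nouns, verbs, and so on. I hope this all makes sense. Any help would be greatly appreciated. Thanks.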
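One way to handle the part-of-speech problem might be to try every tag in turn and pool whatever comes back. The sketch below assumes the wordnet package's `synonyms(word, pos)` helper (which wraps the `getTermFilter` / `getIndexTerms` / `getSynonyms` calls shown above) and the four tags "NOUN", "VERB", "ADJECTIVE" and "ADVERB". The names `pos_tags` and `all_synonyms` and the `lookup` argument are hypothetical, added so the function can be tried with a stand-in lookup when WordNet is not installed.

```r
# Part-of-speech tags to try for every word.
pos_tags <- c("NOUN", "VERB", "ADJECTIVE", "ADVERB")

# Collect the synonyms of `word` across all parts of speech.
# `lookup` is any function(word, pos) returning a character vector;
# it defaults to wordnet::synonyms, which needs setDict() to have been
# called first, as in the question.
all_synonyms <- function(word, lookup = wordnet::synonyms) {
  found <- lapply(pos_tags, function(pos) lookup(word, pos))
  # Pool the per-tag results, lower-case them and drop duplicates.
  sort(unique(tolower(unlist(found))))
}

# Stand-in lookup for trying the function without WordNet (made-up data).
fake_lookup <- function(word, pos) {
  if (pos == "VERB") c("accept", "admit") else character(0)
}
print(all_synonyms("accept", lookup = fake_lookup))
```

With WordNet set up, something like `lapply(c("accept", "dad", "happy"), all_synonyms)` would then give one synonym vector per word. Pooling all parts of speech can merge senses that are unrelated, so it may be worth restricting `pos_tags` if the groups come out too broad.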
【Comments】:
- FYI: go to Wordnet.princeton.edu and download the WordNet version for your operating system. Once it is installed, you can pick up GraveDigger's code from the line after library(wordnet).