Probability count/frequency of related words? Answers

【Question title】: Probability count/frequency of related words?
【Posted】: 2012-10-01 07:08:34
【Question description】:

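I'm looking for a way to generate a numerical probability value for single words that share a common root word/meaning.

Users will be generating content using words like "dancer", "dancing", and "danced".

If "dancer" is submitted 30 times and "dancing" 5 times, I need just one value, "dance:35", that captures all of them.

But when users also submit words like "accordance", it should not affect my "dance" count. Instead it should add to a separate count, together with words like "according" and "accordingly".

Also, I don't have a predefined list of root words to look up. I need to build it dynamically from the user-generated content.

So my question is: what is the best way to do this? I'm sure there won't be a perfect solution, but I figure someone here can probably come up with a better approach than me.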

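My idea so far is to assume that most meaningful words have at least 3 or 4 letters. So for every word I encounter that is longer than 4 letters, I trim it down to 4 ("dancers" becomes "danc") and check my word list to see whether I've encountered it before. If so, I increment its count; if not, I add it to the list. Repeat.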
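A minimal sketch of the truncation idea described above, assuming a prefix length of 4 and case-insensitive matching. The names `truncated_counts` and `prefix_len` are illustrative only and do not come from the question.

```python
from collections import Counter


def truncated_counts(words, prefix_len=4):
    """Count words by their first `prefix_len` letters.

    Words that are already `prefix_len` letters or shorter are kept whole,
    because slicing past the end of a string returns the whole string.
    """
    counts = Counter()
    for word in words:
        counts[word.lower()[:prefix_len]] += 1
    return counts


# Sample submissions using the numbers from the question.
submitted = ["dancer"] * 30 + ["dancing"] * 5
submitted += ["accordance", "according", "accordingly"]

counts = truncated_counts(submitted)
print(counts["danc"])  # 35
print(counts["acco"])  # 3
```

This builds the key list dynamically, as the question requires, but any unrelated word that happens to share the first four letters ends up in the same bucket.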

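I see there are some similar questions on here, but I haven't found any answer that both considers roots and can be implemented in Python. The answers seem to cover one or the other.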

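【Question discussion】:

You are only assuming suffixes. What about prefixes? E.g. immature, unreal, unfinished, etc. Maybe you can have a look at nltk.org

Tags: python string comparison counter frequency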

【Solution 1】:

There's no need for a Python wrapper around a Java library: nltk has Snowball! :)

>>> from nltk.stem import SnowballStemmer as SS
>>> stemmer = SS('english')
>>> stemmer.stem('dance')
u'danc'
>>> stemmer.stem('danced')
u'danc'
>>> stemmer.stem('dancing')
u'danc'
>>> stemmer.stem('dancer')
u'dancer'
>>> stemmer.stem('accordance')
u'accord'

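Stemming won't always give you the exact root, but it's a great start.

Below is an example that uses stems. I'm building a dictionary of stem: (word, count), choosing the shortest word seen for each stem. So ['dancing', 'danced', 'dances', 'dance', 'dancer'] converts to {'danc': ('dance', 4), 'dancer': ('dancer', 1)}

Example code (text taken from http://en.wikipedia.org/wiki/Dance):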

import re
from nltk.stem import SnowballStemmer as SS

text = """Dancing has evolved many styles. African dance is interpretative.
Ballet, ballroom (such as the waltz), and tango are classical styles of dance
while square dancing and the electric slide are forms of step dances.
More recently evolved are breakdancing and other forms of street dance,
often associated with hip hop culture.
Every dance, no matter what style, has something in common.
It not only involves flexibility and body movement, but also physics.
If the proper physics are not taken into consideration, injuries may occur."""
#extract words
words = [word.lower() for word in re.findall(r'\w+',text)]

stemmer = SS('english')
counts = dict()

#count stems and extract shortest words possible
for word in words:
    stem = stemmer.stem(word)
    if stem in counts:
        shortest,count = counts[stem]
        if len(word) < len(shortest):
            shortest = word
        counts[stem] = (shortest,count+1)
    else:
        counts[stem]=(word,1)

#convert {key: (word, count)} to [(word, count, key)] for convenient sort and print
output = [wordcount + (root,) for root,wordcount in counts.items()]
#trick to sort output by count (descending) & word (alphabetically)
output.sort(key=lambda x: (-x[1],x[0]))
for item in output:
    print('%s:%d (Root: %s)' % item)

Output:

dance:7 (Root: danc)
and:4 (Root: and)
are:4 (Root: are)
of:3 (Root: of)
style:3 (Root: style)
the:3 (Root: the)
evolved:2 (Root: evolv)
forms:2 (Root: form)
has:2 (Root: has)
not:2 (Root: not)
physics:2 (Root: physic)
african:1 (Root: african)
also:1 (Root: also)
as:1 (Root: as)
associated:1 (Root: associ)
ballet:1 (Root: ballet)
ballroom:1 (Root: ballroom)
body:1 (Root: bodi)
breakdancing:1 (Root: breakdanc)
---truncated---
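Since the question asks for a probability value rather than a raw count, the same grouping can be extended with a relative frequency (count divided by total words). The following is only a sketch for Python 3: `count_by_stem` and its `stem_fn` parameter are hypothetical names, and `toy_stem` is a deliberately crude stand-in so the snippet runs without nltk. In practice one would pass `SnowballStemmer('english').stem` instead.

```python
from collections import Counter


def count_by_stem(words, stem_fn):
    """Group words by stem_fn(word).

    Returns {stem: (shortest_word, count, share_of_all_words)}.
    """
    counts = Counter()
    shortest = {}
    for word in words:
        stem = stem_fn(word)
        counts[stem] += 1
        # Remember the shortest surface form seen for this stem.
        if stem not in shortest or len(word) < len(shortest[stem]):
            shortest[stem] = word
    total = sum(counts.values())
    return {stem: (shortest[stem], n, n / total) for stem, n in counts.items()}


def toy_stem(word):
    """Crude suffix stripper, for demonstration only (NOT Snowball)."""
    for suffix in ("ing", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


result = count_by_stem(["dancing", "danced", "dances", "dance", "dancer"], toy_stem)
print(result["danc"])    # ('dance', 4, 0.8)
print(result["dancer"])  # ('dancer', 1, 0.2)
```

Passing the stemmer in as a function keeps the counting logic independent of whichever stemming library is used.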

I wouldn't recommend lemmatization for your specific needs:

>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('dance')
'dance'
>>> lmtzr.lemmatize('dancer')
'dancer'
>>> lmtzr.lemmatize('dancing')
'dancing'
>>> lmtzr.lemmatize('dances')
'dance'
>>> lmtzr.lemmatize('danced')
'danced'

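Substrings are not a good idea, because they will always fail at some point, and often fail badly.

Fixed length: the pseudo-words "dancitization" and "dancendence" will match "dance" at 4 and 5 characters respectively.
Ratio: a low ratio will return false matches (as above).
Ratio: a high ratio will not match enough (e.g. "running").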
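The failure modes listed above can be reproduced with a few lines of plain Python. This is a sketch under one assumption: "ratio" is read here as the length of the shared prefix divided by the length of the longer word, which the answer does not spell out. The helper names `shared_prefix_len` and `same_group` are made up for illustration.

```python
def shared_prefix_len(a, b):
    """Number of leading characters that a and b have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def same_group(a, b, min_ratio):
    """Treat a and b as related if their shared prefix covers at least
    `min_ratio` of the longer word (assumed meaning of "ratio")."""
    return shared_prefix_len(a, b) / max(len(a), len(b)) >= min_ratio


# Fixed length: the pseudo-words collide with "dance".
print("dancitization"[:4] == "dance"[:4])  # True
print("dancendence"[:5] == "dance"[:5])    # True

# Low ratio: the pseudo-word is still accepted as a relative of "dance".
print(same_group("dance", "dancitization", 0.3))  # True

# High ratio: a genuine pair such as "run" / "running" is rejected.
print(same_group("run", "running", 0.5))  # False
```

No single threshold separates both cases here: the bogus pair shares 4 of 13 characters, while the genuine pair shares only 3 of 7.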

But with stemming you get:

>>> stemmer.stem('dancitization')
u'dancit'
>>> stemmer.stem('dancendence')
u'dancend'
>>> #since dancitization gives us dancit, let's try dancization to get danc
>>> stemmer.stem('dancization')
u'dancize'
>>> stemmer.stem('dancation')
u'dancat'

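None of these pseudo-words collapses to the stem "danc", which is an impressive result. Even considering that 'dancer' does not stem to 'danc', the overall accuracy is pretty high.

I hope this helps you get started.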

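【Solution 2】:

What you are looking for is also known as the stem of a word (a more technical notion than the linguistic "root"). You are right to assume there is no perfect solution: every approach either analyses imperfectly or lacks coverage. Basically, the best way to go is either a word list that contains stems or a stemming algorithm. See the first answer here for a Python-based solution: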

How do I do word Stemming or Lemmatization?
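For comparison, the word-list option mentioned above could look roughly like the following. This is only a sketch: `STEM_LIST` is a tiny hand-made mapping invented for illustration, and `count_with_word_list` with its `fallback` parameter is a hypothetical helper, not part of any library.

```python
from collections import Counter

# Hypothetical word list mapping surface forms to stems (illustrative only).
STEM_LIST = {
    "dance": "dance", "dancer": "dance", "dancing": "dance", "danced": "dance",
    "accordance": "accord", "according": "accord", "accordingly": "accord",
}


def count_with_word_list(words, stem_list, fallback=None):
    """Count words by the stem found in `stem_list`.

    Words missing from the list go through `fallback` (for example a
    stemmer's stem function) if one is given, else they count as themselves.
    """
    counts = Counter()
    for word in words:
        word = word.lower()
        stem = stem_list.get(word)
        if stem is None:
            stem = fallback(word) if fallback else word
        counts[stem] += 1
    return counts


counts = count_with_word_list(["Dancer", "dancing", "accordance", "tango"], STEM_LIST)
print(counts["dance"])   # 2
print(counts["accord"])  # 1
print(counts["tango"])   # 1
```

The lookup is exact for listed words, but "tango" shows the coverage problem: anything missing from the list is only as good as the fallback.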

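I use Snowball in all my Java-based projects and it works perfectly for my purposes (it is also very fast and covers a wide range of languages). It also seems to have a Python wrapper: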

http://snowball.tartarus.org/download.php
