Should I count repeated words when computing sentence scores in document analysis?

【Question title】: Should I count repeated words when I compute the score of sentences in document analysis?
【Posted】: 2015-02-02 01:28:41
【Question】:

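I am learning to analyze documents with Python from a reference book. Some of the code in the book confuses me. The code in question is from: Example 6. A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences

What confuses me is this: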

    s = [...words of sentence...]
    word_idx = []

    # For each word in the word list...
    for w in important_words:
        try:
            # Compute an index for where any important words occur in the sentence.
            # Note: list.index() returns only the first occurrence of w.
            word_idx.append(s.index(w))
        except ValueError:  # w not in this particular sentence
            pass

    word_idx.sort()

Why not use this instead:

    # Record the position of every token that is an important word.
    for i, w in enumerate(s):
        if w in important_words:
            word_idx.append(i)

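The difference between them is that the former does not count repeated words and the latter does. For example:

    s = [u'fes', u'watch', u'\u2014', u'e-paper', u'watch', u',', u'including', u'strap', u'.']

The former prints [0, 1, 2, 3, 5, 6, 7, 8] and the latter prints [0, 1, 2, 3, 4, 5, 6, 7, 8].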
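The two outputs above can be reproduced with a short script. This is only a sketch: the question does not show the actual important_words list, so it assumes that every distinct token of s is important. The function names are made up for illustration.

```python
# Assumption: every distinct token of s counts as "important"; the question
# does not show the real important_words list.
s = ['fes', 'watch', '\u2014', 'e-paper', 'watch', ',', 'including', 'strap', '.']
important_words = list(dict.fromkeys(s))  # distinct tokens, first-seen order


def first_occurrence_indices(sentence, important):
    """Book-style lookup: at most one index per important word."""
    idx = []
    for w in important:
        try:
            idx.append(sentence.index(w))  # index() returns the first match only
        except ValueError:  # w does not occur in this sentence
            pass
    return sorted(idx)


def all_occurrence_indices(sentence, important):
    """Alternative from the question: one index per matching token."""
    return [i for i, w in enumerate(sentence) if w in important]


book_idx = first_occurrence_indices(s, important_words)
alt_idx = all_occurrence_indices(s, important_words)
print(book_idx)  # [0, 1, 2, 3, 5, 6, 7, 8]
print(alt_idx)   # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Index 4 is missing from the first result because s.index(u'watch') always returns 1, the position of the first match.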

So, should I count repeated words when I compute the score of a sentence?

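【Comments】:

I'm confused why you get ..., 5, 6, ... and not ..., 4, 6, ... :-)
@AaronDigulla Sorry I wasn't clear. s[4] = u'watch', but s[1] = u'watch' as well, so the former algorithm never appends 4 to word_idx.
Oops, you're right. I got the indices wrong while counting in my head.

Tags: python algorithm nlp data-analysis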

【Answer 1】:

If you read further in Example 6. A document summarization algorithm based principally upon sentence detection and frequency analysis within sentences, you will see the following lines:

    for c in clusters:
        significant_words_in_cluster = len(c)
        total_words_in_cluster = c[-1] - c[0] + 1
        score = 1.0 * significant_words_in_cluster \
            * significant_words_in_cluster / total_words_in_cluster

Here c is the list you compared:

"The former prints [0, 1, 2, 3, 5, 6, 7, 8] and the latter prints [0, 1, 2, 3, 4, 5, 6, 7, 8]."

total_words_in_cluster is the same for both lists, but significant_words_in_cluster differs. That changes the cluster's score.
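To make the difference concrete, here is a small sketch that applies the quoted score formula to both lists. It assumes each list ends up as a single cluster; cluster_score is a helper name introduced here, not one from the book.

```python
def cluster_score(cluster):
    """Score formula from the quoted excerpt, for one cluster of word indices."""
    significant_words_in_cluster = len(cluster)
    total_words_in_cluster = cluster[-1] - cluster[0] + 1
    return (1.0 * significant_words_in_cluster
            * significant_words_in_cluster / total_words_in_cluster)


first_only = [0, 1, 2, 3, 5, 6, 7, 8]      # repeated word skipped
all_hits = [0, 1, 2, 3, 4, 5, 6, 7, 8]     # repeated word counted

print(cluster_score(first_only))  # 64 / 9, roughly 7.11
print(cluster_score(all_hits))    # 81 / 9 = 9.0
```

Both clusters span 9 tokens, so only the numerator changes: 8 significant words versus 9.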

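How the author builds the cluster list is a design choice that feeds into the later calculation. Your approach seems legitimate in its own right, I think.

By the way, the way the clusters are created,

    if word_idx[i] - word_idx[i - 1] < CLUSTER_THRESHOLD:
        cluster.append(word_idx[i])

also depends on the values in the word_idx list...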
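The following sketch shows why the contents of word_idx matter for clustering. Only the gap condition comes from the excerpt above; the surrounding loop, the build_clusters name, and the threshold value of 5 are assumptions made for illustration.

```python
CLUSTER_THRESHOLD = 5  # assumed value, chosen only for this illustration


def build_clusters(word_idx, threshold=CLUSTER_THRESHOLD):
    """Group sorted indices; a gap of `threshold` or more starts a new cluster."""
    if not word_idx:
        return []
    clusters = [[word_idx[0]]]
    for prev, cur in zip(word_idx, word_idx[1:]):
        if cur - prev < threshold:
            clusters[-1].append(cur)   # close enough: extend the current cluster
        else:
            clusters.append([cur])     # gap too large: start a new cluster
    return clusters


print(build_clusters([0, 4, 8]))  # [[0, 4, 8]]
print(build_clusters([0, 8]))     # [[0], [8]]
```

In this toy input, dropping the middle index (as could happen when a repeated word is skipped) splits one cluster into two, which in turn changes every score computed from them.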

【Comments】:

Yes, that is exactly why I was confused. I do think we need to count repeated words when processing documents.

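【Answer 2】:

My feeling is that the first algorithm is inefficient when you have many important_words, because it scans the sentence once per important word. Your algorithm is more efficient, but it counts a repeated word every time it occurs, which seems incorrect. Should watch watch watch watch get a higher score than Bobby has a watch?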
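The two example sentences can be scored both ways to see what is at stake. This is a sketch under two assumptions: 'watch' is the only important word, and each sentence forms a single cluster. The helper names are hypothetical; the formula is the one from the book excerpt quoted in the other answer.

```python
def cluster_score(indices):
    """Score formula from the quoted excerpt, applied to one cluster of indices."""
    significant = len(indices)
    total = indices[-1] - indices[0] + 1
    return 1.0 * significant * significant / total


def every_hit(sentence, important):
    """Index of every token that is an important word (repeats counted)."""
    return [i for i, w in enumerate(sentence) if w in important]


def first_hit_only(sentence, important):
    """First index of each important word that occurs (repeats ignored)."""
    return sorted(sentence.index(w) for w in important if w in sentence)


important = {'watch'}  # assumption: the only important word
repeated = 'watch watch watch watch'.split()
plain = 'Bobby has a watch'.split()

for name, sentence in (('repeated', repeated), ('plain', plain)):
    print(name,
          cluster_score(every_hit(sentence, important)),
          cluster_score(first_hit_only(sentence, important)))
# repeated 4.0 1.0
# plain 1.0 1.0
```

Counting repeats scores the repeated sentence four times higher; ignoring repeats scores both sentences the same.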

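The answer depends on your needs. There is no "best" solution for natural-text analysis. When analyzing HTML pages, Google's needs differ from an archaeologist's.

So I think this boils down to: is there a more efficient algorithm that produces the same results as the one in the book?

Yes: use your code and put the words into a set() to remove duplicates:

    s = set(s)

Depending on your Python version, this step may reorder the words, but I don't think that matters, because the code in the book does not use s after the first loop.

If order matters, you need to filter the list.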
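A set carries no positions, so the index-based loop from the question cannot run on it directly. One possible reading of "filter the list" is to keep the list and skip words that have already been seen. The sketch below does that in a single pass; the function name is made up, and treating every token as important is an assumption for the demo.

```python
def first_occurrence_indices_single_pass(sentence, important_words):
    """Return the index of the first occurrence of each important word."""
    important = set(important_words)  # set membership tests are O(1) on average
    seen = set()
    idx = []
    for i, w in enumerate(sentence):
        if w in important and w not in seen:
            seen.add(w)    # later repeats of w are ignored
            idx.append(i)
    return idx             # ascending already, so no sort is needed


s = ['fes', 'watch', '\u2014', 'e-paper', 'watch', ',', 'including', 'strap', '.']
# Assumption for the demo: every token in s is an important word.
print(first_occurrence_indices_single_pass(s, s))  # [0, 1, 2, 3, 5, 6, 7, 8]
```

This gives the same indices as the book's lookup for this sentence while visiting each token once and preserving word order.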

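【Comments】:

I suspect this algorithm is flawed; I think we should count repeated words. I also think order matters, because the line total_words_in_cluster = c[-1] - c[0] + 1 depends on it, so I don't think s = set(s) will work.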