Finding unusual phrases using a "bag of usual phrases": answers

【Question title】: Finding unusual phrases using a "bag of usual phrases"
【Posted】: 2018-02-22 22:02:05
【Question description】:

My goal is to input an array of phrases such as

array = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.","At vero eos et accusam et justo duo dolores et ea rebum.","Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]

and then present it with a new phrase, for example

"Felix qui potuit rerum cognoscere causas"

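I want it to tell me whether this phrase is likely to belong to the group in the array above.

I found out how to detect word frequencies, but how do I measure dissimilarity? After all, my goal is to find unusual phrases, not the frequency of certain words.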

【Question comments】:

For English phrases (I am not sure about other languages), you could use wordnet from nltk.corpus... Which language are you asking about?

Tags: python python-3.x pandas scikit-learn text-mining

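【Solution 1】:

You can build a simple "language model" for this. It estimates the probability of a phrase and flags phrases with a low mean per-word probability as unusual.

To estimate word probabilities, it can use smoothed word counts.

This is what the model looks like: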
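Written out, the smoothing used by the model in this answer can be summarized as follows. The symbols are chosen here for illustration: c(w) is the training count of word w, N the total number of training tokens, V the number of distinct training words, and δ the smoothing parameter `delta`.

```latex
P(w) = \frac{c(w) + \delta}{N + V\delta},
\qquad
\mathrm{score}(w_1, \dots, w_n) = -\frac{1}{n} \sum_{i=1}^{n} \log P(w_i)
```

An unseen word has c(w) = 0 and therefore gets probability δ / (N + Vδ), so a smaller δ makes unseen words count as more unusual.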

import re
import numpy as np
from collections import Counter

class LanguageModel:
    """ A simple model to measure 'unusualness' of sentences.
    delta is a smoothing parameter.
    The smaller delta is, the higher is the penalty for unseen words.
    """
    def __init__(self, delta=0.01):
        self.delta = delta
    def preprocess(self, sentence):
        # lowercase, keep letters only, and drop tokens that end up empty
        words = (re.sub(r"[^a-z]+", '', word) for word in sentence.lower().split())
        return [word for word in words if word]
    def fit(self, corpus):
        """ Estimate counts from an array of texts """
        self.counter_ = Counter(word 
                                 for sentence in corpus 
                                 for word in self.preprocess(sentence))
        self.total_count_ = sum(self.counter_.values())
        self.vocabulary_size_ = len(self.counter_)
    def perplexity(self, sentence):
        """ Calculate negative mean log probability of a word in a sentence.
        Strictly speaking this is a per-word cross-entropy; perplexity would be its exponential.
        The higher this number, the more unusual the sentence is.
        """
        words = self.preprocess(sentence)
        mean_log_proba = 0.0
        for word in words:
            # use a smoothed version of "probability" to work with unseen words
            word_count = self.counter_.get(word, 0) + self.delta
            total_count = self.total_count_ + self.vocabulary_size_ * self.delta
            word_probability = word_count / total_count
            mean_log_proba += np.log(word_probability) / len(words)
        return -mean_log_proba

    def relative_perplexity(self, sentence):
        """ Perplexity, normalized between 0 (the most usual sentence) and 1 (the most unusual)"""
        return (self.perplexity(sentence) - self.min_perplexity) / (self.max_perplexity - self.min_perplexity)

    @property
    def max_perplexity(self):
        """ Perplexity of an unseen word """
        return -np.log(self.delta / (self.total_count_ + self.vocabulary_size_ * self.delta))

    @property
    def min_perplexity(self):
        """ Perplexity of the most likely word """
        return self.perplexity(self.counter_.most_common(1)[0][0])

You can train this model and apply it to different sentences.

train = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
         "At vero eos et accusam et justo duo dolores et ea rebum.",
         "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
test = ["Felix qui potuit rerum cognoscere causas",  # an "unlikely" phrase
        "sed diam nonumy eirmod sanctus sit amet",   # a "likely" phrase
        ]

lm = LanguageModel()
lm.fit(train)

for sent in test:
    print(lm.perplexity(sent).round(3), sent)

This prints

8.525 Felix qui potuit rerum cognoscere causas
3.517 sed diam nonumy eirmod sanctus sit amet

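You can see that the "unusualness" of the first phrase is higher than that of the second, because the second phrase consists entirely of words from the training data, while every word of the first is unseen.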

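If your corpus of "usual" phrases is large enough, you can switch from the 1-gram model I used to N-grams (for English, a reasonable N is 2 or 3). Alternatively, you can use a recurrent neural network to predict the probability of each word conditioned on all previous words. But that requires a really huge training corpus.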
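As a rough sketch of the N-gram idea, a bigram variant could condition each word on the previous one while keeping the same additive smoothing. The class name `BigramModel`, the `<s>` start marker, and the `score` method are assumptions made for illustration; they are not part of the original answer.

```python
import math
import re
from collections import Counter


class BigramModel:
    """Hypothetical bigram counterpart of the unigram model (illustrative only)."""

    def __init__(self, delta=0.01):
        self.delta = delta

    def preprocess(self, sentence):
        # Lowercase, keep letters only, drop empty tokens, prepend a start marker.
        words = (re.sub(r"[^a-z]+", "", w) for w in sentence.lower().split())
        return ["<s>"] + [w for w in words if w]

    def fit(self, corpus):
        self.context_counts_ = Counter()  # how often a word is followed by anything
        self.bigram_counts_ = Counter()   # how often (previous, word) occurs
        vocabulary = set()
        for sentence in corpus:
            tokens = self.preprocess(sentence)
            vocabulary.update(tokens[1:])
            for prev, word in zip(tokens, tokens[1:]):
                self.context_counts_[prev] += 1
                self.bigram_counts_[prev, word] += 1
        self.vocabulary_size_ = len(vocabulary)
        return self

    def score(self, sentence):
        """Negative mean log probability of each word given the previous word."""
        tokens = self.preprocess(sentence)
        pairs = list(zip(tokens, tokens[1:]))
        if not pairs:
            return 0.0
        total = 0.0
        for prev, word in pairs:
            numerator = self.bigram_counts_[prev, word] + self.delta
            denominator = self.context_counts_[prev] + self.vocabulary_size_ * self.delta
            total += math.log(numerator / denominator)
        return -total / len(pairs)


train = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
         "At vero eos et accusam et justo duo dolores et ea rebum.",
         "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
bm = BigramModel().fit(train)
usual = bm.score("Lorem ipsum dolor sit amet")     # word order seen in training
shuffled = bm.score("amet sit dolor ipsum Lorem")  # same words, unseen order
print(usual, shuffled)
```

Unlike the 1-gram model, which scores both phrases identically because they contain the same words, this sketch rates the shuffled phrase as more unusual.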

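If you work with a highly inflected language (such as Turkish), you can use character-level N-grams instead of a word-level model, or simply preprocess your texts with a lemmatization algorithm from NLTK.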
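To illustrate the character-level alternative, the sketch below counts character trigrams instead of whole words, so an unseen inflected form still shares most of its n-grams with forms seen in training. The function names, `n=3`, the `delta` default, and the tiny word lists are assumptions chosen for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return overlapping character n-grams of the lowercased text, padded with spaces."""
    padded = " " * (n - 1) + text.lower() + " "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def fit_char_model(corpus, n=3):
    """Count character n-grams over all training texts."""
    return Counter(gram for text in corpus for gram in char_ngrams(text, n))


def char_score(text, counts, n=3, delta=0.01):
    """Negative mean log probability per character n-gram, using smoothed counts."""
    grams = char_ngrams(text, n)
    total = sum(counts.values()) + len(counts) * delta
    log_proba = sum(math.log((counts[g] + delta) / total) for g in grams)
    return -log_proba / len(grams)


# Toy training data: a few Turkish word forms (illustrative only).
counts = fit_char_model(["evler", "evlerde", "kitaplar"])
inflected = char_score("kitaplarda", counts)  # unseen form of a seen stem
random_text = char_score("xqzvwj", counts)    # shares no trigram with training
print(inflected, random_text)
```

A word-level model would treat "kitaplarda" as completely unseen, whereas here only the trigrams introduced by the new suffix are penalized.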

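【Comments】:

It seems that random text always results in 12.124, so that appears to be the most unusual phrase, while about 4 appears to be the highest similarity. So a threshold at 8.062 might help to separate normal from unusual. What do you think?
Yes, something like that. The upper and lower bounds of unusualness (4 and 12.124) depend on the training corpus and on delta, so you should tune the threshold on the actual data you are going to use.
So I added dissimilar = lm.perplexity("jkdhfl dgksh dfkslgh dskflg dskjfgkljs dn").round(3) and similar = lm.perplexity(train[0]).round(3) to the code, followed by dissimilarity_in_per_cent = ((100/(dissimilar - similar))*lm.perplexity(sent).round(3))-((100/(dissimilar - similar))*similar). Unfortunately, I sometimes get negative percentages when testing other sentences from my training array. How can I get the lowest possible value without iterating over every sentence in train?
@VitalisHommel, I have extended my class to calculate the relative "unusualness" of any sentence.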
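The threshold tuning suggested in the comments above could be sketched roughly as follows: score a held-out set of phrases known to be usual, and take a high quantile of those scores as the cut-off. The helper names `pick_threshold` and `is_unusual`, the 0.95 quantile, and the held-out scores are made-up placeholders; only 8.525 and 3.517 come from the output shown in the answer.

```python
def pick_threshold(usual_scores, quantile=0.95):
    """Return a score at (roughly) the given quantile of known-usual phrases."""
    ordered = sorted(usual_scores)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index]


def is_unusual(score, threshold):
    """Flag a phrase whose score exceeds the tuned threshold."""
    return score > threshold


# Placeholder scores for held-out "usual" phrases (not real measurements).
held_out_scores = [3.2, 3.5, 3.9, 4.1, 4.6]
threshold = pick_threshold(held_out_scores)

flag_unseen = is_unusual(8.525, threshold)  # score of the all-unseen phrase
flag_seen = is_unusual(3.517, threshold)    # score of the phrase built from training words
print(threshold, flag_unseen, flag_seen)
```

Because the bounds depend on the corpus and on delta, the threshold would need to be recomputed whenever either changes.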

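【Solution 2】:

To find common phrases in sentences, you can use Gensim Phrase (collocation) detection.

But if you want to detect unusual phrases, you could describe some part-of-speech combination patterns with regular expressions and POS-tag the input sentences. You would then be able to extract unseen words (phrases) that match your patterns.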
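A minimal sketch of the pattern idea, assuming the sentence has already been POS-tagged (for example with `nltk.pos_tag`; the tags below are written by hand so the example needs no tagger). The function `match_pos_patterns`, the `<TAG>` notation, the adjective-noun pattern, and the set of known phrases are all hypothetical choices for illustration.

```python
import re


def match_pos_patterns(tagged_tokens, pattern):
    """Return phrases whose tag sequence matches a regex over "<TAG>" strings."""
    tag_string = "".join(f"<{tag}>" for _, tag in tagged_tokens)
    # Map the character offset where each tag starts to its token index.
    starts = {}
    offset = 0
    for index, (_, tag) in enumerate(tagged_tokens):
        starts[offset] = index
        offset += len(tag) + 2  # tag plus the two angle brackets
    starts[offset] = len(tagged_tokens)

    phrases = []
    for match in re.finditer(pattern, tag_string):
        begin, end = match.start(), match.end()
        if begin < end and begin in starts and end in starts:
            words = [word for word, _ in tagged_tokens[starts[begin]:starts[end]]]
            phrases.append(" ".join(words))
    return phrases


# Hand-tagged example sentence (Penn Treebank style tags, assumed for illustration).
tagged = [("the", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
          ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN")]

# Zero or more adjectives followed by one or more nouns.
pattern = r"(?:<JJ>)*(?:<NN>)+"
phrases = match_pos_patterns(tagged, pattern)

# Keep only the phrases that were not already seen in the "usual" collection.
known_phrases = {"lazy dog"}
unseen = [p for p in phrases if p not in known_phrases]
print(phrases, unseen)
```

The regular expression only decides which tag sequences count as candidate phrases; whether a candidate is unusual still depends on what it is compared against, here a plain set of known phrases.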

【Comments】: