You can build a simple "language model" for this. It estimates the probability of a phrase and flags phrases whose mean per-word probability is low as unusual.
For the word probabilities, it can use smoothed word counts.
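Concretely, with additive smoothing the estimated probability of a word w is P(w) = (count(w) + delta) / (N + delta * V), where N is the total number of word tokens in the training corpus and V is the vocabulary size; this is exactly what the code below computes.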
This is what the model looks like:
import re
import numpy as np
from collections import Counter


class LanguageModel:
    """ A simple model to measure the 'unusualness' of sentences.
    delta is a smoothing parameter.
    The larger delta is, the higher the penalty for unseen words.
    """
    def __init__(self, delta=0.01):
        self.delta = delta

    def preprocess(self, sentence):
        words = sentence.lower().split()
        return [re.sub(r"[^A-Za-z]+", '', word) for word in words]

    def fit(self, corpus):
        """ Estimate word counts from an array of texts """
        self.counter_ = Counter(word
                                for sentence in corpus
                                for word in self.preprocess(sentence))
        self.total_count_ = sum(self.counter_.values())
        self.vocabulary_size_ = len(self.counter_)  # number of distinct words

    def perplexity(self, sentence):
        """ Calculate the negative mean log probability of the words in a sentence.
        The higher this number, the more unusual the sentence is.
        """
        words = self.preprocess(sentence)
        mean_log_proba = 0.0
        for word in words:
            # use a smoothed version of "probability" to handle unseen words
            word_count = self.counter_.get(word, 0) + self.delta
            total_count = self.total_count_ + self.vocabulary_size_ * self.delta
            word_probability = word_count / total_count
            mean_log_proba += np.log(word_probability) / len(words)
        return -mean_log_proba

    def relative_perplexity(self, sentence):
        """ Perplexity, normalized between 0 (the most usual sentence) and 1 (the most unusual) """
        return (self.perplexity(sentence) - self.min_perplexity) / (self.max_perplexity - self.min_perplexity)

    @property
    def max_perplexity(self):
        """ Perplexity of an unseen word """
        return -np.log(self.delta / (self.total_count_ + self.vocabulary_size_ * self.delta))

    @property
    def min_perplexity(self):
        """ Perplexity of the most frequent word """
        return self.perplexity(self.counter_.most_common(1)[0][0])
You can train this model and apply it to different sentences.
train = ["Lorem ipsum dolor sit amet, consetetur sadipscing elitr, sed diam nonumy eirmod tempor invidunt ut labore et dolore magna aliquyam erat, sed diam voluptua.",
         "At vero eos et accusam et justo duo dolores et ea rebum.",
         "Stet clita kasd gubergren, no sea takimata sanctus est Lorem ipsum dolor sit amet."]
test = ["Felix qui potuit rerum cognoscere causas",  # an "unlikely" phrase
        'sed diam nonumy eirmod sanctus sit amet',   # a "likely" phrase
        ]

lm = LanguageModel()
lm.fit(train)

for sent in test:
    print(lm.perplexity(sent).round(3), sent)
This prints:
8.525 Felix qui potuit rerum cognoscere causas
3.517 sed diam nonumy eirmod sanctus sit amet
You can see that the first phrase is "more unusual" than the second, because the second is composed entirely of words from the training set.
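If you prefer a score on a fixed 0-to-1 scale, the relative_perplexity method defined above can be applied the same way:

for sent in test:
    print(lm.relative_perplexity(sent).round(3), sent)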
If your corpus of "usual" phrases is large enough, you can switch from the 1-gram model I used here to an N-gram model (for English, a reasonable N is 2 or 3). Alternatively, you could use a recurrent neural network to predict the probability of each word conditioned on all the preceding words, but that requires a really large training corpus.
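As a rough illustration of the N-gram direction, only the counting step in fit would need to change to count word pairs instead of single words; the bigram_counts helper below is a hypothetical sketch, not part of the model above:

from collections import Counter

def bigram_counts(corpus, preprocess):
    """ Count adjacent word pairs in a corpus; such counts could
    replace the unigram counter in fit() (hypothetical sketch). """
    counts = Counter()
    for sentence in corpus:
        words = preprocess(sentence)
        counts.update(zip(words, words[1:]))
    return counts

For example, bigram_counts(train, lm.preprocess) would count pairs like ('sed', 'diam').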
If you work with a morphologically rich language (like Turkish), you can use character-level N-grams instead of a word-level model, or just preprocess your texts with a lemmatization algorithm from NLTK.
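A character-level variant would only change the preprocessing step; a minimal sketch (char_ngrams is a hypothetical helper, not part of the model above):

def char_ngrams(sentence, n=3):
    """ Split a sentence into overlapping character n-grams,
    e.g. 'diam' -> ['dia', 'iam'] for n=3 (hypothetical sketch). """
    text = sentence.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]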