【Question title】: How can I calculate perplexity using nltk
【Posted】: 2019-03-01 09:48:24
【Question description】:

I am trying to do some processing on text. This is part of my code:

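import nltk
from nltk.util import ngrams

with open(train_file) as fp:
    raw = fp.read()  # read once; a second read on the same handle returns nothing
sents = raw.splitlines()
words = nltk.tokenize.word_tokenize(raw)
# pad_left/pad_right must be enabled for the pad symbols to be used
bigrams = ngrams(words, 2, pad_left=True, pad_right=True, left_pad_symbol='<s>', right_pad_symbol='</s>')
fdist = nltk.FreqDist(words)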

For an older version of nltk, I found this code on StackOverflow for computing perplexity:

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2) 
lm = NgramModel(5, train, estimator=estimator)
print("len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % ( len(corpus), len(vocabulary), len(train), len(test) ))
print("perplexity(test) =", lm.perplexity(test))

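However, this code no longer works, and I have not found any other package or function in nltk for this purpose. Should I implement it myself?

【Question discussion】:

Tags: python-3.x nltk

【Solution 1】:

Perplexity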

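Suppose we have a model that takes an English sentence as input and outputs a probability score reflecting how likely it is to be a valid English sentence. We want to determine how good this model is. A good model should give high scores to valid English sentences and low scores to invalid ones. Perplexity is a commonly used measure for quantifying how "good" such a model is. If a sentence s contains n words, its perplexity is

$$PP(s) = p(w_1, \dots, w_n)^{-\frac{1}{n}}$$

Modeling the probability distribution p (building the model)

The joint probability $p(s) = p(w_1, \dots, w_n)$ can be expanded using the chain rule of probability:

$$p(w_1, \dots, w_n) = p(w_1)\, p(w_2 \mid w_1)\, p(w_3 \mid w_1, w_2) \cdots p(w_n \mid w_1, \dots, w_{n-1})$$

So given some data (called training data), we could in principle estimate the conditional probabilities above. In practice this is not feasible, because it would require a huge amount of training data. We therefore make simplifying assumptions:

Assumption: all words are independent (unigram)

$$p(w_1, \dots, w_n) = \prod_{i=1}^{n} p(w_i)$$

Assumption: first-order Markov assumption (bigram)

The next word depends only on the previous word

$$p(w_1, \dots, w_n) = \prod_{i=1}^{n} p(w_i \mid w_{i-1})$$

Assumption: n-th order Markov assumption (ngram)

The next word depends only on the previous n words

$$p(w_1, \dots, w_n) = \prod_{i} p(w_i \mid w_{i-n}, \dots, w_{i-1})$$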
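
To make the three assumptions concrete, here is a minimal sketch (not part of the original answer) that lists, for each word, the context it would be conditioned on. The helper name `conditioning_pairs` and its `order` parameter are made up for illustration; `order=0` corresponds to the unigram case and `order=1` to the bigram case.

```python
def conditioning_pairs(words, order):
    """Return (context, word) pairs under an order-`order` Markov assumption.

    order=0 gives the unigram case (empty context), order=1 the bigram case.
    Hypothetical helper, for illustration only.
    """
    pairs = []
    for i, word in enumerate(words):
        # Keep at most `order` preceding words as the context.
        context = tuple(words[max(0, i - order):i])
        pairs.append((context, word))
    return pairs


sentence = ["<s>", "an", "apple", "</s>"]
unigram_pairs = conditioning_pairs(sentence, 0)
bigram_pairs = conditioning_pairs(sentence, 1)

print(unigram_pairs)
print(bigram_pairs)
```

Each pair corresponds to one factor in the product: the model only needs to supply a probability for the word given that (possibly empty) context.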

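Estimating probabilities with MLE

Maximum likelihood estimation (MLE) is one way to estimate the individual probabilities.

Unigram

$$p(w) = \frac{count(w)}{N}$$

where

count(w) is the number of times the word w appears in the training data
N is the total number of word tokens in the training data. Note that this is the token count, not the number of unique words (the vocabulary size).

Bigram

$$p(w_i \mid w_{i-1}) = \frac{count(w_{i-1}, w_i)}{count(w_{i-1})}$$

where

count(w_{i-1}, w_i) is the number of times the words w_{i-1}, w_i appear together, in that order, in the training data (as a bigram)
count(w_{i-1}) is the number of times the word w_{i-1} appears in the training data. w_{i-1} is called the context.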
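
The two MLE formulas amount to counting and dividing. Below is a minimal sketch in plain Python, assuming a tiny two-sentence toy corpus; the names `p_unigram` and `p_bigram` are illustrative and are not nltk functions. Returning 0 for an unseen context is a choice made for this sketch.

```python
from collections import Counter

# Toy training corpus (an assumption for illustration), already tokenized.
sentences = [["an", "apple"], ["an", "orange"]]

# Unigram counts over the raw tokens.
unigram_counts = Counter(w for sent in sentences for w in sent)
total_tokens = sum(unigram_counts.values())

# Bigram and context counts over the padded sentences.
bigram_counts = Counter()
context_counts = Counter()
for sent in sentences:
    padded = ["<s>"] + sent + ["</s>"]
    for prev, word in zip(padded, padded[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def p_unigram(word):
    """count(w) / total number of tokens."""
    return unigram_counts[word] / total_tokens


def p_bigram(word, prev):
    """count(prev, w) / count(prev); 0 if the context was never seen."""
    if context_counts[prev] == 0:
        return 0.0
    return bigram_counts[(prev, word)] / context_counts[prev]


print(p_unigram("an"), p_unigram("apple"), p_unigram("ant"))
print(p_bigram("an", "<s>"), p_bigram("apple", "an"), p_bigram("ant", "an"))
```

Any word or word pair that never occurs in the training data gets probability 0 under MLE, which is what makes the perplexity of unseen text infinite.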

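Calculating perplexity

As we have seen above, $p(s)$ is computed by multiplying many small numbers, so it is numerically unstable given the limited precision of floating-point numbers on a computer. Let's use the nice properties of log to simplify it. We know

$$PP(s) = 2^{\log_2 PP(s)} = 2^{-\frac{1}{n} \log_2 p(s)}$$

Let $l = \frac{1}{n} \log_2 p(s)$. For the unigram model, for example,

$$l = \frac{1}{n} \sum_{i=1}^{n} \log_2 p(w_i), \qquad PP(s) = 2^{-l}$$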
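
A quick way to see the numerical problem is to compare the direct product with the log-domain computation. This is a minimal sketch using only the standard library; the `perplexity` helper and the made-up per-word probability of 1e-5 are assumptions for illustration.

```python
import math


def perplexity(probs):
    """Perplexity 2**(-l), where l is the mean log2 probability of the words."""
    if any(p == 0 for p in probs):
        return math.inf  # a zero-probability word gives infinite perplexity
    l = sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** (-l)


# 100 words, each with probability 1e-5 (made-up numbers).
word_probs = [1e-5] * 100

# Multiplying directly underflows: 1e-500 is not representable as a float.
direct_product = 1.0
for p in word_probs:
    direct_product *= p

stable = perplexity(word_probs)

print(direct_product)  # the product has underflowed
print(stable)          # still well-defined in the log domain
```

Summing logs keeps every intermediate value in a comfortable range, so the perplexity stays finite even though the sentence probability itself cannot be represented.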

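Example: Unigram model

Training data: ["an apple", "an orange"] Vocabulary: [an, apple, orange, UNK]

MLE estimates

$$p(\text{an}) = \frac{2}{4} = 0.5, \quad p(\text{apple}) = \frac{1}{4} = 0.25, \quad p(\text{orange}) = \frac{1}{4} = 0.25, \quad p(\text{UNK}) = 0$$

For the test sentence "an apple"

l =  (np.log2(0.5) + np.log2(0.25))/2 = -1.5
np.power(2, -l) = 2.8284271247461903

For the test sentence "an ant" ("ant" is not in the vocabulary, so it is treated as UNK, whose MLE probability is 0)

l =  (np.log2(0.5) + np.log2(0))/2 = -inf
np.power(2, -l) = inf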

Code

import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in train_sentences]
n = 1
train_data, padded_vocab = padded_everygram_pipeline(n, tokenized_text)
model = MLE(n)
model.fit(train_data, padded_vocab)

test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) 
                for sent in test_sentences]

# padded_everygram_pipeline returns lazy generators, so the test data
# has to be rebuilt before each pass over it.
test_data, _ = padded_everygram_pipeline(n, tokenized_text)
for test in test_data:
    # Each entry pairs (word, context) with its MLE probability.
    print("MLE Estimates:", [((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in test])

test_data, _ = padded_everygram_pipeline(n, tokenized_text)
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

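Example: Bigram model

Training data: "an apple", "an orange" Padded training data: "<s> an apple </s>", "<s> an orange </s>" Vocabulary: <s>, </s>, an, apple, orange, UNK

MLE estimates

$$p(\text{an} \mid \text{<s>}) = \frac{2}{2} = 1, \quad p(\text{apple} \mid \text{an}) = \frac{1}{2}, \quad p(\text{orange} \mid \text{an}) = \frac{1}{2}, \quad p(\text{</s>} \mid \text{apple}) = 1, \quad p(\text{</s>} \mid \text{orange}) = 1$$

For the test sentence "an apple", padded: "<s> an apple </s>"

l =  (np.log2(p(an|<s>)) + np.log2(p(apple|an)) + np.log2(p(</s>|apple)))/3 =
(np.log2(1) + np.log2(0.5) + np.log2(1))/3 = -0.3333
np.power(2, -l) = 1.2599

For the test sentence "an ant", padded: "<s> an ant </s>" (the bigram "an ant" never occurs in the training data, so p(ant|an) = 0)

l =  (np.log2(p(an|<s>)) + np.log2(p(ant|an)) + np.log2(p(</s>|ant)))/3 = -inf
np.power(2, -l) = inf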
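
The bigram hand calculation can be double-checked without nltk. This is a minimal sketch in plain Python, assuming whitespace-tokenized input; `pad`, `bigram_prob` and `bigram_perplexity` are illustrative names, and treating an unseen context as probability 0 is a choice made for this sketch.

```python
import math
from collections import Counter

train = [["an", "apple"], ["an", "orange"]]


def pad(sentence):
    """Add the sentence-boundary symbols used in the example."""
    return ["<s>"] + sentence + ["</s>"]


bigram_counts = Counter()
context_counts = Counter()
for sent in train:
    padded = pad(sent)
    for prev, word in zip(padded, padded[1:]):
        bigram_counts[(prev, word)] += 1
        context_counts[prev] += 1


def bigram_prob(word, prev):
    """MLE estimate count(prev, word) / count(prev)."""
    if context_counts[prev] == 0:
        return 0.0  # unseen context: treated as probability 0 in this sketch
    return bigram_counts[(prev, word)] / context_counts[prev]


def bigram_perplexity(sentence):
    padded = pad(sentence)
    probs = [bigram_prob(word, prev) for prev, word in zip(padded, padded[1:])]
    if 0.0 in probs:
        return math.inf  # log2(0) is -inf, so the perplexity is infinite
    l = sum(math.log2(p) for p in probs) / len(probs)
    return 2 ** (-l)


pp_apple = bigram_perplexity(["an", "apple"])
pp_ant = bigram_perplexity(["an", "ant"])
print(pp_apple, pp_ant)
```

For "an apple" this gives 2 ** (1/3), roughly 1.26, and for "an ant" it gives infinity, matching the calculation by hand.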

Code

import nltk
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE
from nltk.lm import Vocabulary

train_sentences = ['an apple', 'an orange']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in train_sentences]

n = 2
train_data = [nltk.bigrams(t,  pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
words = [word for sent in tokenized_text for word in sent]
words.extend(["<s>", "</s>"])
padded_vocab = Vocabulary(words)
model = MLE(n)
model.fit(train_data, padded_vocab)

test_sentences = ['an apple', 'an ant']
tokenized_text = [list(map(str.lower, nltk.tokenize.word_tokenize(sent))) for sent in test_sentences]

# nltk.bigrams returns a generator, so the test data is rebuilt before each pass.
test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for test in test_data:
    # Each entry pairs (word, context) with its MLE probability.
    print("MLE Estimates:", [((ngram[-1], ngram[:-1]), model.score(ngram[-1], ngram[:-1])) for ngram in test])

test_data = [nltk.bigrams(t, pad_right=True, pad_left=True, left_pad_symbol="<s>", right_pad_symbol="</s>") for t in tokenized_text]
for i, test in enumerate(test_data):
    print("PP({0}):{1}".format(test_sentences[i], model.perplexity(test)))

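【Discussion】:

Very well explained.
@mujjiga, so for bigram and higher-order models, the vocabulary plays no role when computing the probabilities?
Do you know whether this perplexity is computed with smoothing? What happens if the test text contains a word that was not seen in training?
@SzymonRoziewski If perplexity is computed with smoothing, then for unknown words the output will not be inf but a larger finite value. Just use Laplace (from nltk.lm import Laplace) instead of MLE.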
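
To illustrate the effect of smoothing described in the last comment, here is a hand-rolled sketch of add-one (Laplace) smoothing for the unigram case. It is plain Python rather than nltk's own `Laplace` class, so the exact numbers are those of this sketch and have not been checked against nltk; the `<UNK>` handling and the helper names are assumptions.

```python
import math
from collections import Counter

train_tokens = ["an", "apple", "an", "orange"]
counts = Counter(train_tokens)
vocab = set(counts) | {"<UNK>"}  # known words plus one unknown-word slot


def laplace_prob(word):
    """Add-one estimate: (count + 1) / (N + |V|)."""
    token = word if word in counts else "<UNK>"
    return (counts[token] + 1) / (len(train_tokens) + len(vocab))


def perplexity(words):
    l = sum(math.log2(laplace_prob(w)) for w in words) / len(words)
    return 2 ** (-l)


pp_seen = perplexity(["an", "apple"])
pp_unseen = perplexity(["an", "ant"])
print(pp_seen, pp_unseen)
```

Because every vocabulary entry, including `<UNK>`, now has a nonzero probability, the sentence with the unseen word gets a finite perplexity that is higher than that of the seen sentence, instead of inf.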