Bigram probability: answers

【Question title】: Bigram probability
【Posted】: 2020-07-13 00:38:08
【Question description】:

I have a Moby Dick corpus and I need to calculate the probability of the bigram "ivory leg". I know this command gives me a list of all the bigrams:

bigrams = [w1+" "+w2 for w1,w2 in zip(words[:-1], words[1:])]

But how can I get the probability of just these two words?

【Comments on the question】:

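Could you be more specific about what you mean by the probability of the two words? In NLP, bigram probabilities are usually computed as conditional probabilities, i.e. P(W[n] | W[n-1]). Is that what you are trying to do, or something else?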
Yes, that's it. How do I write that in code when the counts have to come from a corpus?

Tags: python pycharm n-gram

【Solution 1】:

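You can count all the bigrams and count the specific bigram you are looking for. The probability of the bigram occurring, P(bigram), is simply the second count divided by the first. The conditional probability of w[1] given w[0], P(w[1] | w[0]), is the number of times the bigram occurs divided by the number of times w[0] occurs. For example, looking at the bigram ('some', 'text'):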

s = 'this is some text about some text but not some other stuff'.split()

bigrams = [(s1, s2) for s1, s2 in zip(s, s[1:])]

# [('this', 'is'),
#  ('is', 'some'),
#  ('some', 'text'),
#  ('text', 'about'),
#  ...

number_of_bigrams = len(bigrams)
# 11

# how many times 'some' occurs 
some_count = s.count('some')
# 3

# how many times bigram occurs
bg_count = bigrams.count(('some', 'text'))
# 2

# probability of 'text' given 'some', P(text | some)
# i.e. you found `some`; what's the probability that it is followed by `text`:
bg_count/some_count
# 0.666

# probability of the bigram in the text, P(some text)
# i.e. pick a bigram at random, what's the probability it's your bigram:
bg_count/number_of_bigrams
# 0.181818
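
For a corpus the size of Moby Dick, calling `list.count` once per query rescans the whole list each time. A minimal sketch of the same two ratios using `collections.Counter` is given here; the helper name `bigram_probabilities` is hypothetical (not part of the original answer), and it assumes `words` is an already tokenized list of strings, as in the question. How the corpus is tokenized and lower-cased is left to the reader.

```python
from collections import Counter


def bigram_probabilities(words, w1, w2):
    """Return (P(w2 | w1), P(w1 w2)) estimated from raw counts.

    Hypothetical helper: it mirrors the ratios used in the answer above,
    with zero returned whenever a denominator would be zero.
    """
    # Count every adjacent pair of tokens once.
    bigram_counts = Counter(zip(words, words[1:]))
    # Count every token, as s.count(...) does in the answer.
    word_counts = Counter(words)

    pair_count = bigram_counts[(w1, w2)]
    total_bigrams = sum(bigram_counts.values())

    conditional = pair_count / word_counts[w1] if word_counts[w1] else 0.0
    joint = pair_count / total_bigrams if total_bigrams else 0.0
    return conditional, joint


words = 'this is some text about some text but not some other stuff'.split()
conditional, joint = bigram_probabilities(words, 'some', 'text')
print(round(conditional, 3), round(joint, 3))
# 0.667 0.182
```

For the question's case, the call would look like `bigram_probabilities(words, 'ivory', 'leg')`, assuming the tokens in `words` are cased the same way as the query.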

【Discussion】:

Hi Mark, your answer makes sense (I've upvoted it), but why is P(w2 | w1) = count(w1, w2) / count(w1)? I can't find the answer anywhere.
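
One way to see where that ratio comes from, sketched under the assumption that each probability is estimated by its relative frequency in the corpus (the maximum-likelihood estimate). The symbol N, the number of positions counted, is introduced here for illustration and does not appear in the answer; the number of words and the number of bigrams differ by one, which is treated as negligible.

```latex
P(w_2 \mid w_1)
  = \frac{P(w_1, w_2)}{P(w_1)}
  \approx \frac{\operatorname{count}(w_1, w_2) / N}{\operatorname{count}(w_1) / N}
  = \frac{\operatorname{count}(w_1, w_2)}{\operatorname{count}(w_1)}
```

The first step is the definition of conditional probability; N then cancels, leaving the ratio of counts. In the example from the answer this gives 2 / 3 for P(text | some).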