How to calculate bigram estimation without using the nltk library? Answers

【Question title】: How to calculate bigram estimation without using nltk library?
【Posted】: 2018-03-21 02:13:25
【Question description】:

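So, I'm super new to Python and I have this project to calculate bigrams without using any Python packages. I have to use Python 2.7. This is what I have so far. It takes a file, hello, and gives output like {'Hello', 'How'} 5. Now, for the bigram estimate, I have to divide 5 by the count of Hello (how many times "Hello" appears in the whole text file). I'm stuck, so any help would be appreciated!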

f = open("hello.txt", 'r')
dictionary={}
for line in f:
    for word in line.split():
        items = line.split()
        bigrams = []
        for i in range(len(items) - 1):
            bigrams.append((items[i], items[i+1]))
            my_dict = {i:bigrams.count(i) for i in bigrams}
            # print(my_dict)
            with open('bigram.txt', 'wt') as out:
                out.write(str(my_dict))
f.close()

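【Question discussion】:

See stackoverflow.com/questions/7591258/fast-n-gram-calculation and stackoverflow.com/questions/21883108/… and stackoverflow.com/questions/40373414/…
I need the bigram estimate... all the other answers just give the bigrams. I need the probability. Example: count(Hello How) / count(Hello). Do you know how to do that?
You need an ngram language model...
@alvas The OP is trying to accomplish the task without using any NLP package. I hope you will unlock the question.
Thanks @Mohammed, that is what I was trying to tell him. All the given solutions only count the occurrences of bigrams; they do not do the estimation. It's not as if I haven't tried. But I'm new to Python and was getting wrong answers. Some people just don't get it!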

Tags: python-2.7 nlp
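
The estimate the asker describes in the discussion above, count(bigram) / count(first word), can be sketched without any packages as follows. This is only an illustrative sketch: the function name `bigram_estimates` and the sample sentence are hypothetical and do not come from the thread, and tokens are assumed to be separated by whitespace. It runs on Python 2.7 as well as Python 3.

```python
def bigram_estimates(tokens):
    """Return {(w1, w2): count(w1, w2) / count(w1)} for a list of tokens."""
    # Count how often each single word occurs
    unigram_counts = {}
    for word in tokens:
        unigram_counts[word] = unigram_counts.get(word, 0) + 1

    # Count how often each pair of adjacent words occurs
    bigram_counts = {}
    for i in range(len(tokens) - 1):
        pair = (tokens[i], tokens[i + 1])
        bigram_counts[pair] = bigram_counts.get(pair, 0) + 1

    estimates = {}
    for pair, count in bigram_counts.items():
        # float() keeps Python 2.7 from doing integer (floor) division
        estimates[pair] = float(count) / unigram_counts[pair[0]]
    return estimates


# Hypothetical sample text, not the contents of the asker's hello.txt
sample = "Hello How are you Hello How is it Hello there".split()
estimates = bigram_estimates(sample)
print(estimates[("Hello", "How")])
```

In the sample, "Hello" occurs 3 times and is followed by "How" twice, so the printed estimate is 2/3. To use it on a file, the token list could come from `open("hello.txt").read().split()`.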

【Solution 1】:

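I'll answer your question with very simple code, just for illustration. Note that bigram estimation is a bit more complicated than you might think. It needs to be done in a divide-and-conquer manner. Different models can be used for the estimation, the most common being the Hidden Markov Model, which I explain in the code below. Note that the more data you have, the better the estimate. I tested the following code on the Brown Corpus.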

def bigramEstimation(file):
    '''A very basic solution for the sake of illustration.
       It can be calculated in a more sophisticated way.
       '''

    unigrams = {} # for unigrams and their counts
    bigrams = {} # for bigrams and their counts

    # 1. Read the text file and split it into a list of tokens
    with open(file, 'r') as f:
        lst = f.read().strip().split()
    print('Read %d tokens...' % len(lst))

    # 2. Generate unigram frequencies
    for l in lst:
        if l not in unigrams:
            unigrams[l] = 1
        else:
            unigrams[l] += 1

    print('Generated %d unigrams...' % len(unigrams))

    # 3. Generate bigrams with frequencies
    for i in range(len(lst) - 1):
        temp = (lst[i], lst[i+1]) # Tuples are easier to reuse than nested lists
        if temp not in bigrams:
            bigrams[temp] = 1
        else:
            bigrams[temp] += 1

    print('Generated %d bigrams...' % len(bigrams))

    # Now Hidden Markov Model
    # bigramProb = (Count(bigram) / Count(first_word)) + (Count(first_word) / total_words_in_corpus)
    # A few things we need to keep in mind
    total_corpus = sum(unigrams.values())
    # You can add smoothed estimation if you want

    print('Calculating bigram probabilities and saving to file...')

    # Open the output file once, in append mode, instead of once per bigram
    with open("bigrams.txt", 'a') as out:
        # Comment out the following line if you do not want the header in the file.
        out.write('Bigram' + '\t' + 'Bigram Count' + '\t' + 'Uni Count' + '\t' + 'Bigram Prob' + '\n')

        for k, v in bigrams.items():
            # first_word = hello in ('hello', 'world')
            first_word = k[0]
            first_word_count = unigrams[first_word]
            # float() avoids integer (floor) division in Python 2.7
            bi_prob = float(v) / first_word_count
            uni_prob = float(first_word_count) / total_corpus

            final_prob = bi_prob + uni_prob
            # Delete whatever you don't want to print into the file
            out.write(k[0] + ' ' + k[1] + '\t' + str(v) + '\t' + str(first_word_count) + '\t' + str(final_prob) + '\n')


# Calls
bigramEstimation('hello.txt')

Hope this helps!
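
The code above leaves a comment saying that a smoothed estimate can be added, without naming a method. One common choice is add-one (Laplace) smoothing, which is an assumption here and not something the answer specifies. The sketch below uses a hypothetical helper, `add_one_bigram_prob`, and takes the vocabulary size V to be the number of distinct tokens in the text.

```python
def add_one_bigram_prob(tokens, w1, w2):
    """Add-one (Laplace) smoothed estimate of P(w2 | w1).

    Computes (count(w1, w2) + 1) / (count(w1) + V),
    where V is the number of distinct tokens (an assumption of this sketch).
    """
    vocabulary_size = len(set(tokens))
    w1_count = tokens.count(w1)

    # Count how often w1 is immediately followed by w2
    pair_count = 0
    for i in range(len(tokens) - 1):
        if tokens[i] == w1 and tokens[i + 1] == w2:
            pair_count += 1

    # float() keeps Python 2.7 from doing integer (floor) division
    return float(pair_count + 1) / (w1_count + vocabulary_size)


tokens = "Hello Hello How".split()
seen = add_one_bigram_prob(tokens, "Hello", "How")    # (1 + 1) / (2 + 2)
unseen = add_one_bigram_prob(tokens, "How", "Hello")  # (0 + 1) / (1 + 2)
```

The point of the added 1 is that a bigram that never occurs in the text, such as (How, Hello) here, still gets a small non-zero estimate instead of 0.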

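【Discussion】:

See also cs.nyu.edu/courses/spring17/CSCI-UA.0480-009/…
Thanks for the reply. But I think it's slightly off. Say I have the text "Hello Hello How". For the bigram probability P(How | Hello), it should take the count of (Hello How), which is 1, and divide it by the count of (Hello), which is 2. The probability is 1/2.
Hello Hello How?
I should get .5
What I'm asking is: is the score you get not the score you should be getting?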
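
As a worked check of the comments above, assume the whole corpus is the three tokens "Hello Hello How", and write c(·) for a count and N for the total number of tokens (notation introduced here, not used in the thread). The plain count ratio and the formula written in the answer's code comment then give different numbers:

```latex
\begin{align*}
c(\text{Hello}) &= 2, \quad c(\text{Hello}, \text{How}) = 1, \quad N = 3 \\
P(\text{How} \mid \text{Hello}) &= \frac{c(\text{Hello}, \text{How})}{c(\text{Hello})} = \frac{1}{2} = 0.5 \\
\text{bigramProb}(\text{Hello}, \text{How}) &= \frac{c(\text{Hello}, \text{How})}{c(\text{Hello})} + \frac{c(\text{Hello})}{N} = \frac{1}{2} + \frac{2}{3} = \frac{7}{6} \approx 1.17
\end{align*}
```

The second line is the 0.5 the asker expects. The third value exceeds 1 because of the added unigram term, so it behaves as a score and not as a probability.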