Counting bigrams in a string without using NLTK

【Question title】: Counting bigrams in a string without using NLTK
【Posted】: 2023-03-09 15:14:01
【Question】:

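I have been trying to write code that counts how many times each bigram occurs in a string (in case you don't know, a bigram is made up of two adjacent words, e.g. 'if you' or 'you dont'). I tried using .join together with list slicing, but it only returns one word instead of two.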

I used .join inside a for loop that runs n-1 times (where n is the number of words) and joins the slice of the list between n-1 and n with a space.

content_string = "This is a test to see whether or not this can effectively create bigrams"
words = content_string.lower()
punctuation = ["'", '"', ',', '.', '?', '!', ':', ';', '()','-']
words = "".join(i if i not in punctuation else "" for i in words)
words = words.split()

n=1
number = len(words)-1
for n in range(number):
    print(" ".join(words[n-1:n]))

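I expected this to produce bigrams, but what actually comes out is only unigrams (interestingly, when I tried a dictionary with the bigram as the key and its number of occurrences as the value, the keys were still unigrams, but the values were double what they were when I counted unigrams only). What are the possible options that don't involve importing the NLTK library?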

【Question comments】:

Tags: python python-3.x string

【Solution 1】:

If you want to count bigrams, I suggest using collections.Counter. Only the last part of your code needs to change:

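from collections import Counter
# Pair each word with the one that follows it, then count the pairs
bigrams = Counter(zip(words, words[1:]))
print(bigrams)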

Output

Counter({('this', 'is'): 1, ('is', 'a'): 1, ('a', 'test'): 1, ('test', 'to'): 1, ('to', 'see'): 1, ('see', 'whether'): 1, ('whether', 'or'): 1, ('or', 'not'): 1, ('not', 'this'): 1, ('this', 'can'): 1, ('can', 'effectively'): 1, ('effectively', 'create'): 1, ('create', 'bigrams'): 1})

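The key here is to obtain the bigrams by zipping the word list with a copy of itself shifted by one position (zip(words, words[1:])). If you want the bigrams as strings rather than tuples, do the following: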

bigrams = Counter(' '.join(bigram) for bigram in zip(words, words[1:]))

Output

Counter({'this is': 1, 'is a': 1, 'a test': 1, 'test to': 1, 'to see': 1, 'see whether': 1, 'whether or': 1, 'or not': 1, 'not this': 1, 'this can': 1, 'can effectively': 1, 'effectively create': 1, 'create bigrams': 1})
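
For reference, here is a minimal end-to-end sketch that puts the cleaning step and the Counter approach together. The helper name count_bigrams is hypothetical (it is not from the original post), and it assumes that stripping everything in string.punctuation is acceptable, which removes more characters than the list in the question. The last lines show a slice-based version of the question's loop: words[n-1:n] contains only one word because the end index of a slice is exclusive, whereas words[i:i+2] contains two.

```python
from collections import Counter
import string


def count_bigrams(text):
    """Count adjacent word pairs in text (hypothetical helper for illustration)."""
    # Assumption: lowercase the text and delete all ASCII punctuation characters.
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    words = cleaned.split()
    # Pair each word with its successor; zip stops when the shorter sequence ends.
    return Counter(" ".join(pair) for pair in zip(words, words[1:]))


content_string = "This is a test to see whether or not this can effectively create bigrams"
bigrams = count_bigrams(content_string)
print(bigrams)

# Slice-based equivalent of the loop in the question: take two words per step.
words = content_string.lower().split()
loop_bigrams = [" ".join(words[i:i + 2]) for i in range(len(words) - 1)]
print(loop_bigrams)
```

Because Counter is a dict subclass, a single count can be read with bigrams["this is"], and bigrams.most_common(3) returns the three most frequent bigrams.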

【Comments】: