Bigram without repeated words: answers

【Question title】: Bigram without repeated words
【Posted】: 2021-09-07 09:54:10
【Question description】:

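I want to analyze a text by counting bigrams. Unfortunately, my text contains many repeated words (e.g. hello hello), which I don't want counted as bigrams.

My code is as follows: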

import nltk

b = nltk.collocations.BigramCollocationFinder.from_words('this this is is a a test test'.split())
b.ngram_fd.keys()

>> dict_keys([('this', 'this'), ('this', 'is'), ('is', 'is'), ('is', 'a'), ('a', 'a'), ('a', 'test'), ('test', 'test')])

But I would like the output to be:

>> [('a', 'test'), ('is', 'a'), ('this', 'is')]

Do you have any suggestions, including ones that use a different library? Thanks in advance for any help. Francesca

【Question comments】:

Tags: python nltk word-cloud countvectorizer

【Solution 1】:

Try:

result_cleared = [x for x in b.ngram_fd.keys() if x[0] != x[1]]
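
For readers who would rather not depend on NLTK, the same filter can be sketched with the standard library alone. This is only an illustration and not part of the original answer; the helper name bigrams_without_repeats is made up here, and it assumes the text has already been split into a list of words.

```python
from collections import Counter


def bigrams_without_repeats(words):
    """Count adjacent word pairs, skipping pairs that repeat the same word.

    Hypothetical helper for illustration; not an NLTK function.
    """
    pairs = zip(words, words[1:])  # every adjacent (first, second) pair
    return Counter(pair for pair in pairs if pair[0] != pair[1])


counts = bigrams_without_repeats('this this is is a a test test'.split())
print(sorted(counts))
# [('a', 'test'), ('is', 'a'), ('this', 'is')]
```

Because the result is a Counter, the frequency of each remaining bigram is also available, for example counts[('this', 'is')].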

Edit: if your text is stored in a DataFrame, you can do the following:

import nltk
import pandas as pd

# the dummy data from your comment
df = pd.DataFrame({'Text': ['this is a stupid text with no no no sense','this song says na na na','this is very very very very annoying']})

def create_bigrams(text):
    b = nltk.collocations.BigramCollocationFinder.from_words(text.split())
    return [x for x in b.ngram_fd.keys() if x[0] != x[1]]

df["bigrams"] = df["Text"].apply(create_bigrams)
df["bigrams"].apply(print)

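This first adds a column containing the bigrams to the DataFrame, then prints the column values. If you only want the output without modifying df, replace the last two lines with: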

df["Text"].apply(create_bigrams).apply(print)

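【Comments】:

Thanks, Georgy! Do you also know how to adapt this if the text is in a DataFrame?
@botti23 You're welcome! It depends: do you mean that each text sits in a single DataFrame cell, or that its words are in separate cells (in one or several columns)? Can you provide dummy data?
Hi @georgy-kopshteyn, I mean the case where each text is inside a single DataFrame cell, as in: df = pd.DataFrame({'Text': ['this is a stupid text with no no no sense','this song says na na na','this is very very very very annoying']})
@botti23 I updated my post. Let me know if this works for you.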

【Solution 2】:

You can remove consecutive repeated words before passing the list to nltk.collocations.BigramCollocationFinder.from_words:

words = 'this this is is a a test test'.split()
removed_duplicates = [first for first, second in zip(words, ['']+words) if first != second]

Output:

['this', 'is', 'a', 'test']

Then do:

b = nltk.collocations.BigramCollocationFinder.from_words(removed_duplicates)
b.ngram_fd.keys()
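
The de-duplication step above could also be written with itertools.groupby, which groups runs of equal consecutive items. This is an alternative sketch rather than the answer's own code; the variable name collapsed is chosen here for illustration.

```python
from itertools import groupby

words = 'this this is is a a test test'.split()

# groupby yields one (key, group) pair per run of equal consecutive words,
# so keeping only the keys collapses each run to a single word.
collapsed = [word for word, _ in groupby(words)]
print(collapsed)
# ['this', 'is', 'a', 'test']
```

Only adjacent repeats are collapsed: a word that reappears later in the text, as in 'a b a', is kept both times, so bigrams such as ('b', 'a') are still counted.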

【Comments】: