Form bigrams in a Pandas DataFrame, not only from words next to each other

【Question title】: Form Bigrams in Pandas DataFrame not only words next to each other
【Posted】: 2020-10-13 07:53:42
【Question】:

I have a huge but simple Pandas DataFrame. The rows look like this:

index   Text
1   This is a sample text
2   I am a test text
3   this is a test

I want to create bigrams for each row. Here is what I did:

from nltk.collocations import BigramCollocationFinder

def create_bigram(word_list):
    # Count bigrams of directly adjacent words
    finder = BigramCollocationFinder.from_words(word_list)
    return list(finder.ngram_fd.items())

test_str = "This is a sample text".split()
create_bigram(test_str)

[(('This', 'is'), 1),
 (('is', 'a'), 1),
 (('a', 'sample'), 1),
 (('sample', 'text'), 1)]

But I want to record every co-occurrence of each word within a row, not only words that are next to each other.

Like this:

index   Bigrams
1   (this, is), (this, a), (this, sample), (this, text), (is, a), (is, sample), (is, text)...

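and so on...

I want to be able to see how often words occur together within one DataFrame row.

Is there some default function in nltk (or another NLP library) that does this, or do I have to write it myself?

I could not find anything other than bigrams, trigrams, or n-grams, but those all count only direct neighbours, right?

Running a simple nested for loop to count every co-occurrence over more than 300,000 rows, with texts longer than "This is a sample text", is very time-consuming...

Edit: I feel like I am missing something obvious, but I cannot see it.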

【Comments】:

Tags: python pandas nltk n-gram

【Solution 1】:

Try this:

from itertools import permutations
import pandas as pd


def create_bigram(text):
    # Split the sentence into words, then build every ordered pair of words
    words = text.split()
    perms = [','.join(pair) for pair in permutations(words, 2)]
    # One row whose columns hold the pairs as "word1,word2" strings
    df = pd.DataFrame(data=[perms])
    print(df)


test_str = 'This is a sample text'
create_bigram(test_str)

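The simplest way to get the desired result is to create permutations of the words in the string.

This can be done with the permutations function from itertools. You can read more about it at https://docs.python.org/3/library/itertools.html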
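To count how often two words appear together across many rows, a variation on the same idea might look like the sketch below. It is not the code from the answer: it swaps in itertools.combinations (so that (this, is) and (is, this) are not counted separately, which permutations would do) and collects totals in a collections.Counter. The helper names row_pairs and count_pairs are hypothetical, and lower-casing plus de-duplicating the words of each row is an assumption about what "occur together" should mean.

```python
from collections import Counter
from itertools import combinations


def row_pairs(text):
    """Return every unordered word pair that co-occurs in one row of text."""
    # Assumption: case is ignored and a pair counts at most once per row.
    # Sorting gives each pair a single canonical key, e.g. ("is", "this").
    words = sorted(set(text.lower().split()))
    return list(combinations(words, 2))


def count_pairs(texts):
    """Count in how many rows each word pair appears together."""
    counts = Counter()
    for text in texts:
        counts.update(row_pairs(text))
    return counts


rows = ["This is a sample text", "I am a test text", "this is a test"]
pair_counts = count_pairs(rows)
print(pair_counts.most_common(3))
```

Because count_pairs accepts any iterable of strings, a DataFrame column such as df["Text"] can be passed to it directly, and df["Text"].map(row_pairs) would give the per-row pairs. The number of pairs still grows quadratically with the number of distinct words in a row, so very long texts remain expensive.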

【Comments】:

@XEmporea did this help you?