How to count the most frequently repeated phrases in Pandas

【Question title】: How to count the most frequently repeated phrases in Pandas
【Posted】: 2020-05-19 03:10:20
【Question】:

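I have a Pandas DataFrame with a single text column, and I want to count the most common phrases in that column. For example, in the text below, phrases such as "a very good movie" and "last night" appear several times. I think this could be done by defining n-grams, for example phrases of 3 to 5 words, but I don't know how to do it.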

import pandas as pd


text = ['this is a very good movie that we watched last night',
        'i have watched a very good movie last night',
        'i love this song, its amazing',
        'what should we do if he asks for it',
        'movie last night was amazing',
        'a very nice song was played',
        'i would like to se a good show',
        'a good show was on tv last night']

df = pd.DataFrame({"text":text})
print(df)

So my goal is to rank the phrases (3 to 5 words long) that appear most often.

【Question comments】:

Tags: python pandas nlp

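【Answer 1】:

First split each text inside a list comprehension and flatten the words into vals, then create the n-grams, pass them to a Series, and finally call Series.value_counts: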

from nltk import ngrams
vals = [y for x in df['text'] for y in x.split()]

n = [3,4,5]
a = pd.Series([y for x in n for y in ngrams(vals, x)]).value_counts()
print (a)
(a, good, show)                      2
(movie, last, night)                 2
(a, very, good)                      2
(last, night, i)                     2
(a, very, good, movie)               2
                                    ..
(should, we, do)                     1
(a, very, nice, song, was)           1
(asks, for, it, movie, last)         1
(this, song,, its, amazing, what)    1
(i, have, watched, a)                1
Length: 171, dtype: int64
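Because vals is one flat list of words from every row, some of the n-grams above span two rows, for example (last, night, i) and (asks, for, it, movie, last). If phrases should stay inside a single text, one option is to build the n-grams row by row. The following is a minimal sketch of that idea, not the answer's own method: it uses plain Python instead of nltk.ngrams, and the helper word_ngrams and the variable per_row_counts are names made up for this illustration.

```python
from collections import Counter

texts = [
    "this is a very good movie that we watched last night",
    "i have watched a very good movie last night",
    "i love this song, its amazing",
    "what should we do if he asks for it",
    "movie last night was amazing",
    "a very nice song was played",
    "i would like to se a good show",
    "a good show was on tv last night",
]


def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return list(zip(*(tokens[i:] for i in range(n))))


# Build the n-grams one text at a time, so no phrase spans two different rows.
per_row_counts = Counter(
    " ".join(gram)
    for text in texts
    for n in (3, 4, 5)
    for gram in word_ngrams(text.split(), n)
)

print(per_row_counts.most_common(5))
```

With a DataFrame, the same loop could iterate over df['text'] instead of the texts list. Cross-row phrases such as "last night i" no longer appear in the counts.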

Or, if the tuples should be joined with spaces:

n = [3,4,5]
a = pd.Series([' '.join(y) for x in n for y in ngrams(vals, x)]).value_counts()
print (a)
last night i                  2
a good show                   2
a very good movie             2
very good movie               2
movie last night              2
                             ..
its amazing what should       1
watched last night i have     1
to se a                       1
very good movie last night    1
a very nice song was          1
Length: 171, dtype: int64
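The outputs above also show that str.split() keeps punctuation attached to words: the token "song," carries its comma, so it would not match a plain "song" elsewhere. A possible way around this, sketched here under the assumption that only lowercase letters, digits, and apostrophes matter, is to tokenize with a regular expression. The function name tokenize is hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and keep only runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


sentence = "i love this song, its amazing"

naive = sentence.split()      # whitespace split keeps the comma on 'song,'
cleaned = tokenize(sentence)  # regex tokens drop the comma

print(naive)
print(cleaned)
```

The cleaned tokens can then be fed to the n-gram step in place of x.split().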

Another idea, using Counter:

from nltk import ngrams
from collections import Counter

vals = [y for x in df['text'] for y in x.split()]
c = Counter([' '.join(y) for x in [3,4,5] for y in ngrams(vals, x)])

df1 = pd.DataFrame({'ngrams': list(c.keys()),
                   'count': list(c.values())})
print (df1)
                   ngrams  count
0               this is a      1
1               is a very      1
2             a very good      2
3         very good movie      2
4         good movie that      1
..                    ...    ...
166  show a good show was      1
167    a good show was on      1
168   good show was on tv      1
169   show was on tv last      1
170  was on tv last night      1

[171 rows x 2 columns]
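Note that df1 above lists the n-grams in the order they were first seen, not by frequency. Since the goal is a ranking, the counts still need to be sorted. A small sketch using Counter.most_common follows; the counter contents here are a shortened, illustrative sample rather than the full result, and the names ranked and repeated are made up for this example.

```python
from collections import Counter

# Shortened illustrative sample; in the answer above, `c` holds the real counts.
c = Counter({"this is a": 1, "a very good": 2, "very good movie": 2, "is a very": 1})

# most_common() returns (phrase, count) pairs sorted by descending count.
ranked = c.most_common()

# Keep only the phrases that occur more than once.
repeated = [(phrase, count) for phrase, count in ranked if count > 1]

print(repeated)
```

On the DataFrame itself, df1.sort_values('count', ascending=False) should give an equivalent ordering.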

【Discussion】:

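Can you explain in words what this does: a = pd.Series([y for x in n for y in ngrams(vals, x)]).value_counts()
@taga - Sure. It flattens the lists: the n-grams are created three times (once for each n in 3, 4, 5), and the output is a single list of tuples.
@taga - If you use a nested list comprehension without flattening, such as c = [[y for y in ngrams(vals, x)] for x in n], you get a list of lists of tuples, and the solution fails because it needs a flat list of tuples.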
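The difference between the two comprehensions discussed here can be seen on a tiny input. This is only an illustrative sketch: it uses a hypothetical plain-Python helper, word_ngrams, in place of nltk.ngrams, and a six-word sample in place of the real vals.

```python
def word_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple (hypothetical helper)."""
    return list(zip(*(tokens[i:] for i in range(n))))


vals = "a very good movie last night".split()
sizes = [3, 4, 5]

# Nested comprehension: one inner list of tuples per n-gram size.
nested = [[g for g in word_ngrams(vals, n)] for n in sizes]

# Flat comprehension: a single list holding every tuple.
flat = [g for n in sizes for g in word_ngrams(vals, n)]

print(len(nested))  # number of inner lists, one per size
print(len(flat))    # total number of n-grams across all sizes
```

The elements of flat are tuples, which are hashable and can be counted. The elements of nested are lists, which are not hashable, so counting them directly raises a TypeError.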