First split the text in a list comprehension and flatten it into vals, then create the ngrams, pass them to a Series, and finally use Series.value_counts:
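import pandas as pd
from nltk import ngrams
# flatten every row of df['text'] into one list of words
vals = [y for x in df['text'] for y in x.split()]
n = [3, 4, 5]
# build the 3-, 4- and 5-grams over the flattened words and count each one
a = pd.Series([y for x in n for y in ngrams(vals, x)]).value_counts()
print(a)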
(a, good, show) 2
(movie, last, night) 2
(a, very, good) 2
(last, night, i) 2
(a, very, good, movie) 2
..
(should, we, do) 1
(a, very, nice, song, was) 1
(asks, for, it, movie, last) 1
(this, song,, its, amazing, what) 1
(i, have, watched, a) 1
Length: 171, dtype: int64
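Or, if the tuples should be joined with spaces:
n = [3, 4, 5]
# join each n-gram tuple into a single space-separated string before counting
a = pd.Series([' '.join(y) for x in n for y in ngrams(vals, x)]).value_counts()
print(a)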
last night i 2
a good show 2
a very good movie 2
very good movie 2
movie last night 2
..
its amazing what should 1
watched last night i have 1
to se a 1
very good movie last night 1
a very nice song was 1
Length: 171, dtype: int64
Another idea, using Counter:
import pandas as pd
from nltk import ngrams
from collections import Counter
vals = [y for x in df['text'] for y in x.split()]
# count the space-joined 3-, 4- and 5-grams; keys keep first-seen order
c = Counter([' '.join(y) for x in [3, 4, 5] for y in ngrams(vals, x)])
df1 = pd.DataFrame({'ngrams': list(c.keys()),
                    'count': list(c.values())})
print(df1)
ngrams count
0 this is a 1
1 is a very 1
2 a very good 2
3 very good movie 2
4 good movie that 1
.. ... ...
166 show a good show was 1
167 a good show was on 1
168 good show was on tv 1
169 show was on tv last 1
170 was on tv last night 1
[171 rows x 2 columns]
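Because vals flattens every row into one list, some of the n-grams above span two rows (for example "asks for it movie last"). If that is not wanted, a possible variant is to build the n-grams one row at a time. The following is only a minimal sketch using the standard library: the helper word_ngrams and the sample texts are hypothetical stand-ins for nltk's ngrams and for df['text'], not part of the original answer.

```python
from collections import Counter

def word_ngrams(tokens, n):
    """Yield consecutive n-word tuples from a list of tokens (hypothetical helper)."""
    # zip stops at the shortest slice, so rows shorter than n yield nothing
    return zip(*(tokens[i:] for i in range(n)))

# Hypothetical sample rows standing in for df['text']
texts = [
    "this is a very good movie",
    "a very good movie last night",
]

counts = Counter(
    " ".join(gram)
    for text in texts        # one row at a time, so no n-gram crosses rows
    for n in (3, 4, 5)
    for gram in word_ngrams(text.split(), n)
)

# most_common sorts by descending count, similar to Series.value_counts
for gram, count in counts.most_common(3):
    print(gram, count)
```

With real data, iterating over df['text'] in place of texts should give per-row counts, and the Counter can be turned into a DataFrame the same way as df1 above.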