Creating bigrams of words, not letters: answers

[Question title]: Creating bigrams of words not letters
[Posted]: 2020-08-29 22:47:07
[Question]:

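I am trying to create bigrams from the text in a given column of my dataframe. The problem is that my code splits the sentences in the dataframe into pairs of letters rather than bigrams of words (i.e. pairs of two words).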

Here is a sample of my dataframe:

    favorites   location    retweets    tweet_text
2832    0   Washington, DC  238517  RT @SpaceX: Liftoff! http
2864    802842  Washington, DC  213853  The United States of America will be designating ANTIFA as a Terrorist Organization.
2851    0   Washington, DC  213853  RT @realDonaldTrump: The United States of America will be designating ANTIFA as a Terrorist Organization.
2914    778873  Washington, DC  146570  CHINA!
288 606090  Washington, DC  138520  IF YOU CAN PROTEST IN PERSON, YOU CAN VOTE IN PERSON!

My code:

import nltk
df['bigrams'] = df['tweet_text'].apply(lambda row: list(nltk.ngrams(row, 2)))

And my output:

    favorites   location    retweets    tweet_text  bigrams
0   42557   Washington, DC  6500    Landing in New Hampshire!   [(L, a), (a, n), (n, d), (d, i), (i, n), (n, g), (g, ), ( , i), (i, n), (n, ), ( , N), (N, e), (e, w), (w, ), ( , H), (H, a), (a, m), (m, p), (p, s), (s, h), (h, i), (i, r), (r, e), (e, !)]
1   68523   Washington, DC  16901   No, I want Big Ten, and all other football, back - NOW. The Dems don’t want football back, for political reasons, but are trying to blame me and the Republicans. Another LIE, but this is what we are up against! They should also open up all of their Shutdown States.   [(N, o), (o, ,), (,, ), ( , I), (I, ), ( , w), (w, a), (a, n), (n, t), (t, ), ( , B), (B, i), (i, g), (g, ), ( , T), (T, e), (e, n), (n, ,), (,, ), ( , a), (a, n), (n, d), (d, ), ( , a), (a, l), (l, l), (l, ), ( , o), (o, t), (t, h), (h, e), (e, r), (r, ), ( , f), (f, o), (o, o), (o, t), (t, b), (b, a), (a, l), (l, l), (l, ,), (,, ), ( , b), (b, a), (a, c), (c, k), (k, ), ( , -), (-, ), ( , N), (N, O), (O, W), (W, .), (., ), ( , T), (T, h), (h, e), (e, ), ( , D), (D, e), (e, m), (m, s), (s, ), ( , d), (d, o), (o, n), (n, ’), (’, t), (t, ), ( , w), (w, a), (a, n), (n, t), (t, ), ( , f), (f, o), (o, o), (o, t), (t, b), (b, a), (a, l), (l, l), (l, ), ( , b), (b, a), (a, c), (c, k), (k, ,), (,, ), ( , f), (f, o), (o, r), (r, ), ( , p), (p, o), (o, l), (l, i), (i, t), (t, i), ...]

[Comments on the question]:

What is your expected output?

Tags: python pandas nltk

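[Solution 1]:

The problem is that nltk.ngrams needs to be given a list of words if you want it to produce word n-grams. A plain string is iterated character by character, which is why you get pairs of letters.

So the simplest solution is to modify your code to pass it such a list of words: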

df['bigrams'] = df['tweet_text'].apply(lambda row: list(nltk.ngrams(row.split(), 2)))

This splits the text into words on whitespace, so for

IF YOU CAN PROTEST IN PERSON, YOU CAN VOTE IN PERSON!

one of the bigrams you get is ('IN', 'PERSON,') (note the comma).
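The difference between passing a string and passing a list of words can be seen in a small sketch. The helper `pairs` below is hypothetical: it is a pure-Python stand-in for what nltk.ngrams(seq, 2) does, used here only so the snippet runs without nltk installed.

```python
def pairs(seq):
    """Return consecutive pairs from any sequence.

    Hypothetical stand-in for list(nltk.ngrams(seq, 2)).
    """
    items = list(seq)
    return list(zip(items, items[1:]))


text = "IF YOU CAN PROTEST IN PERSON, YOU CAN VOTE IN PERSON!"

char_bigrams = pairs(text)          # iterating a str yields single characters
word_bigrams = pairs(text.split())  # iterating a list yields whole words

print(char_bigrams[:3])  # [('I', 'F'), ('F', ' '), (' ', 'Y')]
print(word_bigrams[:3])  # [('IF', 'YOU'), ('YOU', 'CAN'), ('CAN', 'PROTEST')]
print(word_bigrams[4])   # ('IN', 'PERSON,')  <- punctuation stays attached
```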

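If you want to split the words slightly differently (depending on your application), you may want to write your own function that splits row into words and call it on row instead of calling row.split().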
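As one possible illustration of such a custom splitter, here is a minimal sketch that uses a regular expression to keep only runs of word characters, so punctuation is dropped. The names `tokenize` and `bigrams_of_words` are made up for this example, and the pattern `\w+` is just one assumption about what counts as a word. For instance, it would split "don't" into two tokens, which may or may not suit your data.

```python
import re


def tokenize(row):
    """Hypothetical tokenizer: keep runs of word characters, drop punctuation."""
    return re.findall(r"\w+", row)


def bigrams_of_words(row):
    """Return consecutive word pairs; same result as list(nltk.ngrams(words, 2))."""
    words = tokenize(row)
    return list(zip(words, words[1:]))


result = bigrams_of_words("IF YOU CAN PROTEST IN PERSON, YOU CAN VOTE IN PERSON!")
print(result[4])    # ('IN', 'PERSON')  <- no trailing comma
print(len(result))  # 10
```

With the dataframe from the question, this could presumably be applied as df['tweet_text'].apply(bigrams_of_words), or the tokenizer alone could replace the split call, as in list(nltk.ngrams(tokenize(row), 2)).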

[Comments]:

Why not prefer the more idiomatic nltk.word_tokenize over str.split?
Because I'm not familiar with nltk. word_tokenize(row) is probably a better choice than row.split().