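【Posted】: 2020-08-29 22:47:07
【Question】:
I am trying to create bigrams from the text in a given column of my dataframe. The problem is that my code splits each sentence into pairs of characters rather than bigrams (pairs of words).
Here is a sample of my dataframe: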
favorites location retweets tweet_text
2832 0 Washington, DC 238517 RT @SpaceX: Liftoff! http
2864 802842 Washington, DC 213853 The United States of America will be designating ANTIFA as a Terrorist Organization.
2851 0 Washington, DC 213853 RT @realDonaldTrump: The United States of America will be designating ANTIFA as a Terrorist Organization.
2914 778873 Washington, DC 146570 CHINA!
288 606090 Washington, DC 138520 IF YOU CAN PROTEST IN PERSON, YOU CAN VOTE IN PERSON!
My code:
import nltk
df['bigrams'] = df['tweet_text'].apply(lambda row: list(nltk.ngrams(row, 2)))
And my output:
favorites location retweets tweet_text bigrams
0 42557 Washington, DC 6500 Landing in New Hampshire! [(L, a), (a, n), (n, d), (d, i), (i, n), (n, g), (g, ), ( , i), (i, n), (n, ), ( , N), (N, e), (e, w), (w, ), ( , H), (H, a), (a, m), (m, p), (p, s), (s, h), (h, i), (i, r), (r, e), (e, !)]
1 68523 Washington, DC 16901 No, I want Big Ten, and all other football, back - NOW. The Dems don’t want football back, for political reasons, but are trying to blame me and the Republicans. Another LIE, but this is what we are up against! They should also open up all of their Shutdown States. [(N, o), (o, ,), (,, ), ( , I), (I, ), ( , w), (w, a), (a, n), (n, t), (t, ), ( , B), (B, i), (i, g), (g, ), ( , T), (T, e), (e, n), (n, ,), (,, ), ( , a), (a, n), (n, d), (d, ), ( , a), (a, l), (l, l), (l, ), ( , o), (o, t), (t, h), (h, e), (e, r), (r, ), ( , f), (f, o), (o, o), (o, t), (t, b), (b, a), (a, l), (l, l), (l, ,), (,, ), ( , b), (b, a), (a, c), (c, k), (k, ), ( , -), (-, ), ( , N), (N, O), (O, W), (W, .), (., ), ( , T), (T, h), (h, e), (e, ), ( , D), (D, e), (e, m), (m, s), (s, ), ( , d), (d, o), (o, n), (n, ’), (’, t), (t, ), ( , w), (w, a), (a, n), (n, t), (t, ), ( , f), (f, o), (o, o), (o, t), (t, b), (b, a), (a, l), (l, l), (l, ), ( , b), (b, a), (a, c), (c, k), (k, ,), (,, ), ( , f), (f, o), (o, r), (r, ), ( , p), (p, o), (o, l), (l, i), (i, t), (t, i), ...]
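The character pairs in the output are what you get when an n-gram function receives a raw string: iterating over a Python string yields one character at a time, so every "token" is a single letter. Below is a minimal sketch of one possible fix, assuming plain whitespace tokenization is good enough for these tweets. The helper name `word_bigrams` is made up for illustration; for a list of tokens, `list(nltk.ngrams(tokens, 2))` produces the same pairs.

```python
def word_bigrams(text):
    """Return the word-level bigrams of a string as a list of 2-tuples."""
    # Split on whitespace first; iterating the raw string would yield characters.
    tokens = text.split()
    # Pair each token with its successor (assumption: whitespace tokens are
    # acceptable; punctuation stays attached to the neighbouring word).
    return list(zip(tokens, tokens[1:]))


example = "Landing in New Hampshire!"
print(word_bigrams(example))
# [('Landing', 'in'), ('in', 'New'), ('New', 'Hampshire!')]
```

With the dataframe from the question this could be applied as `df['bigrams'] = df['tweet_text'].apply(word_bigrams)`. A single-word tweet such as `CHINA!` gives an empty list. If punctuation should become separate tokens, a real tokenizer would be needed in place of `str.split()`.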
【Comments】:
-
What is your expected output?