【Posted】:2022-06-29 22:32:41
【Question】:
I have a DataFrame:
import pandas as pd

data = {
    'token_1': [
        ['cat', 'bag', 'sitting'],
        ['dog', 'eats', 'bowls'],
        ['mouse', 'mustache', 'tail'],
        ['dog', 'eat', 'meat'],
    ],
    'token_2': [
        ['cat', 'from', 'bag', 'cat', 'in', 'bag', 'sitting', 'whole', 'day'],
        ['dog', 'eats', 'from', 'bowls', 'dog', 'eats', 'always', 'from', 'bowls', 'eats', 'bowl'],
        # Note: 'and,' 'ears' has no comma between the two literals, so Python
        # concatenates them into the single token 'and,ears'.
        ['mouse', 'with', 'a', 'big', 'tail', 'and,' 'ears', 'a', 'mouse', 'with', 'a', 'mustache', 'and', 'a', 'tail', 'runs', 'fast'],
        ['dog', 'eat', 'meat', 'chicken', 'from', 'bowl', 'dog', 'see', 'meat', 'eat'],
    ],
}
df = pd.DataFrame(data)
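The token_1 column contains no conjunctions or prepositions. I want to recover them from the token_2 column, i.e. find the words that sit between the tokens the two columns have in common.
As I understand it, there are several steps:
- Find the first match for token_1[0] in token_2
- Check whether the next word is shorter than 4 characters; if so, add it to the list. If not, move on to the first match for token_1[1]
- Again check whether the next word is shorter than 4 characters
- Repeat this process until we reach the last token, token_1[2]
- If there is nothing between the tokens, return the tokens as they are
Or is there a simpler way? In the end, I want to get a new_tokens column: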
+-----------------------+---------------------------------+--------------------------------------------------------------------------------------------+
|token_1 |new_tokens |token_2 |
+-----------------------+---------------------------------+--------------------------------------------------------------------------------------------+
|[cat, bag, sitting] |[cat, in, bag, sitting] |[cat, from, bag, cat, in, bag, sitting, whole, day] |
|[dog, eats, bowls] |[dog, eats, from, bowls] |[dog, eats, from, bowls, dog, eats, always, from, bowls, eats, bowl] |
|[mouse, mustache, tail]|[mouse, with, mustache, and, tail]|[mouse, with, a, big, tail, and,ears, a, mouse, with, a, mustache, and, a, tail, runs, fast]|
|[dog, eat, meat] |[dog, eat, meat] |[dog, eat, meat, chicken, from, bowl, dog, see, meat, eat] |
+-----------------------+---------------------------------+--------------------------------------------------------------------------------------------+
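One possible reading of the steps and the expected table above, offered as a sketch rather than a definitive answer: for each row, take the tightest window of token_2 that contains all token_1 tokens in order, then keep the in-between words that are 2 to 4 characters long. The helper name `fill_between` and the `min_len`/`max_len` bounds are assumptions, not part of the question. The question says "shorter than 4", but the expected rows keep 'from' and 'with' (4 characters) and drop 'a', so the bounds here were chosen to match that table.

```python
import pandas as pd

data = {
    'token_1': [['cat', 'bag', 'sitting'],
                ['dog', 'eats', 'bowls'],
                ['mouse', 'mustache', 'tail'],
                ['dog', 'eat', 'meat']],
    'token_2': [['cat', 'from', 'bag', 'cat', 'in', 'bag', 'sitting', 'whole', 'day'],
                ['dog', 'eats', 'from', 'bowls', 'dog', 'eats', 'always', 'from',
                 'bowls', 'eats', 'bowl'],
                # 'and,ears' is written as one string here, which is what the
                # adjacent literals 'and,' 'ears' in the question evaluate to
                ['mouse', 'with', 'a', 'big', 'tail', 'and,ears', 'a', 'mouse',
                 'with', 'a', 'mustache', 'and', 'a', 'tail', 'runs', 'fast'],
                ['dog', 'eat', 'meat', 'chicken', 'from', 'bowl', 'dog', 'see',
                 'meat', 'eat']],
}
df = pd.DataFrame(data)


def fill_between(short, full, min_len=2, max_len=4):
    """Hypothetical helper: `short` plus the filler words between its tokens.

    Assumption: use the tightest window of `full` that contains every token
    of `short` in order, and keep only in-between words whose length lies in
    [min_len, max_len].
    """
    if not short:
        return []
    best = None  # (span, positions of the matched tokens)
    for start, word in enumerate(full):
        if word != short[0]:
            continue
        positions = [start]
        try:
            for token in short[1:]:
                # first occurrence of the next token after the previous match
                positions.append(full.index(token, positions[-1] + 1))
        except ValueError:
            continue  # the remaining tokens do not all occur after `start`
        span = positions[-1] - positions[0]
        if best is None or span < best[0]:
            best = (span, positions)
    if best is None:
        return list(short)  # no in-order match: return the tokens unchanged

    positions = best[1]
    result = [full[positions[0]]]
    for prev, cur in zip(positions, positions[1:]):
        # keep only the short filler words between two matched tokens
        result += [w for w in full[prev + 1:cur] if min_len <= len(w) <= max_len]
        result.append(full[cur])
    return result


df['new_tokens'] = [fill_between(a, b)
                    for a, b in zip(df['token_1'], df['token_2'])]
print(df[['token_1', 'new_tokens', 'token_2']])
```

Choosing the tightest window is what selects "cat in bag sitting" over the earlier "cat from bag" in the first row. Other tie-breaking rules or length bounds would be equally defensible if the intended rule differs.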
【Discussion】: