【Question title】: Filter a list-of-lists column, then split (explode) it row-wise in Python
【Posted】: 2021-09-14 12:27:29
【Question】:

Suppose I have a DataFrame with a column that contains nested lists:

   id                                                pos
0   1  [[['Malaysia','NR'], [':','PU'], ['Natural','JJ'], ['selling price','NN']]]
1   2  [[['Spot Price','NN'], [':','PU'], ['cotton','NN'], ['India', ' NR']]]

Or, in dictionary format:

[{'id': 1,
  'pos': "[[['Malaysia','NR'], [':','PU'], ['Natural','JJ'], ['selling price','NN']]]"},
 {'id': 2,
  'pos': "[[['Spot Price','NN'], [':','PU'], ['cotton','NN'], ['India', ' NR']]]"}]
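
Note that in the dictionary form above each `pos` value is a string rather than a nested list. If the real column holds strings like this, it presumably has to be parsed before any explode step. A minimal sketch, assuming the strings are valid Python literals and using the standard library's `ast.literal_eval` (the names `records` and `df` are only illustrative):

```python
import ast
import pandas as pd

records = [
    {'id': 1,
     'pos': "[[['Malaysia','NR'], [':','PU'], ['Natural','JJ'], ['selling price','NN']]]"},
    {'id': 2,
     'pos': "[[['Spot Price','NN'], [':','PU'], ['cotton','NN'], ['India', ' NR']]]"},
]
df = pd.DataFrame(records)

# Each cell is a str; literal_eval safely turns it into nested Python lists
df['pos'] = df['pos'].apply(ast.literal_eval)

print(type(df.loc[0, 'pos']))   # <class 'list'>
print(df.loc[0, 'pos'][0][0])   # ['Malaysia', 'NR']
```

If the column already contains real lists, this step is unnecessary.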

How can I keep only the sublists whose second element is NR or NN, and then split (explode) the pos column row-wise, to get this:

   id          words part_of_speech
0   1       Malaysia             NR
1   1  selling price             NN
2   2     Spot Price             NN
3   2         cotton             NN
4   2          India             NR

How can I achieve this in Python? Thanks.

Code I tried:

l = [[['Malaysia','NR'], [':','PU'], ['Natural','JJ'], ['selling price','NN']]]
for elem in l[0]:
    print(elem[1])

Output:

NR
PU
JJ
NN

【Comments】:

  • Could you share it as a dictionary: df.to_dict('records')
  • Sure, sorry. I have updated the question, please check.

Tags: python python-3.x pandas dataframe


【Solution 1】:

You can try this with explode:

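import pandas as pd
# explode twice: once for the outer wrapper list, once for the [word, tag] pairs
x = df.explode('pos').explode('pos')
# build both new columns at once from the pairs and attach them to the id column
x = x[['id']].reset_index(drop=True).join(pd.DataFrame(x['pos'].tolist()).set_axis(['words', 'part_of_speech'], axis=1))
x.loc[x['part_of_speech'].isin(['NN', 'NR'])]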

   id          words part_of_speech
0   1       Malaysia             NR
3   1  selling price             NN
4   2     Spot Price             NN
6   2         cotton             NN
7   2          India             NR

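This solution scales easily to DataFrames of any length: instead of assigning the columns one by one, it assigns them all at once, so it also works for sublists of any length.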
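
As a rough illustration of that claim, here is a self-contained sketch of the same approach on rows that hold different numbers of pairs. The sample data is an assumption: it is adapted from the question, with the second row shortened on purpose, and the names `pairs` and `result` are illustrative only.

```python
import pandas as pd

# Hypothetical sample: row 1 has four [word, tag] pairs, row 2 only two
df = pd.DataFrame({
    'id': [1, 2],
    'pos': [
        [[['Malaysia', 'NR'], [':', 'PU'], ['Natural', 'JJ'], ['selling price', 'NN']]],
        [[['Spot Price', 'NN'], ['cotton', 'NN']]],
    ],
})

# First explode removes the wrapper list, second yields one pair per row
x = df.explode('pos').explode('pos')

# Build both columns in one step from the list of pairs
pairs = pd.DataFrame(x['pos'].tolist(), columns=['words', 'part_of_speech'])
x = x[['id']].reset_index(drop=True).join(pairs)

# Keep only the NN / NR rows and renumber them
result = x.loc[x['part_of_speech'].isin(['NN', 'NR'])].reset_index(drop=True)
print(result)
```

The final `reset_index(drop=True)` is optional; it only renumbers the filtered rows from 0.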

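【Comments】:

  • This should raise an error on the first line, since str only works on a Series, and going through the DataFrame constructor should be slower (there is no need to plan for scaling if nothing is going to scale) ;)
  • @mozway Ah, yes! Sorry, I have edited my answer!
【Solution 2】: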

Here is a working solution that explodes first and then filters. I think this should be more efficient, since it does not need a loop:

# get rid of unnecessary level of nesting
df['pos'] = df['pos'].str[0]
# explode the list
df = df.explode('pos')
# split the two items to separate columns
df['words'] = df['pos'].str[0]
df['part_of_speech'] = df['pos'].str[1]
# filter output
df.drop('pos', axis=1)[df['part_of_speech'].isin(['NR', 'NN'])]

Output:

   id          words part_of_speech
0   1       Malaysia             NR
0   1  selling price             NN
1   2     Spot Price             NN
1   2         cotton             NN
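
The output above has no India row, most likely because the sample data tags it as ' NR' with a leading space, which does not match 'NR'. A minimal sketch of the same steps with the tag stripped before filtering (adding `.str.strip()` is an assumption about the intended result, not part of the original answer, and `out` is an illustrative name):

```python
import pandas as pd

df = pd.DataFrame({
    'id': [1, 2],
    'pos': [
        [[['Malaysia', 'NR'], [':', 'PU'], ['Natural', 'JJ'], ['selling price', 'NN']]],
        [[['Spot Price', 'NN'], [':', 'PU'], ['cotton', 'NN'], ['India', ' NR']]],
    ],
})

df['pos'] = df['pos'].str[0]          # drop the outer wrapper list
df = df.explode('pos')                # one [word, tag] pair per row
df['words'] = df['pos'].str[0]
# .str.strip() removes the stray space in ' NR' before filtering
df['part_of_speech'] = df['pos'].str[1].str.strip()

out = df.loc[df['part_of_speech'].isin(['NR', 'NN']),
             ['id', 'words', 'part_of_speech']]
print(out)
```

With the whitespace removed, the India row passes the filter along with the other four.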
