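How to combine two pandas Series that contain lists? Answers

【Question title】: How to combine two pandas series that contain lists?
【Posted】: 2021-09-10 17:10:56
【Question description】:

I have two pandas DataFrames in which each row represents a different author. Each has a column named "publications" that holds the list of publication_ids for that author, with min_len = 1.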

df_1 = pd.DataFrame({'publications':[[34499803], [34499125], [34445802, 7092834]]}, index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications':[[65499803, 56899232], [78999821], [87499234]]}, index=['2234', '543', '345'])

How can I combine them so that the result looks like this?

df_sum = pd.DataFrame({'publications':[[65499803, 56899232, 34499803], [78999821, 34499125], [87499234, 34445802, 7092834]]}, index=['0', '4', '2423'])

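The order of the elements does not matter. I tried +, but I get np.NaN. I also tried add, but it complains about the types (TypeError: unsupported operand type(s) for +: 'float' and 'list').

Note: I edited this question because I realized that the minimal example I first provided did not capture the problem caused by the index. When I merge the two tables, I only care about keeping the df_1 index.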

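【Question comments】:

df_1 + df_2 works like a charm, and so does add. What have you tried? Or is your example not representative of the problem?
Your example works...
I think the problem is the NaNs; you need this to replace them with empty lists
Are the two DataFrames the same length?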
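
The comment about replacing NaNs with empty lists could look roughly like the sketch below. It is not taken from the linked post. The data is hypothetical (the second Series is missing author '4'), and it assumes that every non-list value should become an empty list.

```python
import pandas as pd

# Hypothetical data: s_2 has no entry for author '4'
s_1 = pd.Series([[34499803], [34499125]], index=['0', '4'])
s_2 = pd.Series([[65499803, 56899232]], index=['0'])

# Align s_2 to s_1's labels; the missing label '4' becomes NaN
aligned = s_2.reindex(s_1.index)

# Replace anything that is not a list (here: the NaN) with an empty list
aligned = aligned.apply(lambda x: x if isinstance(x, list) else [])

# Both operands now hold only lists, so + concatenates row by row
combined = s_1 + aligned
print(combined.tolist())
```

This only helps when the two objects share at least some index labels; it does not address completely different indexes like the ones in the edited question.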

Tags: python pandas

【Solution 1】:

The index values differ here, so if both DataFrames have the same length, add reset_index(drop=True):

df = df_1.reset_index(drop=True).add(df_2.reset_index(drop=True))

print (df)
                     publications
0  [34499803, 65499803, 56899232]
1            [34499125, 78999821]
2   [34445802, 7092834, 87499234]

If you need the same index as df_1, use:

df = df_1.add(df_2.set_index(df_1.index))

print (df)
                        publications
0     [34499803, 65499803, 56899232]
4               [34499125, 78999821]
2423   [34445802, 7092834, 87499234]
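
As a quick check of why the index matters here, the sketch below uses the frames from the question and only inspects their labels. Arithmetic between pandas objects aligns on index labels, so when no labels are shared there is no row in which both sides have a list.

```python
import pandas as pd

df_1 = pd.DataFrame({'publications': [[34499803], [34499125], [34445802, 7092834]]},
                    index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications': [[65499803, 56899232], [78999821], [87499234]]},
                    index=['2234', '543', '345'])

# Labels present in both frames: none, so no row can be paired up by label
shared = df_1.index.intersection(df_2.index)
print(len(shared))  # 0

# Label alignment works on the union of both indexes instead
union = df_1.index.union(df_2.index)
print(len(union))  # 6
```

Both reset_index(drop=True) and set_index(df_1.index) work because they give the two frames identical labels before the addition.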

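【Comments】:

Since my dataset has multiple columns, I had to modify this slightly: df_1.loc[idx_1, 'publications'] = df_1.loc[idx_1, 'column'].add(df_2.loc[idx_2, 'column'].values)

【Solution 2】:

I managed to reproduce your problem by adding a single value as a float: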

>>> df_1 = pd.DataFrame({'publications':[[34499803], float(34499125), [34445802, 7092834]]})
>>> df_2 = pd.DataFrame({'publications':[[65499803, 56899232], [78999821], [87499234]]})
>>> df_1+df_2
Traceback (most recent call last):
...
TypeError: unsupported operand type(s) for +: 'float' and 'list'

If that is the case, it can be fixed by converting single values to lists:

>>> df_1["publications"]=df_1["publications"].apply(lambda x: [x] if isinstance(x, float) else x)
>>> df_1+df_2
                     publications
0  [34499803, 65499803, 56899232]
1          [34499125.0, 78999821]
2   [34445802, 7092834, 87499234]
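
Note that the wrapped value stays a float (34499125.0) in the output above. If the ids should stay integers, one possible variation is sketched below. The helper name as_id_list is hypothetical, and it assumes that any whole-number float is really an id; other values, including NaN, are returned unchanged.

```python
import pandas as pd

def as_id_list(value):
    """Hypothetical helper: wrap a whole-number float in a list as an int."""
    if isinstance(value, float) and value.is_integer():
        return [int(value)]
    return value  # lists (and anything else, e.g. NaN) pass through unchanged

df_1 = pd.DataFrame({'publications': [[34499803], float(34499125), [34445802, 7092834]]})
df_2 = pd.DataFrame({'publications': [[65499803, 56899232], [78999821], [87499234]]})

df_1['publications'] = df_1['publications'].apply(as_id_list)
combined = df_1 + df_2
print(combined['publications'].tolist())
```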

【Comments】:

【Solution 3】:

I guess the index values matter

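import pandas as pd

df_1 = pd.DataFrame({'publications':[[34499803], [34499125], [34445802, 7092834]]}, index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications':[[65499803, 56899232], [78999821], [87499234]]}, index=['2234', '543', '345'])
# Keep df_1's index as a regular column named 'index'; discard df_2's index
df_1 = df_1.reset_index(drop=False)
df_2 = df_2.reset_index(drop=True)
df_sum = df_1
# Both frames now share a 0..n-1 index, so the lists are concatenated row by row
df_sum.publications = df_1.publications + df_2.publications
# Restore df_1's original labels as the index
df_sum = df_sum.set_index('index')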

                         publications
index                                
0      [34499803, 65499803, 56899232]
4                [34499125, 78999821]
2423    [34445802, 7092834, 87499234]

This way you keep the index, but it also assumes that both DataFrames have the same length
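
The same-length assumption can be made explicit. The sketch below is one possible way to do that and is not part of the answer above; the function name combine_by_position is hypothetical. It pairs rows by position, keeps the index of the first frame, and raises an error if the lengths differ.

```python
import pandas as pd

def combine_by_position(left, right, column='publications'):
    """Hypothetical helper: concatenate list cells row by row, keeping left's index."""
    if len(left) != len(right):
        raise ValueError(f"length mismatch: {len(left)} vs {len(right)}")
    out = left.copy()
    # zip pairs rows purely by position, so the index labels are ignored here
    merged = [a + b for a, b in zip(left[column], right[column])]
    out[column] = pd.Series(merged, index=left.index)
    return out

df_1 = pd.DataFrame({'publications': [[34499803], [34499125], [34445802, 7092834]]},
                    index=['0', '4', '2423'])
df_2 = pd.DataFrame({'publications': [[65499803, 56899232], [78999821], [87499234]]},
                    index=['2234', '543', '345'])

df_sum = combine_by_position(df_1, df_2)
print(df_sum)
```

Because the helper works on a copy, df_1 is left unchanged, and any other columns in it are carried over as they are.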

【Comments】: