Add new column with a list of the matching values between two dataframes: answers

【Question title】: Add new column with a list of the matching values between two dataframes
【Posted】: 2021-09-24 19:00:23
【Question】:

I have 2 dataframes with the following structure:

DF1

ItemID     Item
Id1        Item1
Id2        Item2
Id3        Item3
...        ...
1000       Item1000

DF2

Index     ListOfItems
0         [Item1]
1         [Item1, Item3, Item5]
2         [Item2, Item3]
...       ...
N         [NItems]

This would be my expected output:

Index     ListOfItems               ListOfIds
0         [Item1]                   [Id1]
1         [Item1, Item3, Item5]     [Id1, Id3, Id5]
2         [Item2, Item3]            [Id2, Id3]
...       ...                       ...
N         [NItems]                  [NIds]

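Match the ListOfItems of the second dataframe against the Ids of the first dataframe, and create a list of the matching Ids in a new column.

This runs on large dataframes that change constantly, so performance matters. I have tried a few approaches, but they performed poorly.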

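【Comments on the question】:

Can you change DF2 so that you don't need lists inside the dataframe? If you instead used DF1, with 1 row per item, and added a column for whatever causes an item to appear in a given list, your performance would probably be much better. For example, col 'source1' would get 1 for Item1 and 0 for all other items, and 'source2' would get 1 for Item1, Item3 and Item5. Then filter the df to source1 == 1 to produce the list of items for source1.
Unfortunately I can't do that, because DF2 comes from a specific database that holds the information I provided in the example.

Tags: python pandas dataframe loops list-comprehension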

【Solution 1】:

Try:

df2['ListOfIds'] = df2['ListOfItems'].apply(lambda x: df1[df1['Item'].isin(x)].index.to_list())

Sample data:

>>> df1
         Item
ItemID       
Id1     Item1
Id2     Item2
Id3     Item3

>>> df2
                 ListOfItems
Index                       
0                    [Item1]
1      [Item1, Item3, Item5]
2             [Item2, Item3]

Output:

                 ListOfItems   ListOfIds
Index                                   
0                    [Item1]       [Id1]
1      [Item1, Item3, Item5]  [Id1, Id3]
2             [Item2, Item3]  [Id2, Id3]
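
Since the question stresses performance, note that the `isin` approach filters df1 once per row of df2. A possible alternative, sketched below, builds an Item-to-ItemID dictionary once and looks each item up in it. This is not part of the original answer: it assumes ItemID is the index of df1 (as in the sample data), that each Item appears only once in df1, and `item_to_id` and `ids_for` are illustrative names. Unlike the `isin` version, the ids come back in the order of each list rather than in df1's order.

```python
import pandas as pd

# Sample frames mirroring the ones above; ItemID is the index of df1.
df1 = pd.DataFrame(
    {"Item": ["Item1", "Item2", "Item3"]},
    index=pd.Index(["Id1", "Id2", "Id3"], name="ItemID"),
)
df2 = pd.DataFrame(
    {"ListOfItems": [["Item1"], ["Item1", "Item3", "Item5"], ["Item2", "Item3"]]},
    index=pd.Index([0, 1, 2], name="Index"),
)

# Build the lookup once (assumption: each Item appears only once in df1).
item_to_id = dict(zip(df1["Item"], df1.index))


def ids_for(items):
    """Return the ids of the items present in df1, skipping unknown items."""
    return [item_to_id[i] for i in items if i in item_to_id]


df2["ListOfIds"] = df2["ListOfItems"].map(ids_for)
print(df2)
```

Whether this is actually faster depends on the data, so it is worth timing both versions on a realistic sample.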

The solution above expects the values in the ListOfItems column to be lists, not strings. If they are strings, you can convert them to lists as follows:

df2['ListOfItems'] = df2['ListOfItems'].str[1:-1].str.split(',').apply(lambda x: [i.strip() for i in x])
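
As a quick check of that conversion, here is a small sketch on made-up input. It assumes the strings look like "[Item1, Item3]", with unquoted items separated by commas; `raw` and `parsed` are illustrative names.

```python
import pandas as pd

# Hypothetical string-encoded lists, as they might arrive from a database export.
raw = pd.Series(
    ["[Item1]", "[Item1, Item3, Item5]", "[Item2, Item3]"],
    name="ListOfItems",
)

# Drop the surrounding brackets, split on commas, then trim whitespace from each item.
parsed = raw.str[1:-1].str.split(",").apply(lambda x: [i.strip() for i in x])
print(parsed.tolist())
```

If the items themselves were quoted or could contain commas, this simple split would not be enough.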

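【Comments】:

This works great, thanks for the help. The columns are actual lists, so the first option worked fine, but it's good to have the string-to-list option.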

【Solution 2】:

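# Remove the square brackets (assumes ListOfItems holds strings such as "[Item1, Item3]")
df2 = df2.replace(regex={r'\[': '', r'\]': ''})

# In df2, split ListOfItems into a list and explode it, then create a new column
# called ListOfIds by mapping each item to its id (assumes ItemID is a column of df1).
# The lambda makes the map run on the exploded column, not on the original strings.
# Items with no match in df1 map to NaN.
df2 = (
    df2.assign(ListOfItems=df2['ListOfItems'].str.split(r',\s*'))
       .explode('ListOfItems')
       .assign(ListOfIds=lambda d: d['ListOfItems'].map(dict(zip(df1['Item'], df1['ItemID']))))
)

# Group by Index and aggregate each column into lists
df2.groupby('Index').agg(list)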
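
When ListOfItems already holds real lists (as the asker reports), the same explode, map and group idea can be sketched without any string handling. This is an illustrative variant and not the answerer's code: it assumes ItemID is a regular column of df1 and that df2's index is named Index, and the names `item_to_id`, `exploded`, `ids` and `list_of_ids` are made up for the example. It also drops unmatched items instead of leaving NaN in the lists.

```python
import pandas as pd

# Sample data; here ItemID is a regular column of df1 (assumption).
df1 = pd.DataFrame(
    {"ItemID": ["Id1", "Id2", "Id3"], "Item": ["Item1", "Item2", "Item3"]}
)
df2 = pd.DataFrame(
    {"ListOfItems": [["Item1"], ["Item1", "Item3", "Item5"], ["Item2", "Item3"]]},
    index=pd.Index([0, 1, 2], name="Index"),
)

item_to_id = dict(zip(df1["Item"], df1["ItemID"]))

# One row per (Index, item); the original index label is repeated for each item.
exploded = df2["ListOfItems"].explode()

# Items missing from df1 map to NaN and are dropped here.
ids = exploded.map(item_to_id).dropna()

# Collect the ids back into one list per original row.
list_of_ids = ids.groupby(level="Index").agg(list)

# Rows with no matching item at all are absent from list_of_ids; give them [].
df2["ListOfIds"] = list_of_ids.reindex(df2.index).apply(
    lambda v: v if isinstance(v, list) else []
)
print(df2)
```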

【Comments】: