Answers: best way to match one column in a dataframe to multiple columns in another dataframe

【Question title】: Best way to match one column in a dataframe to multiple columns in another dataframe
【Posted】: 2021-07-23 02:02:54
【Question】:

Suppose I have this df1:

import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['Sara', 'John', 'Christine'],
                    'email': ['sara@example.com', 'john@example.com', 'Christine@example.com']})

df1:

    name       email
0   Sara       sara@example.com
1   John       john@example.com
2   Christine  Christine@example.com

df2:

df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com'],
                    'country': ['US', 'BR', 'CA']})

df2:

        email_id            alternate email          alternate email2          country
0   sara@example.com       sara@example.com          sara13@example.com             US
1   NaN                    john.walker@example.com  john@example.com                BR
2   flower@example8.com    Christine33@example.com   Christine@example.com          CA

Now I want to match the email column in df1 against the [email_id, alternate email, alternate email2] columns in df2 and, wherever a match is found, get the name and the country:

Output:

    name         email                   Match
0   Sara         sara@example.com         US
1   John         john@example.com         BR
2   Christine    Christine@example.com    CA

I used the following code, which worked perfectly:

df1['Match'] = np.where((df1['email'].isin(df2['email_id'])) | (df1['email'].isin(df2['alternate email2'])) | (df1['email'].isin(df2['alternate email'])), df2.country , 0)

But on a different dataset I got an error:

ValueError: operands could not be broadcast together with shapes (16622,) (433541,) ()

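So what is the best standard way to match one column in df1 against multiple columns in df2 and merge in the result for each matching row?

【Comments】:

What about the rows that don't match? Do you want to drop them?
@PrantaPalit Unmatched rows should be kept, or filled with NaN / 0.

Tags: python python-3.x pandas dataframe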

【Solution 1】:

Try:

The idea is to merge df1's 'email' column against each column in cols (the columns of df2 that hold email addresses):

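cols = ['email_id', 'alternate email', 'alternate email2']
# Inner-merge df1 against each email column of df2, stack the results,
# keep one row per name, then drop the df2 email columns.
# drop(columns=cols) replaces drop(cols, 1): the positional axis argument
# was removed in pandas 2.0.
out = (pd.concat([df1.merge(df2, left_on='email', right_on=x) for x in cols])
         .drop_duplicates(subset=['name'], ignore_index=True)
         .drop(columns=cols))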

Output of out:

    name        email                   country
0   Sara        sara@example.com        US
1   John        john@example.com        BR
2   Christine   Christine@example.com   CA
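
The merges above are inner merges, so df1 rows with no match in df2 drop out of the result. Since the asker wanted unmatched rows kept and filled with NaN, one possible variation is to reshape df2 into a single email-to-country lookup and left-merge it once. This is only a sketch under some assumptions: the extra 'Omar' row is hypothetical and added purely to show an unmatched row, and if the same address appeared under two different countries only the first occurrence would be kept.

```python
import numpy as np
import pandas as pd

# Sample frames from the question; the 'Omar' row is a hypothetical
# addition so that one df1 row has no match in df2.
df1 = pd.DataFrame({'name': ['Sara', 'John', 'Christine', 'Omar'],
                    'email': ['sara@example.com', 'john@example.com',
                              'Christine@example.com', 'omar@example.com']})
df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com'],
                    'country': ['US', 'BR', 'CA']})

cols = ['email_id', 'alternate email', 'alternate email2']

# Reshape df2 to long form: one row per (country, email) pair.
# Missing addresses are dropped, and each address is kept only once.
lookup = (df2.melt(id_vars='country', value_vars=cols, value_name='email')
             .dropna(subset=['email'])
             .drop_duplicates(subset='email')[['email', 'country']])

# A left merge keeps every df1 row; unmatched rows get NaN in 'Match'.
out = (df1.merge(lookup, on='email', how='left')
          .rename(columns={'country': 'Match'}))
print(out)
```

To get 0 instead of NaN for unmatched rows, out['Match'].fillna(0) can be applied afterwards.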

【Comments】:

What if the two columns have different lengths? It will throw a ValueError.
@PrantaPalit It works in that case as well.
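
The different-length case raised in the comments can be checked with a small sketch. It assumes a shortened, two-row df1 (a hypothetical variation of the question's data) against the three-row df2. np.where pairs the mask and the country column by position, so it fails once the lengths differ, while merge pairs rows by value and is unaffected.

```python
import numpy as np
import pandas as pd

# Hypothetical two-row df1, so that df1 and df2 differ in length.
df1 = pd.DataFrame({'name': ['Sara', 'John'],
                    'email': ['sara@example.com', 'john@example.com']})
df2 = pd.DataFrame({'email_id': ['sara@example.com', np.nan, 'flower@example8.com'],
                    'alternate email': ['sara@example.com', 'john.walker@example.com', 'Christine33@example.com'],
                    'alternate email2': ['sara13@example.com', 'john@example.com', 'Christine@example.com'],
                    'country': ['US', 'BR', 'CA']})

cols = ['email_id', 'alternate email', 'alternate email2']

# The mask has one entry per df1 row (2), but df2['country'] has 3 values,
# so np.where cannot line them up element by element.
mask = df1['email'].isin(df2[cols].to_numpy().ravel())
error = None
try:
    np.where(mask, df2['country'], 0)
except ValueError as exc:
    error = str(exc)
print(error)

# The merge-based approach matches on email values, not on row position.
out = (pd.concat([df1.merge(df2, left_on='email', right_on=x) for x in cols])
         .drop_duplicates(subset=['name'], ignore_index=True)
         .drop(columns=cols))
print(out)
```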