Python Pandas Create New DataFrame Using Existing DataFrame to Query Another DataFrame

【Question title】: Python Pandas Create New DataFrame Using Existing DataFrame to Query Another DataFrame
【Posted】: 2015-05-22 04:55:07
【Question】:

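My goal is to build a new DataFrame called df3.

Using the ['Header 1', 'Header 2', 'Normalized'] values from df1, how do I find the rows in df2 where df2 ['Header 1', 'Header 2', 'Normalized'] equals df1 ['Header 1', 'Header 2', 'Normalized'], and build df3 from the result?

For example, the Header 1, Header 2 and Normalized values in row 0 of df1 equal those in rows 0 and 1 of df2.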

df1

            Header 1 Header 2    Header 3 Normalized    Status Match type
0             Boston  Label 1  "phrase 1"   phrase 1  eligible     Phrase
1       DC/Baltimore  Label 2  [phrase 2]   phrase 2  eligible      Exact
2          Philly/NJ  Label 3  "phrase 3"   phrase 3  eligible     Phrase
3          Philly/NJ  Label 4  "phrase 4"   phrase 4  eligible     Phrase
4          Philly/NJ  Label 5  "phrase 5"   phrase 5  eligible     Phrase
5           Portland  Label 6  "phrase 6"   phrase 6  eligible     Phrase
6  Raleigh/Charlotte  Label 7  [phrase 7]   phrase 7  eligible      Exact
7  Raleigh/Charlotte  Label 8  "phrase 8"   phrase 8  eligible     Phrase

df2

             Header 1  Header 2     Header 3 Normalized    Status Match type
0              Boston   Label 1   +phrase +1   phrase 1  eligible      Broad
1              Boston   Label 1   [phrase 1]   phrase 1  eligible      Exact
2        DC/Baltimore   Label 2   +phrase +2   phrase 2  eligible      Broad
3        DC/Baltimore   Label 2   "phrase 2"   phrase 2  eligible     Phrase
4                Frag  Label 22       [what]       what  eligible      Exact
5           Philly/NJ   Label 3   +phrase +3   phrase 3  eligible      Broad
6           Philly/NJ   Label 4   +phrase +4   phrase 4  eligible      Broad
7           Philly/NJ   Label 5   +phrase +5   phrase 5  eligible      Broad
8           Philly/NJ   Label 3   [phrase 3]   phrase 3  eligible      Exact
9           Philly/NJ   Label 4   [phrase 4]   phrase 4  eligible      Exact
10          Philly/NJ   Label 5   [phrase 5]   phrase 5  eligible      Exact
11           Portland   Label 6   +phrase +6   phrase 6  eligible      Broad
12           Portland   Label 6   [phrase 6]   phrase 6  eligible      Exact
13  Raleigh/Charlotte   Label 7   +phrase +7   phrase 7  eligible      Broad
14  Raleigh/Charlotte   Label 8   +phrase +8   phrase 8  eligible      Broad
15  Raleigh/Charlotte   Label 7   "phrase 7"   phrase 7  eligible     Phrase
16  Raleigh/Charlotte   Label 8   [phrase 8]   phrase 8  eligible      Exact

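For this example, the final df3 would include every row from df1 and every row from df2 except row (index) 4, because its ['Header 1', 'Header 2', 'Normalized'] values do not match any row in df1.

The key thing I don't understand is how to use multiple criteria from one DataFrame to filter the data in another DataFrame.

Edit 1: My end goal is for df3 to look like the table below. The key point is that it combines whole rows of df1 and df2 wherever ['Header 1', 'Header 2', 'Normalized'] are equal. I have tried the merge suggestion. It looks like exactly what I need, but the column headers come back with the suffixes _x and _y appended. How can I produce the output below in one go? Do I have to rename the headers to match the original tables and drop several columns, or is there a better way?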

             Header 1   Header 2  Header 3   Normalized  Status   Match type
0              Boston   Label 1   "phrase 1"   phrase 1  eligible     Phrase
1        DC/Baltimore   Label 2   [phrase 2]   phrase 2  eligible     Exact
2           Philly/NJ   Label 3   "phrase 3"   phrase 3  eligible     Phrase
3           Philly/NJ   Label 4   "phrase 4"   phrase 4  eligible     Phrase
4           Philly/NJ   Label 5   "phrase 5"   phrase 5  eligible     Phrase
5            Portland   Label 6   "phrase 6"   phrase 6  eligible     Phrase
6   Raleigh/Charlotte   Label 7   [phrase 7]   phrase 7  eligible     Exact
7   Raleigh/Charlotte   Label 8   "phrase 8"   phrase 8  eligible     Phrase
0              Boston   Label 1   +phrase +1   phrase 1  eligible     Broad
1              Boston   Label 1   [phrase 1]   phrase 1  eligible     Exact
2        DC/Baltimore   Label 2   +phrase +2   phrase 2  eligible     Broad
3        DC/Baltimore   Label 2   "phrase 2"   phrase 2  eligible     Phrase
5           Philly/NJ   Label 3   +phrase +3   phrase 3  eligible     Broad
6           Philly/NJ   Label 4   +phrase +4   phrase 4  eligible     Broad
7           Philly/NJ   Label 5   +phrase +5   phrase 5  eligible     Broad
8           Philly/NJ   Label 3   [phrase 3]   phrase 3  eligible     Exact
9           Philly/NJ   Label 4   [phrase 4]   phrase 4  eligible     Exact
10          Philly/NJ   Label 5   [phrase 5]   phrase 5  eligible     Exact
11           Portland   Label 6   +phrase +6   phrase 6  eligible     Broad
12           Portland   Label 6   [phrase 6]   phrase 6  eligible     Exact
13  Raleigh/Charlotte   Label 7   +phrase +7   phrase 7  eligible     Broad
14  Raleigh/Charlotte   Label 8   +phrase +8   phrase 8  eligible     Broad
15  Raleigh/Charlotte   Label 7   "phrase 7"   phrase 7  eligible     Phrase
16  Raleigh/Charlotte   Label 8   [phrase 8]   phrase 8  eligible     Exact

【Comments】:

What you need to decide is the criterion for whether the final values should come from the left side or the right side.

Tags: python-3.x pandas merge

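【Solution 1】:

This is a perfect example of where you would want to use pandas.merge (basically the pandas equivalent of a SQL JOIN, except that column equality is the only join condition allowed):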

df3 = pandas.merge(df1, df2, on=['Header 1','Header 2', 'Normalized'])
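
To make the behaviour of that call concrete, here is a minimal sketch on a cut-down copy of the question's data. The two small frames are stand-ins assumed for illustration (only a few rows and columns of the real df1 and df2). It shows that merge defaults to an inner join on the key columns and that any non-key column present on both sides comes back twice, with _x and _y suffixes.

```python
import pandas as pd

keys = ['Header 1', 'Header 2', 'Normalized']

# Cut-down stand-ins for df1 and df2 (assumed for illustration)
df1 = pd.DataFrame({
    'Header 1': ['Boston', 'DC/Baltimore'],
    'Header 2': ['Label 1', 'Label 2'],
    'Header 3': ['"phrase 1"', '[phrase 2]'],
    'Normalized': ['phrase 1', 'phrase 2'],
})
df2 = pd.DataFrame({
    'Header 1': ['Boston', 'Boston', 'Frag'],
    'Header 2': ['Label 1', 'Label 1', 'Label 22'],
    'Header 3': ['+phrase +1', '[phrase 1]', '[what]'],
    'Normalized': ['phrase 1', 'phrase 1', 'what'],
})

# how defaults to 'inner': only key combinations found in both frames survive
merged = pd.merge(df1, df2, on=keys)

# 'Header 3' exists on both sides and is not a key, so it is split into
# 'Header 3_x' (from df1) and 'Header 3_y' (from df2)
print(merged)
```

In this sketch the Boston row of df1 is repeated once per matching df2 row, while the DC/Baltimore and Frag rows disappear because each has no partner on the other side.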

【Solution 2】:

I think you want a left-style merge:

In [99]:

df1.merge(df2, on=['Header 1','Header 2', 'Normalized'], how='left')
Out[99]:
             Header 1 Header 2  Header 3_x Normalized  Status_x Match type_x  \
0              Boston  Label 1    phrase 1   phrase 1  eligible       Phrase   
1              Boston  Label 1    phrase 1   phrase 1  eligible       Phrase   
2        DC/Baltimore  Label 2  [phrase 2]   phrase 2  eligible        Exact   
3        DC/Baltimore  Label 2  [phrase 2]   phrase 2  eligible        Exact   
4           Philly/NJ  Label 3    phrase 3   phrase 3  eligible       Phrase   
5           Philly/NJ  Label 3    phrase 3   phrase 3  eligible       Phrase   
6           Philly/NJ  Label 4    phrase 4   phrase 4  eligible       Phrase   
7           Philly/NJ  Label 4    phrase 4   phrase 4  eligible       Phrase   
8           Philly/NJ  Label 5    phrase 5   phrase 5  eligible       Phrase   
9           Philly/NJ  Label 5    phrase 5   phrase 5  eligible       Phrase   
10           Portland  Label 6    phrase 6   phrase 6  eligible       Phrase   
11           Portland  Label 6    phrase 6   phrase 6  eligible       Phrase   
12  Raleigh/Charlotte  Label 7  [phrase 7]   phrase 7  eligible        Exact   
13  Raleigh/Charlotte  Label 7  [phrase 7]   phrase 7  eligible        Exact   
14  Raleigh/Charlotte  Label 8    phrase 8   phrase 8  eligible       Phrase   
15  Raleigh/Charlotte  Label 8    phrase 8   phrase 8  eligible       Phrase   

    Header 3_y  Status_y Match type_y  
0   +phrase +1  eligible        Broad  
1   [phrase 1]  eligible        Exact  
2   +phrase +2  eligible        Broad  
3     phrase 2  eligible       Phrase  
4   +phrase +3  eligible        Broad  
5   [phrase 3]  eligible        Exact  
6   +phrase +4  eligible        Broad  
7   [phrase 4]  eligible        Exact  
8   +phrase +5  eligible        Broad  
9   [phrase 5]  eligible        Exact  
10  +phrase +6  eligible        Broad  
11  [phrase 6]  eligible        Exact  
12  +phrase +7  eligible        Broad  
13    phrase 7  eligible       Phrase  
14  +phrase +8  eligible        Broad  
15  [phrase 8]  eligible        Exact
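
On the question's data a left merge and an inner merge give the same rows, because every df1 row has a match in df2. The difference only shows up when a left-hand row has no partner. The sketch below assumes a hypothetical 'Nowhere' row (not in the original data) to illustrate this, and uses the indicator=True option of merge to flag where each row came from.

```python
import pandas as pd

keys = ['Header 1', 'Header 2', 'Normalized']

# 'Nowhere' / 'Label 9' is a made-up row with no counterpart on the right
left = pd.DataFrame({
    'Header 1': ['Boston', 'Nowhere'],
    'Header 2': ['Label 1', 'Label 9'],
    'Normalized': ['phrase 1', 'phrase 9'],
    'Match type': ['Phrase', 'Exact'],
})
right = pd.DataFrame({
    'Header 1': ['Boston', 'Boston'],
    'Header 2': ['Label 1', 'Label 1'],
    'Normalized': ['phrase 1', 'phrase 1'],
    'Match type': ['Broad', 'Exact'],
})

# Left join: every row of `left` is kept; unmatched rows get NaN on the right side
left_join = left.merge(right, on=keys, how='left', indicator=True)

# Inner join: rows without a match on both sides are dropped
inner_join = left.merge(right, on=keys, how='inner')

print(left_join)
print(inner_join)
```

If rows without a match should not appear in df3, the inner join is probably the safer choice; the left join would carry them along with missing values in the _y columns.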

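【Comments】:

I added Edit 1 to the question above. I see that merge applies suffixes=('_x', '_y') to overlapping columns. What is the best way to format this DataFrame so that all of its columns match the same columns in the original DataFrames, so that I can concatenate df1 with this new df3? Do I just need to drop the columns I don't want (the ones with the _x suffix) and rename the headers of the columns I do want, or is there a better way?
You can drop them, rename them, or not select them in the merge. They end up in the merge because the values clash on the lhs and rhs, so it generates a column for each side to preserve them.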
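
The third option, not selecting the clashing columns in the merge, could look roughly like the sketch below. It merges df2 against only the key columns of df1, so no non-key column overlaps and no suffixes are produced. The small frames are cut-down stand-ins assumed for illustration, and drop_duplicates is added on the assumption that repeated key combinations in df1 should not multiply the df2 rows.

```python
import pandas as pd

keys = ['Header 1', 'Header 2', 'Normalized']

# Cut-down stand-ins for df1 and df2 (assumed for illustration)
df1 = pd.DataFrame({
    'Header 1': ['Boston', 'DC/Baltimore'],
    'Header 2': ['Label 1', 'Label 2'],
    'Header 3': ['"phrase 1"', '[phrase 2]'],
    'Normalized': ['phrase 1', 'phrase 2'],
    'Match type': ['Phrase', 'Exact'],
})
df2 = pd.DataFrame({
    'Header 1': ['Boston', 'Boston', 'DC/Baltimore', 'Frag'],
    'Header 2': ['Label 1', 'Label 1', 'Label 2', 'Label 22'],
    'Header 3': ['+phrase +1', '[phrase 1]', '+phrase +2', '[what]'],
    'Normalized': ['phrase 1', 'phrase 1', 'phrase 2', 'what'],
    'Match type': ['Broad', 'Exact', 'Broad', 'Exact'],
})

# Only df1's key columns take part in the merge, so every remaining column
# comes from df2 and keeps its original label
matched = df2.merge(df1[keys].drop_duplicates(), on=keys, how='inner')

# Stack df1 on top of the matching df2 rows
df3 = pd.concat([df1, matched], ignore_index=True)
print(df3)
```

Because the result already has df2's column labels and order, no drop or rename step is needed before the concat.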

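【Solution 3】:

I can't take credit for answering this question. All credit goes to @EdChum and @maxymoo, who pointed me in the right direction with merge. I don't know how efficient this is, but I'm posting it in case anyone stumbles across a similar problem.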

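import pandas as pd

df1 = pd.read_csv('df1.csv')
df2 = pd.read_csv('df2.csv')

# Left merge on the three key columns; overlapping non-key columns get _x (df1) and _y (df2) suffixes
df3 = pd.merge(df1, df2, on=['Header 1', 'Header 2', 'Normalized'], how='left')
# Keep only the df2 side of each overlapping column
df3 = df3.drop(['Header 3_x', 'Status_x', 'Match type_x'], axis=1)
# Relabel the remaining columns so the _y suffixes disappear
df3.columns = ['Header 1', 'Header 2', 'Normalized', 'Header 3', 'Status', 'Match type']
# Restore the column order of the original tables
df3 = df3[['Header 1', 'Header 2', 'Header 3', 'Normalized', 'Status', 'Match type']]

# Stack df1 on top of the matching df2 rows
print(pd.concat([df1, df3]))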

Output:

             Header 1 Header 2     Header 3 Normalized    Status Match type
0              Boston  Label 1   "phrase 1"   phrase 1  eligible     Phrase
1        DC/Baltimore  Label 2   [phrase 2]   phrase 2  eligible      Exact
2           Philly/NJ  Label 3   "phrase 3"   phrase 3  eligible     Phrase
3           Philly/NJ  Label 4   "phrase 4"   phrase 4  eligible     Phrase
4           Philly/NJ  Label 5   "phrase 5"   phrase 5  eligible     Phrase
5            Portland  Label 6   "phrase 6"   phrase 6  eligible     Phrase
6   Raleigh/Charlotte  Label 7   [phrase 7]   phrase 7  eligible      Exact
7   Raleigh/Charlotte  Label 8   "phrase 8"   phrase 8  eligible     Phrase
0              Boston  Label 1   +phrase +1   phrase 1  eligible      Broad
1              Boston  Label 1   [phrase 1]   phrase 1  eligible      Exact
2        DC/Baltimore  Label 2   +phrase +2   phrase 2  eligible      Broad
3        DC/Baltimore  Label 2   "phrase 2"   phrase 2  eligible     Phrase
4           Philly/NJ  Label 3   +phrase +3   phrase 3  eligible      Broad
5           Philly/NJ  Label 3   [phrase 3]   phrase 3  eligible      Exact
6           Philly/NJ  Label 4   +phrase +4   phrase 4  eligible      Broad
7           Philly/NJ  Label 4   [phrase 4]   phrase 4  eligible      Exact
8           Philly/NJ  Label 5   +phrase +5   phrase 5  eligible      Broad
9           Philly/NJ  Label 5   [phrase 5]   phrase 5  eligible      Exact
10           Portland  Label 6   +phrase +6   phrase 6  eligible      Broad
11           Portland  Label 6   [phrase 6]   phrase 6  eligible      Exact
12  Raleigh/Charlotte  Label 7   +phrase +7   phrase 7  eligible      Broad
13  Raleigh/Charlotte  Label 7   "phrase 7"   phrase 7  eligible     Phrase
14  Raleigh/Charlotte  Label 8   +phrase +8   phrase 8  eligible      Broad
15  Raleigh/Charlotte  Label 8   [phrase 8]   phrase 8  eligible      Exact
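
Note that the output above renumbers the df2 rows 0 to 15, whereas the table in Edit 1 keeps df2's original index labels (0 to 3, then 5 to 16). If those labels matter, one possible alternative is to filter df2 with a boolean mask instead of merging. The sketch below is an assumption about how that could be done, using cut-down stand-in frames: it compares the key columns as a MultiIndex and keeps df2's rows, labels and column names untouched.

```python
import pandas as pd

keys = ['Header 1', 'Header 2', 'Normalized']

# Cut-down stand-ins for df1 and df2 (assumed for illustration)
df1 = pd.DataFrame({
    'Header 1': ['Boston', 'DC/Baltimore'],
    'Header 2': ['Label 1', 'Label 2'],
    'Header 3': ['"phrase 1"', '[phrase 2]'],
    'Normalized': ['phrase 1', 'phrase 2'],
})
df2 = pd.DataFrame({
    'Header 1': ['Boston', 'Boston', 'Frag', 'DC/Baltimore'],
    'Header 2': ['Label 1', 'Label 1', 'Label 22', 'Label 2'],
    'Header 3': ['+phrase +1', '[phrase 1]', '[what]', '+phrase +2'],
    'Normalized': ['phrase 1', 'phrase 1', 'what', 'phrase 2'],
})

# Build one (Header 1, Header 2, Normalized) tuple per row on each side
df1_keys = df1.set_index(keys).index
df2_keys = df2.set_index(keys).index

# True where a df2 row's key combination also occurs somewhere in df1
mask = df2_keys.isin(df1_keys)

# df2[mask] keeps df2's own index labels, so the unmatched row simply leaves a gap
df3 = pd.concat([df1, df2[mask]])
print(df3)
```

Here the Frag row (label 2 in this sketch) is skipped and the remaining df2 rows keep labels 0, 1 and 3, which mirrors how row 4 is missing from the desired table in the question.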
