【Question title】:Python Pandas Create New DataFrame Using Existing DataFrame to Query Another DataFrame
【Posted】:2015-05-22 04:55:07
【Question description】:

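My goal is to build a new DataFrame called df3.

Using the df1 ['Header 1', 'Header 2', 'Normalized'] values, how can I find the rows in df2 where df1 ['Header 1', 'Header 2', 'Normalized'] equals df2 ['Header 1', 'Header 2', 'Normalized'] and build df3 from the result?

For example, in row 0 of df1, Header 1, Header 2, and Normalized equal those of rows 0 and 1 of df2.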

df1

            Header 1 Header 2    Header 3 Normalized    Status Match type
0             Boston  Label 1  "phrase 1"   phrase 1  eligible     Phrase
1       DC/Baltimore  Label 2  [phrase 2]   phrase 2  eligible      Exact
2          Philly/NJ  Label 3  "phrase 3"   phrase 3  eligible     Phrase
3          Philly/NJ  Label 4  "phrase 4"   phrase 4  eligible     Phrase
4          Philly/NJ  Label 5  "phrase 5"   phrase 5  eligible     Phrase
5           Portland  Label 6  "phrase 6"   phrase 6  eligible     Phrase
6  Raleigh/Charlotte  Label 7  [phrase 7]   phrase 7  eligible      Exact
7  Raleigh/Charlotte  Label 8  "phrase 8"   phrase 8  eligible     Phrase

df2

             Header 1  Header 2     Header 3 Normalized    Status Match type
0              Boston   Label 1   +phrase +1   phrase 1  eligible      Broad
1              Boston   Label 1   [phrase 1]   phrase 1  eligible      Exact
2        DC/Baltimore   Label 2   +phrase +2   phrase 2  eligible      Broad
3        DC/Baltimore   Label 2   "phrase 2"   phrase 2  eligible     Phrase
4                Frag  Label 22       [what]       what  eligible      Exact
5           Philly/NJ   Label 3   +phrase +3   phrase 3  eligible      Broad
6           Philly/NJ   Label 4   +phrase +4   phrase 4  eligible      Broad
7           Philly/NJ   Label 5   +phrase +5   phrase 5  eligible      Broad
8           Philly/NJ   Label 3   [phrase 3]   phrase 3  eligible      Exact
9           Philly/NJ   Label 4   [phrase 4]   phrase 4  eligible      Exact
10          Philly/NJ   Label 5   [phrase 5]   phrase 5  eligible      Exact
11           Portland   Label 6   +phrase +6   phrase 6  eligible      Broad
12           Portland   Label 6   [phrase 6]   phrase 6  eligible      Exact
13  Raleigh/Charlotte   Label 7   +phrase +7   phrase 7  eligible      Broad
14  Raleigh/Charlotte   Label 8   +phrase +8   phrase 8  eligible      Broad
15  Raleigh/Charlotte   Label 7   "phrase 7"   phrase 7  eligible     Phrase
16  Raleigh/Charlotte   Label 8   [phrase 8]   phrase 8  eligible      Exact

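The final df3 for this example would include all rows from df1 and every row from df2 except row (index) 4, because its ['Header 1', 'Header 2', 'Normalized'] values do not match any row in df1.

The key thing I don't understand is how to use multiple conditions from one DataFrame to filter the data in another DataFrame.

Edit 1: My end goal is for df3 to look like the table below. The key thing to note is that it merges whole rows of df1 and df2 where ['Header 1', 'Header 2', 'Normalized'] are equal. I have tried the merge suggestion. It looks like almost exactly what I need, but I see column headers with the suffixes _x and _y appended. How can I output the following in one go? Do I have to change the header labels to match those of the original tables and drop a few columns? Or is there a better way?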

             Header 1   Header 2  Header 3   Normalized  Status   Match type
0              Boston   Label 1   "phrase 1"   phrase 1  eligible     Phrase
1        DC/Baltimore   Label 2   [phrase 2]   phrase 2  eligible     Exact
2           Philly/NJ   Label 3   "phrase 3"   phrase 3  eligible     Phrase
3           Philly/NJ   Label 4   "phrase 4"   phrase 4  eligible     Phrase
4           Philly/NJ   Label 5   "phrase 5"   phrase 5  eligible     Phrase
5            Portland   Label 6   "phrase 6"   phrase 6  eligible     Phrase
6   Raleigh/Charlotte   Label 7   [phrase 7]   phrase 7  eligible     Exact
7   Raleigh/Charlotte   Label 8   "phrase 8"   phrase 8  eligible     Phrase
0              Boston   Label 1   +phrase +1   phrase 1  eligible     Broad
1              Boston   Label 1   [phrase 1]   phrase 1  eligible     Exact
2        DC/Baltimore   Label 2   +phrase +2   phrase 2  eligible     Broad
3        DC/Baltimore   Label 2   "phrase 2"   phrase 2  eligible     Phrase
5           Philly/NJ   Label 3   +phrase +3   phrase 3  eligible     Broad
6           Philly/NJ   Label 4   +phrase +4   phrase 4  eligible     Broad
7           Philly/NJ   Label 5   +phrase +5   phrase 5  eligible     Broad
8           Philly/NJ   Label 3   [phrase 3]   phrase 3  eligible     Exact
9           Philly/NJ   Label 4   [phrase 4]   phrase 4  eligible     Exact
10          Philly/NJ   Label 5   [phrase 5]   phrase 5  eligible     Exact
11           Portland   Label 6   +phrase +6   phrase 6  eligible     Broad
12           Portland   Label 6   [phrase 6]   phrase 6  eligible     Exact
13  Raleigh/Charlotte   Label 7   +phrase +7   phrase 7  eligible     Broad
14  Raleigh/Charlotte   Label 8   +phrase +8   phrase 8  eligible     Broad
15  Raleigh/Charlotte   Label 7   "phrase 7"   phrase 7  eligible     Phrase
16  Raleigh/Charlotte   Label 8   [phrase 8]   phrase 8  eligible     Exact

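【Question discussion】:

  • Your problem is defining what the criterion should be for whether the final values come from the left-hand side or the right-hand side

Tags: python-3.x pandas merge


【Solution 1】:

This is a perfect example of where you want to use pandas.merge (basically the pandas equivalent of a SQL JOIN, except that column equality is the only join condition allowed):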

df3 = pandas.merge(df1, df2, on=['Header 1','Header 2', 'Normalized'])
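
As a minimal sketch of what this merge does, assuming two tiny made-up frames that stand in for the tables in the question (the name `keys` and the sample rows are illustrative only): an inner merge keeps just the key combinations present on both sides, and any non-key column that exists in both frames comes back twice with `_x` and `_y` suffixes.

```python
import pandas as pd

# Columns that must be equal for two rows to be considered a match
keys = ["Header 1", "Header 2", "Normalized"]

# Illustrative stand-ins, much smaller than the tables in the question
df1 = pd.DataFrame({
    "Header 1": ["Boston", "Portland"],
    "Header 2": ["Label 1", "Label 6"],
    "Normalized": ["phrase 1", "phrase 6"],
    "Match type": ["Phrase", "Phrase"],
})
df2 = pd.DataFrame({
    "Header 1": ["Boston", "Boston", "Frag"],
    "Header 2": ["Label 1", "Label 1", "Label 22"],
    "Normalized": ["phrase 1", "phrase 1", "what"],
    "Match type": ["Broad", "Exact", "Exact"],
})

# how="inner" is the default: unmatched rows on either side are dropped
merged = pd.merge(df1, df2, on=keys)
print(merged)
```

In this sketch the `Frag` row of `df2` and the `Portland` row of `df1` have no partner, so neither appears in `merged`; the shared `Match type` column shows up as `Match type_x` (from `df1`) and `Match type_y` (from `df2`).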

【Discussion】:

    【Solution 2】:

    I think you want a left-style merge:

    In [99]:
    
    df1.merge(df2, on=['Header 1','Header 2', 'Normalized'], how='left')
    Out[99]:
                 Header 1 Header 2  Header 3_x Normalized  Status_x Match type_x  \
    0              Boston  Label 1    phrase 1   phrase 1  eligible       Phrase   
    1              Boston  Label 1    phrase 1   phrase 1  eligible       Phrase   
    2        DC/Baltimore  Label 2  [phrase 2]   phrase 2  eligible        Exact   
    3        DC/Baltimore  Label 2  [phrase 2]   phrase 2  eligible        Exact   
    4           Philly/NJ  Label 3    phrase 3   phrase 3  eligible       Phrase   
    5           Philly/NJ  Label 3    phrase 3   phrase 3  eligible       Phrase   
    6           Philly/NJ  Label 4    phrase 4   phrase 4  eligible       Phrase   
    7           Philly/NJ  Label 4    phrase 4   phrase 4  eligible       Phrase   
    8           Philly/NJ  Label 5    phrase 5   phrase 5  eligible       Phrase   
    9           Philly/NJ  Label 5    phrase 5   phrase 5  eligible       Phrase   
    10           Portland  Label 6    phrase 6   phrase 6  eligible       Phrase   
    11           Portland  Label 6    phrase 6   phrase 6  eligible       Phrase   
    12  Raleigh/Charlotte  Label 7  [phrase 7]   phrase 7  eligible        Exact   
    13  Raleigh/Charlotte  Label 7  [phrase 7]   phrase 7  eligible        Exact   
    14  Raleigh/Charlotte  Label 8    phrase 8   phrase 8  eligible       Phrase   
    15  Raleigh/Charlotte  Label 8    phrase 8   phrase 8  eligible       Phrase   
    
        Header 3_y  Status_y Match type_y  
    0   +phrase +1  eligible        Broad  
    1   [phrase 1]  eligible        Exact  
    2   +phrase +2  eligible        Broad  
    3     phrase 2  eligible       Phrase  
    4   +phrase +3  eligible        Broad  
    5   [phrase 3]  eligible        Exact  
    6   +phrase +4  eligible        Broad  
    7   [phrase 4]  eligible        Exact  
    8   +phrase +5  eligible        Broad  
    9   [phrase 5]  eligible        Exact  
    10  +phrase +6  eligible        Broad  
    11  [phrase 6]  eligible        Exact  
    12  +phrase +7  eligible        Broad  
    13    phrase 7  eligible       Phrase  
    14  +phrase +8  eligible        Broad  
    15  [phrase 8]  eligible        Exact  
    

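    【Discussion】:

    • I added Edit 1 to the question above. I see that merge has a suffixes=('_x', '_y') parameter that is applied to overlapping columns. What is the best way to format this DataFrame so that all of its columns match the same columns in the original DataFrames, so that I can concatenate df1 with this new df3? Do I just need to drop the columns I don't want (the ones with the _x suffix) and rename the headers of the columns I do want, or is there a better way?
    • You can drop and rename, or not select them in the merge. The reason they end up in the merge is that the values clash on the lhs and rhs, so it generates a column for each side to preserve them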
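
A minimal sketch of the "not selecting them in the merge" option, assuming the goal is the matching df2 rows under their original headers. The small frames and the names `keys` and `matched` are illustrative, not taken from the thread. Because only the key columns of df1 take part in the merge, nothing overlaps and no suffixes are generated.

```python
import pandas as pd

keys = ["Header 1", "Header 2", "Normalized"]

# Illustrative stand-ins for the tables in the question
df1 = pd.DataFrame({
    "Header 1": ["Boston", "Portland"],
    "Header 2": ["Label 1", "Label 6"],
    "Header 3": ['"phrase 1"', '"phrase 6"'],
    "Normalized": ["phrase 1", "phrase 6"],
})
df2 = pd.DataFrame({
    "Header 1": ["Boston", "Frag", "Portland"],
    "Header 2": ["Label 1", "Label 22", "Label 6"],
    "Header 3": ["+phrase +1", "[what]", "[phrase 6]"],
    "Normalized": ["phrase 1", "what", "phrase 6"],
})

# Merge df2 against the distinct key combinations of df1 only;
# drop_duplicates() keeps repeated keys in df1 from duplicating df2 rows
matched = df2.merge(df1[keys].drop_duplicates(), on=keys)

# Stack df1 on top of the matching df2 rows
df3 = pd.concat([df1, matched])
print(df3)
```

Note that merge returns a fresh 0..n-1 index, so the original df2 row labels are not carried into `matched`.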
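    【Solution 3】:

    I can't take credit for answering this question. All credit goes to @EdChum and @maxymoo, who steered me in the right direction with merge. I don't know how efficient this is, but I'm posting it in case someone stumbles on a similar problem.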

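    import pandas as pd
    
    df1 = pd.read_csv('df1.csv')
    df2 = pd.read_csv('df2.csv')
    
    # Left-merge on the three key columns; overlapping non-key columns
    # come back with _x (from df1) and _y (from df2) suffixes
    df3 = pd.merge(df1, df2, on=['Header 1', 'Header 2', 'Normalized'], how='left')
    # Drop df1's copies so only the df2 values remain
    df3 = df3.drop(['Header 3_x', 'Status_x', 'Match type_x'], axis=1)
    # Restore the original header names, then the original column order
    df3.columns = ['Header 1', 'Header 2', 'Normalized', 'Header 3', 'Status', 'Match type']
    df3 = df3[['Header 1', 'Header 2', 'Header 3', 'Normalized', 'Status', 'Match type']]
    
    # Stack df1 on top of the matching df2 rows
    print(pd.concat([df1, df3]))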
    

    Output:

                 Header 1 Header 2     Header 3 Normalized    Status Match type
    0              Boston  Label 1   "phrase 1"   phrase 1  eligible     Phrase
    1        DC/Baltimore  Label 2   [phrase 2]   phrase 2  eligible      Exact
    2           Philly/NJ  Label 3   "phrase 3"   phrase 3  eligible     Phrase
    3           Philly/NJ  Label 4   "phrase 4"   phrase 4  eligible     Phrase
    4           Philly/NJ  Label 5   "phrase 5"   phrase 5  eligible     Phrase
    5            Portland  Label 6   "phrase 6"   phrase 6  eligible     Phrase
    6   Raleigh/Charlotte  Label 7   [phrase 7]   phrase 7  eligible      Exact
    7   Raleigh/Charlotte  Label 8   "phrase 8"   phrase 8  eligible     Phrase
    0              Boston  Label 1   +phrase +1   phrase 1  eligible      Broad
    1              Boston  Label 1   [phrase 1]   phrase 1  eligible      Exact
    2        DC/Baltimore  Label 2   +phrase +2   phrase 2  eligible      Broad
    3        DC/Baltimore  Label 2   "phrase 2"   phrase 2  eligible     Phrase
    4           Philly/NJ  Label 3   +phrase +3   phrase 3  eligible      Broad
    5           Philly/NJ  Label 3   [phrase 3]   phrase 3  eligible      Exact
    6           Philly/NJ  Label 4   +phrase +4   phrase 4  eligible      Broad
    7           Philly/NJ  Label 4   [phrase 4]   phrase 4  eligible      Exact
    8           Philly/NJ  Label 5   +phrase +5   phrase 5  eligible      Broad
    9           Philly/NJ  Label 5   [phrase 5]   phrase 5  eligible      Exact
    10           Portland  Label 6   +phrase +6   phrase 6  eligible      Broad
    11           Portland  Label 6   [phrase 6]   phrase 6  eligible      Exact
    12  Raleigh/Charlotte  Label 7   +phrase +7   phrase 7  eligible      Broad
    13  Raleigh/Charlotte  Label 7   "phrase 7"   phrase 7  eligible     Phrase
    14  Raleigh/Charlotte  Label 8   +phrase +8   phrase 8  eligible      Broad
    15  Raleigh/Charlotte  Label 8   [phrase 8]   phrase 8  eligible      Exact
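
The output above renumbers the df2 rows, whereas the table in the question keeps the original df2 index (0, 1, 2, 3, 5, ...). One possible way to keep those labels, sketched here as an alternative to merge rather than as the method used in this answer, is a boolean mask built from the key columns with `MultiIndex.isin`. The small frames and the names `keys`, `mask`, and `matched` are illustrative.

```python
import pandas as pd

keys = ["Header 1", "Header 2", "Normalized"]

# Illustrative stand-ins for the tables in the question
df1 = pd.DataFrame({
    "Header 1": ["Boston"],
    "Header 2": ["Label 1"],
    "Normalized": ["phrase 1"],
    "Match type": ["Phrase"],
})
df2 = pd.DataFrame({
    "Header 1": ["Boston", "Frag", "Boston"],
    "Header 2": ["Label 1", "Label 22", "Label 1"],
    "Normalized": ["phrase 1", "what", "phrase 1"],
    "Match type": ["Broad", "Exact", "Exact"],
})

# True for each df2 row whose key combination also occurs in df1
mask = pd.MultiIndex.from_frame(df2[keys]).isin(pd.MultiIndex.from_frame(df1[keys]))

# Boolean filtering keeps df2's own columns and its original row labels
matched = df2[mask]
df3 = pd.concat([df1, matched])
print(df3)
```

Since this only filters df2, no suffixed columns are created and no dropping or renaming is needed afterwards.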
    

    【Discussion】:
