【Question title】: pandas dataframe label comparison on the basis of primary identifier
【Posted】: 2020-06-09 03:58:54
【Question description】:

Suppose we have a first dataframe (df1):

t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,healthcare,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,y
986554,2020-03-11 00:00:00,yahoo,IT,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n

and a second dataframe (df2):

t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,health,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,n
986554,2020-03-11 00:00:00,yahoo,mail,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n

I want to compare the two dataframes (their labels) row by row on the basis of the primary identifier, i.e. (t_ent_id, calendar_date).

For example:

    423342,2020-03-11 00:00:00,apple,healthcare,y  #df1    
    423342,2020-03-11 00:00:00,apple,health,y      #df2

If the remaining labels do not match, the result should contain both mismatched rows:

    423342,2020-03-11 00:00:00,apple,healthcare,y      
    423342,2020-03-11 00:00:00,apple,health,y

I tried the following approach; please suggest a better option:

import pandas as pd

cols = ['t_ent_id', 'calendar_date', 'instrument_id', 'sector', 'flag']
# Outer merge on every column; rows found in only one frame are the mismatches
df_f = pd.merge(df1, df2, how='outer', on=cols, indicator=True)
print(df_f)
diff = df_f[df_f['_merge'] != 'both']
diff = diff.sort_values(by=['calendar_date', 't_ent_id'])

Expected output:

【Comments】:

  • It is unclear what is needed. What is the expected output?

Tags: python pandas dataframe compare


【Solution 1】:

IIUC, use duplicated and drop_duplicates:

df = pd.concat([df1,df2])
res = df[df.duplicated(subset=["t_ent_id", "calendar_date"], keep=False)].drop_duplicates()
res = res.sort_values(["t_ent_id", "calendar_date"])
print(res)

Output:

   t_ent_id        calendar_date instrument_id      sector flag
0    423342  2020-03-11 00:00:00         apple  healthcare    y
0    423342  2020-03-11 00:00:00         apple      health    y
4    423342  2020-03-12 00:00:00         apple  healthcare    y
1    544442  2020-03-11 00:00:00     Microsoft    software    y
5    544442  2020-03-12 00:00:00     Microsoft    software    y
2    772222  2020-03-11 00:00:00        amazon          IT    y
2    772222  2020-03-11 00:00:00        amazon          IT    n
6    772222  2020-03-12 00:00:00        amazon          IT    y
3    986554  2020-03-11 00:00:00         yahoo          IT    n
3    986554  2020-03-11 00:00:00         yahoo        mail    n
7    986554  2020-03-12 00:00:00         yahoo          IT    n
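
The output above still lists rows that are identical in both frames (for example, every 2020-03-12 row appears once). If only the mismatched pairs are wanted, one possible variation is to drop every row that occurs more than once after concatenation. This is a minimal sketch, not part of the original answer; it assumes neither frame contains fully duplicated rows on its own and that both frames cover the same keys. The `CSV1`/`CSV2` names are introduced here only to rebuild the sample data from the question.

```python
import io
import pandas as pd

# Sample data copied from the question (CSV1/CSV2 are illustrative names).
CSV1 = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,healthcare,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,y
986554,2020-03-11 00:00:00,yahoo,IT,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
"""
CSV2 = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,health,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,n
986554,2020-03-11 00:00:00,yahoo,mail,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
"""
df1 = pd.read_csv(io.StringIO(CSV1))
df2 = pd.read_csv(io.StringIO(CSV2))

keys = ["t_ent_id", "calendar_date"]

# keep=False removes every copy of a row that appears in both frames,
# so only rows that differ in at least one column survive.
only_mismatched = (
    pd.concat([df1, df2])
    .drop_duplicates(keep=False)
    .sort_values(keys, kind="stable")
)
print(only_mismatched)
```

With the sample data this leaves six rows: the df1 and df2 versions of the three keys dated 2020-03-11 whose sector or flag differs.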

【Discussion】:

    【Solution 2】:

    Use boolean masking together with pd.concat and sort_values:

    d1 = df1[df1.ne(df2).any(axis=1)]
    d2 = df2[df2.ne(df1).any(axis=1)]
    df = pd.concat([d1, d2]).sort_values(by=['t_ent_id','calendar_date'])
    

    # print(df)
    
       t_ent_id        calendar_date instrument_id      sector flag
    0    423342  2020-03-11 00:00:00         apple  healthcare    y
    0    423342  2020-03-11 00:00:00         apple      health    y
    2    772222  2020-03-11 00:00:00        amazon          IT    y
    2    772222  2020-03-11 00:00:00        amazon          IT    n
    3    986554  2020-03-11 00:00:00         yahoo          IT    n
    3    986554  2020-03-11 00:00:00         yahoo        mail    n
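
The masking above pairs rows by their position in the index, so it relies on both frames having the same shape and the same row order. Where that cannot be guaranteed, a possible sketch is to align the frames on the primary identifier first and then apply the same `ne(...).any(axis=1)` mask. This is an illustration rather than the answerer's method; it assumes (t_ent_id, calendar_date) is unique within each frame, and the `source` column and `CSV1`/`CSV2` names are hypothetical additions for the example.

```python
import io
import pandas as pd

# Sample data copied from the question (CSV1/CSV2 are illustrative names).
CSV1 = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,healthcare,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,y
986554,2020-03-11 00:00:00,yahoo,IT,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
"""
CSV2 = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,health,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,n
986554,2020-03-11 00:00:00,yahoo,mail,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
"""
df1 = pd.read_csv(io.StringIO(CSV1))
df2 = pd.read_csv(io.StringIO(CSV2))

keys = ["t_ent_id", "calendar_date"]

# Index both frames by the key columns (assumed unique within each frame).
a = df1.set_index(keys)
b = df2.set_index(keys)

# Restrict both frames to the keys they share, in one common order.
common = a.index.intersection(b.index)
a = a.reindex(common)
b = b.reindex(common)

# True for every key where at least one non-key column differs.
mask = a.ne(b).any(axis=1)

# Stack the differing rows from both frames and tag where each came from.
mismatched = (
    pd.concat([a[mask].assign(source="df1"), b[mask].assign(source="df2")])
    .reset_index()
    .sort_values(keys + ["source"])
    .reset_index(drop=True)
)
print(mismatched)
```

Keys that exist in only one frame are ignored by this sketch; they could be reported separately from the difference of the two indexes if needed.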
    

    【Discussion】:
