Comparing pandas DataFrame labels on the basis of a primary identifier: answers

【Question title】: pandas DataFrame label comparison on the basis of a primary identifier
【Posted】: 2020-06-09 03:58:54
【Question description】:

Suppose we have a first DataFrame (df1):

t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,healthcare,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,y
986554,2020-03-11 00:00:00,yahoo,IT,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n

and a second DataFrame (df2):

t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,health,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,n
986554,2020-03-11 00:00:00,yahoo,mail,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n

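I want to compare the two DataFrames row by row on the basis of the primary identifier (i.e. t_ent_id, calendar_date), checking whether the remaining labels match.

For example: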

    423342,2020-03-11 00:00:00,apple,healthcare,y  #df1    
    423342,2020-03-11 00:00:00,apple,health,y      #df2

If the remaining labels do not match, the comparison should output both mismatched rows:

    423342,2020-03-11 00:00:00,apple,healthcare,y      
    423342,2020-03-11 00:00:00,apple,health,y

I tried the following approach; please suggest some better options.

import pandas as pd

# Outer-merge on every column; rows found in only one frame are the mismatches
cols = ['t_ent_id', 'calendar_date', 'instrument_id', 'sector', 'flag']
df_f = pd.merge(df1, df2, how='outer', on=cols, indicator=True)
print(df_f)
diff = df_f[df_f['_merge'] != 'both']
diff = diff.sort_values(by=['calendar_date', 't_ent_id'])

Expected output

【Question comments】:

It is unclear what is needed. What is the expected output?

Tags: python pandas dataframe compare

【Solution 1】:

IIUC, use duplicated and drop_duplicates:

# Stack both frames, keep rows whose key occurs more than once, then drop exact duplicates
df = pd.concat([df1, df2])
res = df[df.duplicated(subset=["t_ent_id", "calendar_date"], keep=False)].drop_duplicates()
res = res.sort_values(["t_ent_id", "calendar_date"])
print(res)

Output:

   t_ent_id        calendar_date instrument_id      sector flag
0    423342  2020-03-11 00:00:00         apple  healthcare    y
0    423342  2020-03-11 00:00:00         apple      health    y
4    423342  2020-03-12 00:00:00         apple  healthcare    y
1    544442  2020-03-11 00:00:00     Microsoft    software    y
5    544442  2020-03-12 00:00:00     Microsoft    software    y
2    772222  2020-03-11 00:00:00        amazon          IT    y
2    772222  2020-03-11 00:00:00        amazon          IT    n
6    772222  2020-03-12 00:00:00        amazon          IT    y
3    986554  2020-03-11 00:00:00         yahoo          IT    n
3    986554  2020-03-11 00:00:00         yahoo        mail    n
7    986554  2020-03-12 00:00:00         yahoo          IT    n
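
The output above still lists every matching row once, next to the mismatched pairs. If only the mismatched pairs are wanted, one possible follow-up is to keep the keys that still occur more than once after drop_duplicates. This is a minimal sketch rather than part of the original answer: it assumes the key (t_ent_id, calendar_date) is unique within each frame, and the names KEYS, CSV1, CSV2 and only_mismatched are made up for illustration.

```python
import io

import pandas as pd

# Sample data copied from the question.
CSV1 = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,healthcare,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,y
986554,2020-03-11 00:00:00,yahoo,IT,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
"""
CSV2 = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,health,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,n
986554,2020-03-11 00:00:00,yahoo,mail,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
"""

KEYS = ["t_ent_id", "calendar_date"]  # hypothetical name for the primary identifier

df1 = pd.read_csv(io.StringIO(CSV1))
df2 = pd.read_csv(io.StringIO(CSV2))

# Same steps as the answer: stack, keep shared keys, drop exact duplicates.
df = pd.concat([df1, df2])
res = df[df.duplicated(subset=KEYS, keep=False)].drop_duplicates()

# Assumption: keys are unique within each frame, so a key that still
# appears twice here must belong to two rows that differ somewhere.
only_mismatched = res[res.duplicated(subset=KEYS, keep=False)]
only_mismatched = only_mismatched.sort_values(KEYS)
print(only_mismatched)
```

With the sample data this leaves the three 2020-03-11 pairs (apple, amazon, yahoo) and drops the rows that agree in both frames.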

【Comments】:

【Solution 2】:

Use boolean masking together with pd.concat and sort_values:

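# Rows of df1 that differ in any column from the df2 row with the same index label, and vice versa
d1 = df1[df1.ne(df2).any(axis=1)]
d2 = df2[df2.ne(df1).any(axis=1)]
# Put the mismatched rows of both frames next to each other
df = pd.concat([d1, d2]).sort_values(by=['t_ent_id','calendar_date'])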

# print(df)

   t_ent_id        calendar_date instrument_id      sector flag
0    423342  2020-03-11 00:00:00         apple  healthcare    y
0    423342  2020-03-11 00:00:00         apple      health    y
2    772222  2020-03-11 00:00:00        amazon          IT    y
2    772222  2020-03-11 00:00:00        amazon          IT    n
3    986554  2020-03-11 00:00:00         yahoo          IT    n
3    986554  2020-03-11 00:00:00         yahoo        mail    n
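
The masking above compares rows that share an index label, so it relies on both frames listing the keys in the same order. If that cannot be guaranteed, a possible variation is to align the rows with an inner merge on the key columns first. The sketch below is not from the original answer; it assumes the key is unique within each frame and that there are no missing values (NaN compares as unequal to NaN), and the names KEYS, differs, changed_keys and the source column are invented for illustration.

```python
import io

import pandas as pd

# The 2020-03-11 rows from the question, plus one matching 2020-03-12 row.
CSV1 = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,healthcare,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,y
986554,2020-03-11 00:00:00,yahoo,IT,n
423342,2020-03-12 00:00:00,apple,healthcare,y
"""
CSV2 = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,health,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,n
986554,2020-03-11 00:00:00,yahoo,mail,n
423342,2020-03-12 00:00:00,apple,healthcare,y
"""

KEYS = ["t_ent_id", "calendar_date"]  # hypothetical name for the primary identifier

df1 = pd.read_csv(io.StringIO(CSV1))
# Reverse df2 to show that row order no longer matters.
df2 = pd.read_csv(io.StringIO(CSV2)).iloc[::-1].reset_index(drop=True)

# Align rows by key; shared non-key columns get the _df1 / _df2 suffixes.
merged = df1.merge(df2, on=KEYS, suffixes=("_df1", "_df2"))

# Flag keys where any non-key column differs between the two frames.
value_cols = [c for c in df1.columns if c not in KEYS]
differs = pd.Series(False, index=merged.index)
for col in value_cols:
    differs |= merged[f"{col}_df1"].ne(merged[f"{col}_df2"])
changed_keys = merged.loc[differs, KEYS]

# Pull the original rows for those keys from each frame and interleave them.
d1 = df1.merge(changed_keys, on=KEYS).assign(source="df1")
d2 = df2.merge(changed_keys, on=KEYS).assign(source="df2")
result = pd.concat([d1, d2]).sort_values(KEYS + ["source"])
print(result)
```

The extra source column only records which frame each row came from, so the two versions of a key can be told apart in the sorted result.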

【Comments】: