【Posted】: 2020-06-09 03:58:54
【Question】:
Suppose we have a first DataFrame (df1):
t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,healthcare,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,y
986554,2020-03-11 00:00:00,yahoo,IT,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
and a second DataFrame (df2):
t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,health,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,n
986554,2020-03-11 00:00:00,yahoo,mail,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
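I want to compare the two DataFrames row by row, matching rows on the primary identifiers (i.e. t_ent_id, calendar_date) and comparing the remaining columns.
For example: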
423342,2020-03-11 00:00:00,apple,healthcare,y #df1
423342,2020-03-11 00:00:00,apple,health,y #df2
If the remaining columns do not match, it should output both mismatched rows:
423342,2020-03-11 00:00:00,apple,healthcare,y
423342,2020-03-11 00:00:00,apple,health,y
I tried the following approach; please suggest a better option:
import pandas as pd

# Outer-merge on every column; rows found in only one frame are the differences.
cols = ['t_ent_id', 'calendar_date', 'instrument_id', 'sector', 'flag']
df_f = pd.merge(df_1, df_2, how='outer', on=cols, indicator=True)
print(df_f)
diff = df_f[df_f['_merge'] != 'both']
diff = diff.sort_values(by=['calendar_date', 't_ent_id'])  # sort_values returns a new frame
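One possible alternative, offered only as a rough sketch: merge on the two identifier columns alone, compare the remaining columns pairwise, and then pull the offending rows back out of each frame. The helper name `mismatched_rows` and the extra `source` column are hypothetical, introduced here for illustration. The sketch assumes both frames share the same column names and that each (t_ent_id, calendar_date) pair occurs at most once per frame.

```python
import io
import pandas as pd

KEYS = ["t_ent_id", "calendar_date"]

DF1_CSV = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,healthcare,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,y
986554,2020-03-11 00:00:00,yahoo,IT,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
"""

DF2_CSV = """t_ent_id,calendar_date,instrument_id,sector,flag
423342,2020-03-11 00:00:00,apple,health,y
544442,2020-03-11 00:00:00,Microsoft,software,y
772222,2020-03-11 00:00:00,amazon,IT,n
986554,2020-03-11 00:00:00,yahoo,mail,n
423342,2020-03-12 00:00:00,apple,healthcare,y
544442,2020-03-12 00:00:00,Microsoft,software,y
772222,2020-03-12 00:00:00,amazon,IT,y
986554,2020-03-12 00:00:00,yahoo,IT,n
"""


def mismatched_rows(df1, df2, keys=KEYS):
    """Return the rows of both frames whose non-key columns differ for the same key.

    Hypothetical helper: assumes identical column names in both frames and
    unique keys within each frame. Missing values (NaN) compare as unequal.
    """
    value_cols = [c for c in df1.columns if c not in keys]
    # Inner join on the keys only, so each key pairs its df1 row with its df2 row.
    merged = df1.merge(df2, on=keys, suffixes=("_df1", "_df2"))
    left = merged[[c + "_df1" for c in value_cols]].to_numpy()
    right = merged[[c + "_df2" for c in value_cols]].to_numpy()
    # A key is a mismatch if any of its non-key columns differs.
    differs = (left != right).any(axis=1)
    bad_keys = merged.loc[differs, keys].drop_duplicates()
    # Fetch the original rows for those keys and tag which frame they came from.
    out = pd.concat(
        [
            df1.merge(bad_keys, on=keys).assign(source="df1"),
            df2.merge(bad_keys, on=keys).assign(source="df2"),
        ],
        ignore_index=True,
    )
    return out.sort_values(keys + ["source"]).reset_index(drop=True)


df1 = pd.read_csv(io.StringIO(DF1_CSV))
df2 = pd.read_csv(io.StringIO(DF2_CSV))
result = mismatched_rows(df1, df2)
print(result)
```

With the sample data from the question, this pairs each differing df1 row with its df2 counterpart. Unlike the outer merge on all columns, it does not report rows whose key exists in only one frame; those would need a separate check.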
Expected output
【Comments】:
-
It is not clear what is needed. What is the expected output?
Tags: python pandas dataframe compare