Comparing all columns on two pandas DataFrames to get the difference (answers)

【Question title】: Comparing all columns on two pandas dataframes to get the difference
【Posted】: 2019-06-29 07:43:44
【Question】:

I have two pandas DataFrames. Say the first one is master:

ID  COL1    COL2
1   A       AA
2   B       BB
3   C       CC
4   D       DD

And the other is source:

ID  COL1    COL2
1   A       ZZ
2   B       BB
3   YY      CC
5   G       GG
6   H       HH

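Obviously the lengths may differ, and differences may occur in more than one column. The structure, however, will be the same. I want to find the records in source that are new or differ from what is available in master. That is, the output I am looking for is a DataFrame: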

ID  COL1    COL2
1   A       ZZ
3   YY      CC
5   G       GG
6   H       HH

I tried several existing solutions, but none of them seem to work for me. Essentially, I am trying to find out what is new.

【Comments】:

Does the ID have to match, or does it not matter?
This looks like the opposite of this; those answers might help you.

Tags: python pandas dataframe

【Solution 1】:

You can create a mask and use boolean indexing:

# set index
source = source.set_index('ID')
master = master.set_index('ID')

# find any record across rows where source is not in master
mask = (~source.isin(master)).any(axis=1)
# boolean indexing
source[mask]

   COL1 COL2
ID          
1     A   ZZ
3    YY   CC
5     G   GG
6     H   HH
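
For anyone who wants to run the mask approach end to end, here is a minimal sketch on the sample data from the question. The helper name `new_or_changed` is made up for illustration and is not part of the answer. It assumes ID is unique in both frames, because `isin` with a DataFrame argument aligns on index and column labels and raises on duplicate labels. The last lines reverse the rows of source as a quick check of whether row order affects which IDs are selected once ID is the index.

```python
import pandas as pd

# Sample data copied from the question
master = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "COL1": ["A", "B", "C", "D"],
    "COL2": ["AA", "BB", "CC", "DD"],
})
source = pd.DataFrame({
    "ID": [1, 2, 3, 5, 6],
    "COL1": ["A", "B", "YY", "G", "H"],
    "COL2": ["ZZ", "BB", "CC", "GG", "HH"],
})


def new_or_changed(source, master):
    """Hypothetical helper: source rows whose ID is new or whose values differ."""
    src = source.set_index("ID")
    mst = master.set_index("ID")
    # isin with a DataFrame compares cell by cell after aligning on
    # index (ID) and column labels; IDs missing from master never match
    mask = (~src.isin(mst)).any(axis=1)
    return src[mask]


result = new_or_changed(source, master)
print(result)

# Reverse the row order of source and compare the selected IDs
reversed_result = new_or_changed(source.iloc[::-1], master)
print(sorted(reversed_result.index) == sorted(result.index))
```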

【Comments】:

This solution is wrong in general and only works for the sample data. Try changing the order of some rows to check.

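【Solution 2】:

There are several ways to solve this, depending on how you want to handle memory allocation and on whether you intend to work with large datasets or this is only for academic/training purposes.

Iterate over the rows, compare them, and append the differing ones to a new DataFrame. (More code, more memory-efficient)
Create a new merged (outer) DataFrame and apply a function that removes duplicates. (Less code, less memory-efficient)

These are just two ideas to provide some insight; there are probably more.

Approach 1: (assuming ID is unique and is a regular column, not the index)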

import pandas as pd

master_by_id = master.set_index('ID')  # ID is assumed unique in master
rows = []  # source rows that are new or differ from master
for _, row in source.iterrows():
    is_new = row['ID'] not in master_by_id.index
    # keep the row when its ID is new or any non-ID column differs
    if is_new or (row.drop('ID') != master_by_id.loc[row['ID']]).any():
        rows.append(row)
results = pd.DataFrame(rows, columns=source.columns)

Approach 2:

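# An outer merge on every column collapses matching rows into one, leaving nothing to drop,
# so stack source once and master twice instead (assumes source has no repeated rows).
results = pd.concat([source, master, master], ignore_index=True)
final_results = results.drop_duplicates(keep=False)  # keeps rows found only in source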
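
A variation on the duplicate-dropping idea, sketched on the sample data from the question. This uses `pd.concat` rather than a merge, and the function name `rows_only_in_source` is invented for this example. It assumes both frames have identical columns and that source contains no repeated rows, since any row appearing twice in source would also be dropped.

```python
import pandas as pd

# Sample data copied from the question
master = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "COL1": ["A", "B", "C", "D"],
    "COL2": ["AA", "BB", "CC", "DD"],
})
source = pd.DataFrame({
    "ID": [1, 2, 3, 5, 6],
    "COL1": ["A", "B", "YY", "G", "H"],
    "COL2": ["ZZ", "BB", "CC", "GG", "HH"],
})


def rows_only_in_source(source, master):
    """Hypothetical helper: rows of source that do not appear verbatim in master."""
    # master is stacked twice so every master row is guaranteed to be a duplicate
    stacked = pd.concat([source, master, master], ignore_index=True)
    # keep=False drops every row that occurs more than once
    return stacked.drop_duplicates(keep=False).reset_index(drop=True)


final_results = rows_only_in_source(source, master)
print(final_results)
```

Stacking master twice is what limits the result to source: rows that exist only in master still appear twice and are removed along with the rows the two frames share.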

【Comments】:

【Solution 3】:

Use merge with indicator=True and an outer join, then filter and keep only the columns in df2.columns:

# df1 is master, df2 is source
# columns to compare, listed explicitly
cols = ['COL1', 'COL2']
# or: all columns except ID
# cols = df2.columns.difference(['ID']).tolist()
df = (df1.merge(df2, on=cols, how='outer', indicator=True, suffixes=('_', ''))
         .query("_merge == 'right_only'")[df2.columns])
print(df)
    ID COL1 COL2
4  1.0    A   ZZ
5  3.0   YY   CC
6  5.0    G   GG
7  6.0    H   HH
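
In the output above, ID is shown as a float because the outer join introduces NaN into that column for rows found only in df1. A possible variation, not taken from the answer, is a left merge from the source side on every column including ID. This is only a sketch: it assumes df1 is master and df2 is source as in the question, and unlike the answer it also treats a different ID as a difference.

```python
import pandas as pd

# Sample data copied from the question: df1 is master, df2 is source
df1 = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "COL1": ["A", "B", "C", "D"],
    "COL2": ["AA", "BB", "CC", "DD"],
})
df2 = pd.DataFrame({
    "ID": [1, 2, 3, 5, 6],
    "COL1": ["A", "B", "YY", "G", "H"],
    "COL2": ["ZZ", "BB", "CC", "GG", "HH"],
})

all_cols = list(df2.columns)
# A left merge keeps every row of df2; _merge records whether df1 had a match
merged = df2.merge(df1, on=all_cols, how="left", indicator=True)
# left_only marks rows of df2 with no identical row in df1
df = merged.loc[merged["_merge"] == "left_only", all_cols]
print(df)
```

Because the join keys come from df2 and no unmatched rows from df1 are added, no NaN values appear in ID and the column keeps its integer dtype.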

【Comments】: