Intersect a dataframe with a larger one that includes it and remove common rows: answers

【Question title】: Intersect a dataframe with a larger one that includes it and remove common rows
【Posted】: 2019-10-09 10:39:10
【Question description】:

I have two dataframes:

import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3], 
                                  [4, 5, 6], 
                                  [7, 8, 9]]),
                     columns=['a', 'b', 'c'])

and

df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99], 
                                  [31, 4, 5, 6, 75], 
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                     columns=['k', 'a', 'b', 'c', 'd'])

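Now I want to intersect the two and keep only the rows of df_large that do not contain a row of df_small (matched on the shared columns a, b, c), so the result should be: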

df_result = pd.DataFrame(np.array([[16, 2, 1, 2, 13],
                                   [17, 1, 4, 3, 25],
                                   [93, 3, 2, 8, 18]]),
                     columns=['k', 'a', 'b', 'c', 'd'])

【Question discussion】:

Tags: pandas dataframe intersection

【Solution 1】:

Use DataFrame.merge with indicator=True and a left join. Because of the error reported in the comments below, first remove duplicates from df_small with DataFrame.drop_duplicates:

m = df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)['_merge'].ne('both')
df = df_large[m]
print(df)
    k  a  b  c   d
3  16  2  1  2  13
4  17  1  4  3  25
5  93  3  2  8  18
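
A minimal sketch of why the drop_duplicates step matters, assuming a hypothetical df_small_dup in which the first row of df_small appears twice (this variant is not part of the question). With a left join, every matching row of df_large is repeated once per duplicate in the right frame, so the merged result no longer has one row per row of df_large and the mask built from it stops lining up with df_large:

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3],
                                  [4, 5, 6],
                                  [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

# Hypothetical variant of df_small: the row (1, 2, 3) appears twice
df_small_dup = pd.concat([df_small, df_small.iloc[[0]]], ignore_index=True)

# Without deduplication the matching row of df_large is emitted twice
merged_dup = df_large.merge(df_small_dup, how='left', indicator=True)

# After drop_duplicates the merge yields one row per row of df_large again
merged_dedup = df_large.merge(df_small_dup.drop_duplicates(),
                              how='left', indicator=True)

print(len(df_large), len(merged_dup), len(merged_dedup))  # 6 7 6
```

Only when the merged frame has the same length and order as df_large can the `_merge` column be used directly as a row mask for it.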

Another solution is very similar; it filters with query and drops the _merge column at the end:

df = (df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)
              .query('_merge != "both"')
              .drop('_merge', axis=1))

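【Discussion】:

IndexingError: Unalignable boolean Series provided as indexer (index of the boolean Series and of the indexed object do not match).
@Quubix - I think df_small must contain dupes, so it is necessary to remove them before the merge.
Thanks for explaining that "on" is not required; I have removed it from the answer.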
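
As a side note to the error quoted above: merge returns a frame with a fresh 0..n-1 index, so the boolean Series built from it carries those labels rather than the labels of df_large. The sketch below assumes a hypothetical df_large with a non-default index (not part of the question) and applies the mask by position via to_numpy(), which avoids relying on label alignment. This is one possible workaround, not necessarily the cause of the commenter's error.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3],
                                  [4, 5, 6],
                                  [7, 8, 9]]),
                        columns=['a', 'b', 'c'])

# Hypothetical: same data as in the question, but with index labels 10..15
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'],
                        index=[10, 11, 12, 13, 14, 15])

# The merged frame is indexed 0..5, not 10..15
merged = df_large.merge(df_small.drop_duplicates(), how='left', indicator=True)

# Convert the mask to a plain NumPy array so it is applied by position;
# this is valid because the deduplicated left join keeps one row per
# row of df_large, in the same order
keep = merged['_merge'].ne('both').to_numpy()
df_positional = df_large[keep]
print(df_positional)
```

The original index labels of df_large (13, 14, 15 here) are preserved in the result.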

【Solution 2】:

Use DataFrame.merge:

df_large.merge(df_small,how='outer',indicator=True).query('_merge == "left_only"').drop('_merge', axis=1)

Output:

    k  a  b  c   d
3  16  2  1  2  13
4  17  1  4  3  25
5  93  3  2  8  18

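【Discussion】:

Hmm, looking closely, the difference is a left versus an outer join. With the sample data the results are the same, but in general they can differ; I also added drop_duplicates() following the OP's comment.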
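
A small sketch of where the two join types diverge, assuming a hypothetical extra row in df_small that has no match in df_large (this row is not in the question). The outer join keeps that row and marks it right_only, so a filter such as `_merge != "both"` lets it through, while `_merge == "left_only"` does not; the left join never produces right_only rows.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3],
                                  [4, 5, 6],
                                  [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

# Hypothetical extra row that exists only in df_small
extra = pd.DataFrame([[0, 0, 0]], columns=['a', 'b', 'c'])
df_small_extra = pd.concat([df_small, extra], ignore_index=True)

left = df_large.merge(df_small_extra, how='left', indicator=True)
outer = df_large.merge(df_small_extra, how='outer', indicator=True)

# Count the rows that come only from the right-hand frame
right_only_in_left = int((left['_merge'] == 'right_only').sum())
right_only_in_outer = int((outer['_merge'] == 'right_only').sum())

# Two possible filters on the outer result
not_both_outer = outer[outer['_merge'] != 'both']        # includes the extra row
left_only_outer = outer[outer['_merge'] == 'left_only']  # rows of df_large only

print(right_only_in_left, right_only_in_outer)
print(len(not_both_outer), len(left_only_outer))
```

Note that the unmatched row has no values for k and d, so those columns contain NaN in the outer result.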

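【Solution 3】:

You can avoid the merge and make your code more readable. It is really not clear what happens when you merge and drop duplicates. Indexes and MultiIndexes are designed for intersection and other set operations.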

common_columns = df_large.columns.intersection(df_small.columns).to_list()
df_small_as_Multiindex = pd.MultiIndex.from_frame(df_small)
# Parentheses replace the backslash continuations, which cannot be followed by comments
df_result = (
    df_large.set_index(common_columns)
    .drop(index=df_small_as_Multiindex)  # Drop the common rows
    .reset_index()  # Not needed if the a,b,c columns are meaningful indexes
)

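【Discussion】:

Well, from the machine's point of view it is clear. But if I read someone else's code without a chance to look at the underlying data, I really cannot tell what the merge-and-drop-duplicates operation is trying to achieve. What I want to read is: 1. A = the common columns of x and y; 2. B = the common rows of x and y on the columns A; 3. x = x without the rows B.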
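
The three steps in this comment can be written almost literally. The sketch below is one possible reading, not the answerer's code: it swaps in MultiIndex.isin to build a row mask in place of set_index/drop, so the column order and index of df_large stay unchanged. The variable names are chosen for illustration.

```python
import numpy as np
import pandas as pd

df_small = pd.DataFrame(np.array([[1, 2, 3],
                                  [4, 5, 6],
                                  [7, 8, 9]]),
                        columns=['a', 'b', 'c'])
df_large = pd.DataFrame(np.array([[22, 1, 2, 3, 99],
                                  [31, 4, 5, 6, 75],
                                  [73, 7, 8, 9, 23],
                                  [16, 2, 1, 2, 13],
                                  [17, 1, 4, 3, 25],
                                  [93, 3, 2, 8, 18]]),
                        columns=['k', 'a', 'b', 'c', 'd'])

# 1. A = the columns shared by both frames
common_columns = df_large.columns.intersection(df_small.columns).to_list()

# 2. B = the rows of df_large whose values in A also occur in df_small
large_keys = pd.MultiIndex.from_frame(df_large[common_columns])
small_keys = pd.MultiIndex.from_frame(df_small[common_columns])
is_common_row = large_keys.isin(small_keys)  # boolean NumPy array

# 3. x = x without the rows B
df_result = df_large[~is_common_row]
print(df_result)
```

Because the mask is a plain boolean array, duplicates in df_small and a non-default index on df_large do not affect the selection.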