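【Question title】: How does the pandas compare function work?
【Posted】: 2022-11-04 23:14:19
【Question description】:

Can someone explain the detailed implementation of the pandas compare() function, which compares two DataFrames?

Code implementation: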

def compare(
        self,
        other,
        align_axis: Axis = 1,
        keep_shape: bool_t = False,
        keep_equal: bool_t = False,
    ):
        from pandas.core.reshape.concat import concat

        if type(self) is not type(other):
            cls_self, cls_other = type(self).__name__, type(other).__name__
            raise TypeError(
                f"can only compare '{cls_self}' (not '{cls_other}') with '{cls_self}'"
            )

        mask = ~((self == other) | (self.isna() & other.isna()))
        keys = ["self", "other"]

        if not keep_equal:
            self = self.where(mask)
            other = other.where(mask)

        if not keep_shape:
            if isinstance(self, ABCDataFrame):
                cmask = mask.any()
                rmask = mask.any(axis=1)
                self = self.loc[rmask, cmask]
                other = other.loc[rmask, cmask]
            else:
                self = self[mask]
                other = other[mask]

        if align_axis in (1, "columns"):  # This is needed for Series
            axis = 1
        else:
            axis = self._get_axis_number(align_axis)

        diff = concat([self, other], axis=axis, keys=keys)

        if axis >= self.ndim:
            # No need to reorganize data if stacking on new axis
            # This currently applies for stacking two Series on columns
            return diff

        ax = diff._get_axis(axis)
        ax_names = np.array(ax.names)

        # set index names to positions to avoid confusion
        ax.names = np.arange(len(ax_names))

        # bring self-other to inner level
        order = list(range(1, ax.nlevels)) + [0]
        if isinstance(diff, ABCDataFrame):
            diff = diff.reorder_levels(order, axis=axis)
        else:
            diff = diff.reorder_levels(order)

        # restore the index names in order
        diff._get_axis(axis=axis).names = ax_names[order]

        # reorder axis to keep things organized
        indices = (
            np.arange(diff.shape[axis]).reshape([2, diff.shape[axis] // 2]).T.flatten()
        )
        diff = diff.take(indices, axis=axis)

        return diff

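【Comments】:

Please trim your code to make it easier to find your problem. Follow these guidelines to create a minimal reproducible example.
Is this the pandas compare function?

Tags: python pandas dataframe comparison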

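【Solution 1】:

If you haven't read the documentation yet, I would start there.

To hopefully shed some light on the more advanced usage, we can work through a few examples.

Example 1 -- all the same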

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

print(s0.compare(s1))

Out:

Empty DataFrame
Columns: [self, other]
Index: []

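According to the documentation, the .compare method should return only the rows where self (i.e. s0) and other (i.e. s1) are not equal. Above, s1 is an exact copy of s0, so every row is equal and an empty DataFrame is returned.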
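As a rough sketch of what "not equal" means here, the snippet below evaluates the same `mask` expression that appears in the implementation quoted in the question. The sample values are made up for illustration. A position counts as different unless the two values are equal or both are NaN.

```python
import numpy as np
import pandas as pd

# Illustrative data: index 1 is NaN on both sides, index 2 really differs.
s0 = pd.Series([1.0, np.nan, 3.0])
s1 = pd.Series([1.0, np.nan, 4.0])

# Same expression as the `mask` line in the quoted implementation:
# True marks positions that compare() treats as different.
mask = ~((s0 == s1) | (s0.isna() & s1.isna()))
print(mask.tolist())

# Only the position flagged by the mask survives in the result.
result = s0.compare(s1)
print(result)
```

The NaN pair at index 1 is treated as equal even though `np.nan == np.nan` is False, which is why the `isna()` term is needed.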

Example 2 -- different

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

# change the 0th element in `s1` to something else
s1.iloc[0] = "a different value"

print(s0.compare(s1))

Out:

       self              other
0  0.548814  a different value

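By changing a single element in s1, we can see the standard usage of .compare. The resulting frame has two columns ("self" and "other"). The value in row 0 of s0 is a float, while the value in row 0 of s1 is a string. They clearly differ, as the result shows.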

Example 2 -- `keep_shape=True`

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

# change the 0th element in `s1` to something else
s1.iloc[0] = "a different value"

print(s0.compare(s1, keep_shape=True))

Out:

       self              other
0  0.548814  a different value
1       NaN                NaN
2       NaN                NaN
3       NaN                NaN
4       NaN                NaN

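The documentation for the keep_shape parameter says:

keep_shape : bool, default False
    If true, all rows and columns are kept.
    Otherwise, only the ones with different values are kept.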
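Because we changed the parameter from its default of False to True, .compare returns a DataFrame with the same number of rows as s0 and s1. The logic for this parameter can be found here.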
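For a DataFrame, the `if not keep_shape:` branch of the quoted implementation drops both the rows and the columns that contain no differences. The following is a small sketch with made-up data (the frames `df0` and `df1` are hypothetical) that contrasts the two settings:

```python
import pandas as pd

df0 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df1 = df0.copy()
df1.loc[1, "b"] = 50  # the only difference: row 1 of column "b"

# Default keep_shape=False: only row 1 and column "b" remain.
trimmed = df0.compare(df1)

# keep_shape=True: every row and column is kept; equal cells become NaN.
full = df0.compare(df1, keep_shape=True)

print(trimmed)
print(full)
```

With `keep_shape=True` the result has one "self" and one "other" column for each original column, so two input columns become four.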

Example 3 -- `keep_equal=True`

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

# change the 0th element in `s1` to something else
s1.iloc[0] = "a different value"

print(s0.compare(s1, keep_equal=True))

Out:

       self              other
0  0.548814  a different value
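
The documentation for the keep_equal parameter says:

keep_equal : bool, default False
    If true, the result keeps values that are equal.
    Otherwise, equal values are shown as NaNs.

Based on this, you might expect the result to show the rows where s0 and s1 are the same. But that is not the case. Why? The logic for this parameter is very short and can be found here. If keep_equal is set to True, the conditional is skipped and mask is not applied to self and other.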

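But! Further down, in the keep_shape conditional, you will see that mask is applied as a boolean filter, which drops the rows where mask is False. So when comparing two Series with the default keep_shape=False, changing the keep_equal parameter does not actually do anything. I logged an issue with pandas documenting this.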
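A short sketch of this interaction, using made-up values: with the default `keep_shape=False` the equal rows are filtered out before `keep_equal` can show them, whereas combining `keep_equal=True` with `keep_shape=True` does display the equal values.

```python
import pandas as pd

s0 = pd.Series([1.0, 2.0, 3.0])
s1 = pd.Series([1.0, 2.0, 30.0])  # only index 2 differs

default = s0.compare(s1)
only_keep_equal = s0.compare(s1, keep_equal=True)
both = s0.compare(s1, keep_shape=True, keep_equal=True)

# keep_equal alone changes nothing for a Series ...
print(default.equals(only_keep_equal))

# ... but together with keep_shape the equal values are shown instead of NaN.
print(both)
```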

Extras

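The align_axis parameter essentially transposes the result: with the default (1 or "columns") self and other sit side by side as columns, while 0 or "index" stacks them vertically, with rows drawn alternately from self and other.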
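To illustrate the two layouts, here is a minimal sketch with made-up Series values. For a Series, `align_axis=0` returns a Series with a two-level index rather than a DataFrame.

```python
import pandas as pd

s0 = pd.Series([1.0, 2.0, 3.0])
s1 = pd.Series([1.0, 20.0, 30.0])  # indexes 1 and 2 differ

# Default align_axis=1: a DataFrame with "self" and "other" columns.
side_by_side = s0.compare(s1)

# align_axis=0: a Series whose index pairs each original label
# with "self" / "other", alternating between the two inputs.
stacked = s0.compare(s1, align_axis=0)

print(side_by_side)
print(stacked)
```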

result_names lets you change the names of the columns in the output (the defaults are "self" and "other").
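A brief sketch of result_names, assuming pandas 1.5 or newer (the version used in the examples above). The labels "left" and "right" are arbitrary names chosen for illustration.

```python
import pandas as pd

s0 = pd.Series([1.0, 2.0])
s1 = pd.Series([1.0, 5.0])  # index 1 differs

# Replace the default "self"/"other" labels with custom ones.
renamed = s0.compare(s1, result_names=("left", "right"))
print(renamed)
```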

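Comparing DataFrame instances works in a similar way, but the result has MultiIndex columns (level 0 holds the column names, and level 1 holds the names from the result_names parameter).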
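The sketch below, again with made-up frames, shows the resulting column layout. It also reproduces the interleaving step from the end of the quoted implementation, which is what places each column's "self" and "other" next to each other after the concat.

```python
import numpy as np
import pandas as pd

df0 = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
df1 = pd.DataFrame({"a": [1, 20], "b": [30, 4]})

diff = df0.compare(df1)
# Level 0: original column name, level 1: "self" / "other".
print(diff.columns.tolist())

# Interleaving expression from the quoted source, for 4 columns:
# positions [0, 1] come from self and [2, 3] from other after the concat,
# so taking them in this order alternates self and other.
n = 4
indices = np.arange(n).reshape([2, n // 2]).T.flatten()
print(indices.tolist())
```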

【Discussion】:
