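【Question title】: How does the pandas Compare function work?
【Posted】: 2022-11-04 23:14:19
【Question description】:

Can someone explain the detailed implementation of the pandas compare() function, which compares two DataFrames?

Implementation: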

def compare(
        self,
        other,
        align_axis: Axis = 1,
        keep_shape: bool_t = False,
        keep_equal: bool_t = False,
    ):
        from pandas.core.reshape.concat import concat

        if type(self) is not type(other):
            cls_self, cls_other = type(self).__name__, type(other).__name__
            raise TypeError(
                f"can only compare '{cls_self}' (not '{cls_other}') with '{cls_self}'"
            )

        mask = ~((self == other) | (self.isna() & other.isna()))
        keys = ["self", "other"]

        if not keep_equal:
            self = self.where(mask)
            other = other.where(mask)

        if not keep_shape:
            if isinstance(self, ABCDataFrame):
                cmask = mask.any()
                rmask = mask.any(axis=1)
                self = self.loc[rmask, cmask]
                other = other.loc[rmask, cmask]
            else:
                self = self[mask]
                other = other[mask]

        if align_axis in (1, "columns"):  # This is needed for Series
            axis = 1
        else:
            axis = self._get_axis_number(align_axis)

        diff = concat([self, other], axis=axis, keys=keys)

        if axis >= self.ndim:
            # No need to reorganize data if stacking on new axis
            # This currently applies for stacking two Series on columns
            return diff

        ax = diff._get_axis(axis)
        ax_names = np.array(ax.names)

        # set index names to positions to avoid confusion
        ax.names = np.arange(len(ax_names))

        # bring self-other to inner level
        order = list(range(1, ax.nlevels)) + [0]
        if isinstance(diff, ABCDataFrame):
            diff = diff.reorder_levels(order, axis=axis)
        else:
            diff = diff.reorder_levels(order)

        # restore the index names in order
        diff._get_axis(axis=axis).names = ax_names[order]

        # reorder axis to keep things organized
        indices = (
            np.arange(diff.shape[axis]).reshape([2, diff.shape[axis] // 2]).T.flatten()
        )
        diff = diff.take(indices, axis=axis)

        return diff

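【Comments】:

  • Please trim your code to make it easier to find your problem. Follow these guidelines to create a minimal reproducible example.
  • Is this the pandas compare function?

Tags: python pandas dataframe comparison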


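【Answer 1】:

If you haven't read the documentation yet, I would start there.

To shed some light on the more advanced usage, we can work through a few examples.

Example 1 -- all the same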

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

print(s0.compare(s1))

Out:

Empty DataFrame
Columns: [self, other]
Index: []

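According to the documentation, the .compare method should return only the rows that are not equal between self (i.e. s0) and other (i.e. s1). Above, s1 is an exact copy of s0, so every row is equal and an empty DataFrame is returned.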
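To make "not equal" concrete, here is a minimal sketch of how the mask line from the quoted implementation behaves. The sample values are made up for illustration; the mask expression is copied from the question's code. Note that two missing values at the same position count as equal, even though NaN == NaN is False.

```python
import numpy as np
import pandas as pd

# Illustrative data: row 1 is NaN in both, row 2 differs.
s0 = pd.Series([1.0, np.nan, 3.0])
s1 = pd.Series([1.0, np.nan, 4.0])

# Same expression as in the quoted implementation: a position is
# "different" unless the values are equal or both are missing.
mask = ~((s0 == s1) | (s0.isna() & s1.isna()))
print(mask.tolist())  # [False, False, True]

# Only the row flagged by the mask shows up in the result.
diff = s0.compare(s1)
print(diff)
```

Only row 2 survives, with 3.0 under "self" and 4.0 under "other".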

Example 2 -- different

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

# change the 0th element in `s1` to something else
s1.iloc[0] = "a different value"

print(s0.compare(s1))

Out:

       self              other
0  0.548814  a different value

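By changing a single element in s1 we can see the standard usage of .compare. The resulting frame has two columns ("self" and "other"). The value in row 0 of s0 is a float, while the value in row 0 of s1 is a string. They clearly differ, as the result shows.

Example 3 -- keep_shape=True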

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

# change the 0th element in `s1` to something else
s1.iloc[0] = "a different value"

print(s0.compare(s1, keep_shape=True))

Out:

       self              other
0  0.548814  a different value
1       NaN                NaN
2       NaN                NaN
3       NaN                NaN
4       NaN                NaN

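The documentation for the keep_shape parameter says:

keep_shape : bool, default False
    If true, all rows and columns are kept.
    Otherwise, only the ones with different values are kept.

Because we changed the parameter from its default of False to True, .compare returns a DataFrame with the same number of rows as s0 and s1. The logic for this parameter is the `if not keep_shape:` block in the implementation quoted in the question.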
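A small side-by-side sketch of the two settings may help here. It uses made-up numeric values (not the random data from the examples above) so the output is easy to verify by eye.

```python
import pandas as pd

# Illustrative data: only the last row differs.
s0 = pd.Series([1.0, 2.0, 3.0])
s1 = pd.Series([1.0, 2.0, 30.0])

# keep_shape=False (default): rows without a difference are filtered out.
trimmed = s0.compare(s1)
# keep_shape=True: the filter is skipped and equal rows are shown as NaN.
full = s0.compare(s1, keep_shape=True)

print(trimmed.index.tolist())        # [2]
print(full.index.tolist())           # [0, 1, 2]
print(full["self"].isna().tolist())  # [True, True, False]
```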

Example 4 -- keep_equal=True

import numpy as np  # 1.23.3
import pandas as pd  # 1.5.1

np.random.seed(0)

s0 = pd.Series(np.random.random(size=(5)))
s1 = s0.copy()

# change the 0th element in `s1` to something else
s1.iloc[0] = "a different value"

print(s0.compare(s1, keep_equal=True))

Out:

       self              other
0  0.548814  a different value

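The documentation for the keep_equal parameter says:

keep_equal : bool, default False
    If true, the result keeps values that are equal.
    Otherwise, equal values are shown as NaNs.

Based on this, you might expect the result to include the rows where s0 and s1 are equal. That is not the case. Why? The logic for this parameter is short: it is the `if not keep_equal:` block in the implementation. If keep_equal is set to True, that conditional is skipped and mask is not applied to self and other.

But! Further down, in the keep_shape conditional, mask is applied as a boolean filter, which drops the rows where mask is False. So when comparing two Series, changing only the keep_equal parameter does not actually do anything. I have logged an issue with pandas about this.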
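The interaction between the two parameters can be checked with a short sketch. The values are illustrative; the point is that keep_equal only becomes visible on a Series once keep_shape=True stops the equal rows from being filtered out.

```python
import pandas as pd

# Illustrative data: only the last row differs.
s0 = pd.Series([1.0, 2.0, 3.0])
s1 = pd.Series([1.0, 2.0, 30.0])

# keep_equal alone: equal rows are still removed by the keep_shape step,
# so the result matches the plain call.
only_equal = s0.compare(s1, keep_equal=True)
print(only_equal.equals(s0.compare(s1)))  # True

# keep_equal together with keep_shape: every row is kept with its real value.
both = s0.compare(s1, keep_shape=True, keep_equal=True)
print(both)
```

In the second result all three rows appear, and rows 0 and 1 show 1.0 and 2.0 in both columns instead of NaN.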

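Extras

  • The align_axis parameter essentially transposes the layout of the result: "self" and "other" are stacked as rows instead of placed side by side as columns.
  • result_names lets you change the names of the columns in the output (the default is "self" and "other").
  • Comparing DataFrame instances works similarly, but the result has MultiIndex columns (level 0 holds the original column names, level 1 holds the names from the result_names argument).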
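The three points above can be sketched with a pair of small DataFrames. The frames and the names "left"/"right" are made up for illustration, and result_names assumes pandas 1.5.0 or later.

```python
import pandas as pd

# Illustrative frames: column "a" differs in row 1, column "b" in row 2.
df0 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df1 = pd.DataFrame({"a": [1, 20, 3], "b": [4, 5, 60]})

# Default layout: MultiIndex columns, original name on level 0,
# "self"/"other" on level 1.
wide = df0.compare(df1)
print(wide.columns.tolist())
# [('a', 'self'), ('a', 'other'), ('b', 'self'), ('b', 'other')]

# align_axis=0: "self"/"other" become the inner level of the row index.
tall = df0.compare(df1, align_axis=0)
print(tall.index.tolist())
# [(1, 'self'), (1, 'other'), (2, 'self'), (2, 'other')]

# result_names: replace the default "self"/"other" labels (pass a tuple).
named = df0.compare(df1, result_names=("left", "right"))
print(named.columns.get_level_values(1).unique().tolist())
# ['left', 'right']
```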
