Pandas groupby aggregation yields extraneous groups. Bug?

【Question title】: Pandas groupby aggregation yields extraneous groups. Bug?
【Posted】: 2021-08-03 06:15:29
【Question description】:

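I'm having some trouble with data manipulation in pandas, and it looks like it might be a pandas bug? Would love some thoughts.

I have a dataframe my_df (sorted by index) that looks like this: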

                                                                          value1  value2
col0                                 col1             col2 col3     col4
0035ca76-b209-4c4e-9bba-b18459c4dceb 203          positive 173       148   0.0    0.086892
                                                  negative 1156      148   0.0    0.090347
                                                           1157      148   0.0    0.090347
                                                           1158      148   0.0    0.090347
                                                           1159      148   0.0    0.084884
                                                           1160      148   0.0    0.079942
                                                           1161      148   0.0    0.079824
                                                           1162      148   0.0    0.071289
                                                  positive 173       66    0.0    0.079831
                                                  negative 1156      66    0.0    0.082660
                                                           1157      66    0.0    0.082660
                                                           1158      66    0.0    0.082660
                                                           1159      66    0.0    0.084353
                                                           1160      66    0.0    0.076934
                                                           1161      66    0.0    0.076494
                                                           1162      66    0.0    0.070424
00e35aaf-050a-4f09-bf94-df994e4bf681 24           positive 14        38    0.0    0.073936
                                                  negative 134       38    0.0    0.075913
                                                           135       38    0.0    0.075913
                                                           136       38    0.0    0.074403
                                                           137       38    0.0    0.081120
                                                           138       38    0.0    0.078560
                                                           139       38    0.0    0.080680
                                                           140       38    0.0    0.073892
                                                  positive 14        1     0.0    0.051979
                                                  negative 134       1     0.0    0.043818
                                                           135       1     0.0    0.043818
                                                           136       1     0.0    0.049795
                                                           137       1     0.0    0.052171
                                                           138       1     0.0    0.048573
                                                           139       1     0.0    0.045205
                                                           140       1     0.0    0.054696
... more rows for this and other col0 + col1 combos

I'm trying to compute the sum of "value2" for each unique combination of [col0, col1, col2, col3]. As far as I can tell, the most logical way to do that is

my_df.groupby(level=list(range(4))).sum()

However, I'm getting really strange results that look like a pandas bug.

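# Group on the first four index levels (col0..col3), as in the call above
grouped = my_df.groupby(level=list(range(4)))
# Inspect only the first group
for name, group in grouped:
    print(group)
    break
sums = grouped.sum()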

Indeed, the first group is what I'd expect

                                                                            value1  value2
col0                                 col1             col2 col3     col4
0035ca76-b209-4c4e-9bba-b18459c4d681 199          positive 174       151    0.0     0.089186
                                                                     158    0.0     0.104250

The number of groups in grouped is correct (you'll have to take my word for it; I've verified it other ways), but sums is a mess and has a bajillion extraneous rows

(Pdb) len(grouped)
334
(Pdb) len(sums)
53760
(Pdb) sums[:30]
                                                                             value1    value2
col0                                 col1         col2     col3      col4
1f11aede-6aed-44ef-9296-004b6269662c 17           positive 7         1       0.0        0.0
                                                                     4       0.0        0.0
                                                                     5       0.0        0.0
                                                                     6       0.0        0.0
                                                                     7       0.0        0.0
                                                                     8       0.0        0.0
                                                                     11      0.0        0.0
                                                                     12      0.0        0.0
                                                                     24      0.0        0.0
                                                                     32      0.0        0.0
                                                                     33      0.0        0.0
                                                                     38      0.0        0.0
                                                                     39      0.0        0.0
                                                                     53      0.0        0.0
                                                                     56      0.0        0.0
                                                                     66      0.0        0.0
                                                                     69      0.0        0.0
                                                                     70      0.0        0.0
                                                                     72      0.0        0.0
                                                                     73      0.0        0.0
                                                                     75      0.0        0.0
                                                                     85      0.0        0.0
                                                                     91      0.0        0.0
                                                                     94      0.0        0.0
                                                                     116     0.0        0.0
                                                                     119     0.0        0.0

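The values in col4 vary widely across the dataframe. It looks like the groupby + aggregation is creating a sum row for every col4 value in the entire dataframe, rather than only for the col4 values actually associated with each group. In other words, most of these rows don't even have an entry in the original dataframe: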

(Pdb) my_df.loc[("1f11aede-6aed-44ef-9296-004b6269662c", 17, "positive", 7, 1)]
*** KeyError: ('1f11aede-6aed-44ef-9296-004b6269662c', 17, 'positive', 7, 1)

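Any idea what's going on here? This seems completely off-script from what the groupby API and tutorials describe. For instance, as far as I can tell, groupby => agg should create one row per group here.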

【Question discussion】:

Tags: pandas pandas-groupby

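【Solution 1】:

TL;DR: As of pandas 1.2.4, groupby behaves unintuitively when one of the index levels is categorical. Fix it with groupby(..., observed=True).

OK, this one took me a while. It turns out that if one of your index levels is Categorical, groupby behaves in a way that is, in my opinion, completely unintuitive.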

# Does some sort of cartesian products of index values if any of the indexes are Categorical.
my_df.groupby(level=list(range(4)))
# Doesn't do the cartesian product, in line with behavior for every other index type.
my_df.groupby(level=list(range(4)), observed=True)
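
To see the difference on something small, here is a minimal sketch on made-up data. The frame `df`, the level names `key` and `kind`, and the values are all hypothetical and not taken from the question. Passing `observed` explicitly in both calls avoids relying on whatever the default is in your pandas version.

```python
import pandas as pd

# Hypothetical toy data: "kind" is a categorical index level, "key" is a plain one.
kind_type = pd.CategoricalDtype(categories=["positive", "negative"], ordered=True)
df = pd.DataFrame(
    {
        "key": ["a", "a", "b"],
        "kind": pd.Series(["positive", "negative", "positive"], dtype=kind_type),
        "value": [1.0, 2.0, 3.0],
    }
).set_index(["key", "kind"])

# observed=False: one row per (key, category) pair, including ("b", "negative"),
# which never occurs in the data.
all_pairs = df.groupby(level=["key", "kind"], observed=False).sum()

# observed=True: only the combinations that are actually present.
seen_pairs = df.groupby(level=["key", "kind"], observed=True).sum()

print(len(all_pairs), len(seen_pairs))
```

With more levels and more categories the unobserved combinations multiply, which would be consistent with the jump from 334 groups to 53760 rows described in the question.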

In my case, col2 is categorical. That is, in my code I have:

import pandas as pd

col2_type = pd.api.types.CategoricalDtype(categories=["positive", "negative"], ordered=True)
col2 = ["positive", "negative", "negative", "negative"]  # ... (remaining values elided)
# Make col2 categorical
my_df['col2'] = my_df.assign(col2=col2)['col2'].astype(col2_type)
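
If you are not sure whether any of your own index levels ended up categorical, a quick check along these lines may help. This is only a sketch: `categorical_levels` is a hypothetical helper name, not a pandas function, and `idx` is a made-up index.

```python
import pandas as pd


def categorical_levels(index: pd.Index) -> list:
    """Return the names of the index levels whose dtype is categorical."""
    names = []
    for i in range(index.nlevels):
        level_values = index.get_level_values(i)
        if isinstance(level_values.dtype, pd.CategoricalDtype):
            names.append(index.names[i])
    return names


# Hypothetical two-level index: a plain "key" level and a categorical "kind" level.
idx = pd.MultiIndex.from_arrays(
    [["a", "b"], pd.Categorical(["positive", "negative"])],
    names=["key", "kind"],
)
print(categorical_levels(idx))
```

On a frame like the one in the question you would call it as `categorical_levels(my_df.index)`; any name it returns is a level for which `observed=True` matters.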

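A bit of a mea culpa here: this behavior is documented (see the observed parameter). That said, I'm not the only one who has been thoroughly confused by it.

With any luck, this "issue" will be fixed in an upcoming release.

【Discussion】:

【Solution 2】: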

Try the following statements, which group the data by the slice_id, col1, col2, and col3 columns to get the sum of the value2 column

my_df.groupby(['slice_id', 'col1', 'col2', 'col3']).agg('sum')

or

my_df.groupby(['slice_id', 'col1', 'col2', 'col3'])[['value2']].agg('sum')
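
As a rough sketch of the second form on made-up data (the frame `toy` and its values are hypothetical, and only two of the four keys are used to keep it short): grouping by name also accepts index level names, and if one of the keys is categorical it is probably safest to pass `observed=True` here as well, assuming you only want combinations that occur in the data.

```python
import pandas as pd

# Hypothetical toy frame; "col2" is an ordered categorical, as in the question.
kind_type = pd.CategoricalDtype(categories=["positive", "negative"], ordered=True)
toy = pd.DataFrame(
    {
        "slice_id": ["a", "a", "a", "b"],
        "col2": pd.Series(
            ["positive", "negative", "negative", "positive"], dtype=kind_type
        ),
        "value1": [0.0, 0.0, 0.0, 0.0],
        "value2": [0.1, 0.2, 0.3, 0.4],
    }
).set_index(["slice_id", "col2"])

# Level names can be passed to groupby like column names; observed=True keeps
# only the (slice_id, col2) pairs that occur in the data.
value2_sums = toy.groupby(["slice_id", "col2"], observed=True)[["value2"]].agg("sum")
print(value2_sums)
```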

【Discussion】: