Pandas returning empty groups in groupby

[Question title]: Pandas returning empty groups in groupby
[Posted]: 2017-03-09 07:30:06
[Question description]:

I have a Pandas DataFrame with 3 columns: target, pred and conf_bin. If I run groupby(by='conf_bin').apply(...), my apply function gets called with empty DataFrames for values that do not appear in the conf_bin column. How is this possible?

Details

The DataFrame looks like this:

        target  pred conf_bin
0            5     6     0.50
1            4     4     0.60
2            4     4     0.50
3            4     3     0.50
4            4     5     0.50
5            5     5     0.55
6            5     5     0.55
7            5     5     0.55

Apparently conf_bin is a numeric bin whose values lie in np.arange(0, 1, 0.05). However, not all of those values are present in the data:

In [224]: grp = tp.groupby(by='conf_bin')

In [225]: grp.groups.keys()
Out[225]: dict_keys([0.5, 0.60000000000000009, 0.35000000000000003, 0.75, 0.85000000000000009, 0.65000000000000002, 0.55000000000000004, 0.80000000000000004, 0.20000000000000001, 0.45000000000000001, 0.40000000000000002, 0.30000000000000004, 0.70000000000000007, 0.25])

So, for example, the values 0 and 0.05 do not appear. However, when I run apply on the groups, my function does get called for those values:

In [226]: grp.apply(lambda x: x.shape)
Out[226]:
conf_bin
0.00        (0, 3)
0.05        (0, 3)
0.10        (0, 3)
0.15        (0, 3)
0.20       (22, 3)
0.25       (75, 3)
0.30       (95, 3)
0.35      (870, 3)
0.40     (8505, 3)
0.45    (40068, 3)
0.50    (51238, 3)
0.55    (54305, 3)
0.60    (47191, 3)
0.65    (38977, 3)
0.70    (34444, 3)
0.75    (20435, 3)
0.80     (3352, 3)
0.85        (4, 3)
0.90        (0, 3)
dtype: object

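Questions:

How does Pandas know that the values 0.0 and 0.05 are "meaningful", given that they do not appear in my DataFrame?
Why does it call my apply function with empty DataFrame objects for values that do not appear in grp.groups?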

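[Question comments]:

Could you provide a self-contained example, with sample data, that demonstrates the problem?
What are the dtypes? Could conf_bin be categorical, with all the bins listed in its categories?
@piRSquared is correct. The dtype of conf_bin is category. Thanks!!
For the categorical case, see stackoverflow.com/a/50579578/4755520. TL;DR: use .groupby(..., observed=True).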
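
The diagnosis in the comments can be reproduced with a small sketch. The toy data below is made up for illustration (it is not the asker's DataFrame), and it assumes a pandas version that has the observed parameter (0.23 or later). It uses size() rather than apply, since how apply treats empty groups may differ between pandas versions.

```python
import pandas as pd

# Hypothetical toy data: conf_bin is categorical and declares four bins,
# but the bin 0.45 never occurs in the rows.
df = pd.DataFrame({
    "target": [5, 4, 4, 5],
    "pred": [6, 4, 3, 5],
    "conf_bin": pd.Categorical(
        [0.50, 0.60, 0.50, 0.55],
        categories=[0.45, 0.50, 0.55, 0.60],
    ),
})

print(df.dtypes)  # conf_bin is reported as "category", not float64

# observed=False: every declared category becomes a group, even empty ones
sizes_all = df.groupby("conf_bin", observed=False).size()

# observed=True: only categories that actually occur become groups
sizes_observed = df.groupby("conf_bin", observed=True).size()

print(sizes_all)
print(sizes_observed)
```

Passing observed explicitly, instead of relying on the default, keeps the behavior the same regardless of which pandas version is installed.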

Tags: python pandas

[Solution 1]:

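I ran into this problem too; it popped up when I was trying to create a subplot for each category in my dataframe.

I came up with the following workaround (based on this SO post), which pulls the non-empty groups into a list.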

groups = df.groupby('conf_bin')
group_list = [(index, group) for index, group in groups if len(group) > 0]

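It does break the implicit contract of "you handle your data inside pandas", and it is probably not great for memory, but it works.

Now you can iterate over the list of groups with the same interface you would use on the groupby object, e.g.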

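import matplotlib.pyplot as plt

# squeeze=False keeps axes a 2-D array even when there is only one group
fig, axes = plt.subplots(nrows=len(group_list), ncols=1, squeeze=False)
for (index, group), ax in zip(group_list, axes.flatten()):
    group['target'].plot(ax=ax, title=str(index))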
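
If conf_bin is categorical, as it turned out to be in the question, a possible alternative that stays inside pandas is to drop the unused categories before grouping. This is only a sketch on made-up data, not part of the workaround above; it relies on the Series.cat.remove_unused_categories accessor method.

```python
import pandas as pd

# Hypothetical toy data: the category 0.45 is declared but never used
df = pd.DataFrame({
    "target": [5, 4, 4, 5],
    "conf_bin": pd.Categorical(
        [0.50, 0.60, 0.50, 0.55],
        categories=[0.45, 0.50, 0.55, 0.60],
    ),
})

# Drop categories that never occur, so groupby has no empty groups to emit
df["conf_bin"] = df["conf_bin"].cat.remove_unused_categories()

# Even with observed=False, every remaining category has at least one row
groups = df.groupby("conf_bin", observed=False)
group_keys = [key for key, group in groups]
print(group_keys)
```

The trade-off is that the column no longer remembers the full set of bins, which matters if later code expects all of them to be present.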

[Comments]: