Python Pandas: groupby one column, aggregate in only one other column, but take corresponding data

【Question title】: Python Pandas: groupby one column, aggregate in only one other column, but take corresponding data
【Posted】: 2021-02-04 21:27:36
【Question description】:

I have seen a number of related SO questions, such as this and this, but they don't seem to be quite what I'm looking for. Suppose I have a dataframe like this:

import pandas as pd
df = pd.DataFrame(columns=['patient', 'parent csn', 'child csn', 'days'])
df.loc[0] = [0, 0, 10, 5]
df.loc[1] = [0, 0, 11, 3]
df.loc[2] = [0, 1, 12, 6]
df.loc[3] = [0, 1, 13, 4]
df.loc[4] = [1, 2, 20, 4]
df
Out[9]: 
  patient parent csn child csn days
0       0          0        10    5
1       0          0        11    3
2       0          1        12    6
3       0          1        13    4
4       1          2        20    4

What I would like to do is something like this:

grp_df = df.groupby(['parent csn']).min()

The problem is that the result computes the minimum of every column (other than parent csn) independently, and produces:

grp_df
            patient  child csn  days
parent csn                          
0                 0         10     3
1                 0         12     4
2                 1         20     4

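You can see that, in the first row, the days value and the child csn value no longer come from the same row as they did before grouping. This is the output I want: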

grp_df
            patient  child csn  days
parent csn                          
0                 0         11     3
1                 0         13     4
2                 1         20     4

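How can I get it? I have code that iterates through the dataframe, which I think would work, but it is slow, even with Cython. I feel like this should be obvious, but I'm not finding it.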

I also looked at this question, but putting child csn in the groupby list won't work, because child csn varies from row to row just as days does.

This question seems more likely to apply, but I don't find the solution very intuitive.

This question also seems likely to apply, but again, the answers aren't very intuitive, and I do want only one row per parent csn.

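One more detail: the row containing the minimum days value might not be unique. In that case I just want one row, and I don't care which.

Thanks so much for your time!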

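【Question comments】:

Tags: python-3.x pandas pandas-groupby aggregate

【Solution 1】:

Rather than using .groupby on its own, you can use groupby to build a boolean filter and then use it to select the rows you need from the dataframe: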

s = df.groupby('parent csn')['days'].transform('min') == df['days']
df = df[s]
df

Out[1]: 
   patient  parent csn  child csn  days
1        0           0         11     3
3        0           1         13     4
4        1           2         20     4

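For example, this is what s looks like if I add it to the dataframe as a column. You then simply keep the True rows, which are the rows whose days value equals the minimum days of their group.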

Out[2]: 
   patient  parent csn  child csn  days      s
0        0           0         10     5  False
1        0           0         11     3   True
2        0           1         12     6  False
3        0           1         13     4   True
4        1           2         20     4   True
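
The question mentions that the minimum days value may not be unique within a group. With the boolean filter above, every tied row is kept. The following is a minimal sketch, using hypothetical data with a tie, of one way to reduce the result to a single row per parent csn with drop_duplicates:

```python
import pandas as pd

# Hypothetical data: both rows of parent csn 0 share the minimum days value.
df = pd.DataFrame(
    [[0, 0, 10, 3],
     [0, 0, 11, 3],
     [0, 1, 12, 6],
     [0, 1, 13, 4]],
    columns=['patient', 'parent csn', 'child csn', 'days'],
)

# True where a row's days equals the minimum days of its parent csn group.
mask = df.groupby('parent csn')['days'].transform('min') == df['days']

# The mask alone keeps every tied row (two rows for parent csn 0 here).
tied = df[mask]

# Keep only the first remaining row for each parent csn.
one_per_group = tied.drop_duplicates('parent csn')
print(one_per_group)
```

Which of the tied rows survives depends on row order, which matches the "I don't care which" requirement in the question.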

【Comments】:

【Solution 2】:

To get your desired output, you need sort_values followed by groupby first

df_final = (df.sort_values(['parent csn', 'patient', 'days'])
              .groupby('parent csn').first())

Out[813]:
            patient  child csn  days
parent csn
0                 0         11     3
1                 0         13     4
2                 1         20     4
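
One caveat with this approach: GroupBy.first() returns the first non-null value of each column separately, so if the data contained missing values, a result row could mix values from different source rows. The following is a minimal sketch of a variant that always keeps whole rows, assuming the same data as in the question: sort by days within each parent csn, then keep the first row per group with drop_duplicates.

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 10, 5],
     [0, 0, 11, 3],
     [0, 1, 12, 6],
     [0, 1, 13, 4],
     [1, 2, 20, 4]],
    columns=['patient', 'parent csn', 'child csn', 'days'],
)

# Sort so the smallest days value comes first within each parent csn,
# then keep only the first row seen for each parent csn.
df_final = (df.sort_values(['parent csn', 'days'])
              .drop_duplicates('parent csn')
              .set_index('parent csn'))
print(df_final)
```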

【Comments】:

【Solution 3】:

By using .idxmin() instead of .min(), you can get the index (row identifier) of the row where "days" is smallest within each group:

Data creation:

import pandas as pd

data = [[0, 0, 10, 5],
        [0, 0, 11, 3],
        [0, 1, 12, 6],
        [0, 1, 13, 4],
        [1, 2, 20, 4]]
df = pd.DataFrame(data, columns=['patient', 'parent csn', 'child csn', 'days'])

print(df)
   patient  parent csn  child csn  days
0        0           0         10     5
1        0           0         11     3
2        0           1         12     6
3        0           1         13     4
4        1           2         20     4

day_minimum_row_indices = df.groupby("parent csn")["days"].idxmin()

print(day_minimum_row_indices)
parent csn
0    1
1    3
2    4
Name: days, dtype: int64

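From this you can see that the group parent csn 0 has its minimum days at row 1. Looking back at the original dataframe, row 1 has days == 3, which is indeed the minimum days for parent csn == 0. Parent csn == 1 has its minimum days at row 3, and so on.

We can use these row indices to subset back into our original dataframe: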

new_df = df.loc[day_minimum_row_indices]

print(new_df)
   patient  parent csn  child csn  days
1        0           0         11     3
3        0           1         13     4
4        1           2         20     4

Edit (tldr):

df.loc[df.groupby("parent csn")["days"].idxmin()]
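
The desired output in the question has parent csn as the index and exactly one row per group, even when the minimum is tied. Since idxmin() returns the label of the first occurrence of the minimum, it already yields one row per group. The following is a minimal sketch with hypothetical tied data, using set_index to reproduce that layout:

```python
import pandas as pd

# Hypothetical data with a tie: both rows of parent csn 0 have days == 3.
df = pd.DataFrame(
    [[0, 0, 10, 3],
     [0, 0, 11, 3],
     [0, 1, 12, 6],
     [0, 1, 13, 4]],
    columns=['patient', 'parent csn', 'child csn', 'days'],
)

# One index label per group: the first row holding the group's minimum days.
idx = df.groupby('parent csn')['days'].idxmin()

# Select those rows and move parent csn into the index, as in the question.
result = df.loc[idx].set_index('parent csn')
print(result)
```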

【Comments】:

【Solution 4】:

For some reason I can't explain, your dataframe contains columns of object type. This solution only works with numeric columns

df.days = df.days.astype(int)
df.iloc[df.groupby('parent csn').days.idxmin()]

Output:

  patient parent csn child csn  days
1       0          0        11     3
3       0          1        13     4
4       1          2        20     4
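
The code above uses .iloc, which happens to work here because the default index labels 0..4 coincide with the row positions. idxmin() returns index labels, so label-based .loc is the safer choice. The following is a minimal sketch using a hypothetical string index to show the difference:

```python
import pandas as pd

df = pd.DataFrame(
    [[0, 0, 10, 5],
     [0, 0, 11, 3],
     [0, 1, 12, 6],
     [0, 1, 13, 4],
     [1, 2, 20, 4]],
    columns=['patient', 'parent csn', 'child csn', 'days'],
    index=['a', 'b', 'c', 'd', 'e'],  # hypothetical non-integer labels
)

# idxmin() returns labels ('b', 'd', 'e'), not integer positions.
labels = df.groupby('parent csn')['days'].idxmin()

# Label-based lookup works with any index.
by_label = df.loc[labels]
print(by_label)

# Position-based lookup cannot interpret string labels and raises.
try:
    df.iloc[labels]
    iloc_ok = True
except (IndexError, TypeError, ValueError, KeyError):
    iloc_ok = False
print(iloc_ok)
```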

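【Comments】:

That's because the dataframe starts out empty. Pandas doesn't assume a data type for columns that have nothing in them, so it leaves them as "object" dtypes (the most flexible option). Then, when you "fill" the columns via .loc, they keep their "object" dtype. On an unrelated note, you should also use .loc in your answer, because idxmin() returns the index label associated with the minimum value, not the integer position (consider an index of ["a", "b", "c", "d", "e"]).
@AdrianKeister - please consider accepting @CameronRiddell's answer. It's the same idea, but it is more useful to other readers and was posted before my solution.
Sure, if you'd like. I can see they're more or less the same. Your answer is nice, though, because the code is all in one place.
@MichaelSzczesny I appreciate it! I added a "tldr" to my answer.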
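
The dtype behaviour described in the first comment of this thread can be checked directly. The following is a minimal sketch, assuming the column names from the question: it compares an empty frame with one built from the data up front, and shows one way to convert object columns to integers afterwards.

```python
import pandas as pd

cols = ['patient', 'parent csn', 'child csn', 'days']

# An empty frame has no data to infer from, so its columns are object dtype.
empty = pd.DataFrame(columns=cols)
print(empty.dtypes)

# Passing the data to the constructor lets pandas infer integer columns.
data = [[0, 0, 10, 5],
        [0, 0, 11, 3],
        [0, 1, 12, 6],
        [0, 1, 13, 4],
        [1, 2, 20, 4]]
typed = pd.DataFrame(data, columns=cols)
print(typed.dtypes)

# A frame that already holds integers in object columns can be converted.
as_object = typed.astype(object)
converted = as_object.astype('int64')
print(converted.dtypes)
```

Building the frame from a list of rows, as Solution 3 does, avoids the object columns in the first place.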