Entering df.groupby....max() result into a new column (pandas): answers

【Question title】: Entering df.groupby....max() result into a new column. Pandas
【Posted】: 2021-12-25 08:41:54
【Question】:

My data is about cricket, a sport similar to baseball. Each inning has at most 20 overs, and each over has approximately 6 balls.

Data:

        season  match_id    inning  sum_total_runs  sum_total_wickets   over/ball   innings_score
32      2008    60          1       61              0                   5.1         0
33      2008    60          1       61              1                   5.2         0
34      2008    60          1       61              1                   5.3         0
35      2008    60          1       61              1                   5.4         0
36      2008    60          1       61              1                   5.5         0
...     ...     ...         ...     ...             ...                 ...         ...
179073  2019    11415       2       152             5                   19.2        0
179074  2019    11415       2       154             5                   19.3        0
179075  2019    11415       2       155             6                   19.4        0
179076  2019    11415       2       157             6                   19.5        0 
179077  2019    11415       2       157             7                   19.6        0

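111972 rows × 7 columns

innings_score is a new column I created (with a default value of 0). I want to update it. The values I want to fill in are the result of the df.groupby below.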

In[]:
df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].max()

Out[]:
season  match_id  inning
2008    60        1         222
                  2          82
        61        1         240
                  2         207
        62        1         129
                           ... 
2019    11413     2         170
        11414     1         155
                  2         162
        11415     1         152
                  2         157
Name: sum_total_runs, Length: 1276, dtype: int64

I want innings_score to look like this:

        season  match_id    inning  sum_total_runs  sum_total_wickets   over/ball   innings_score
32      2008    60          1       61              0                   5.1         222
33      2008    60          1       61              1                   5.2         222
34      2008    60          1       61              1                   5.3         222
35      2008    60          1       61              1                   5.4         222
36      2008    60          1       61              1                   5.5         222
...     ...     ...         ...     ...             ...                 ...         ...
179073  2019    11415       2       152             5                   19.2        157
179074  2019    11415       2       154             5                   19.3        157
179075  2019    11415       2       155             6                   19.4        157
179076  2019    11415       2       157             6                   19.5        157
179077  2019    11415       2       157             7                   19.6        157

111972 rows × 7 columns

【Comments】:

Tags: python pandas group-by

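【Solution 1】:

One approach is to set these three columns as the index, assign the groupby result as a new column, and then reset the index.

While these columns are the index, the groupby result and the DataFrame share the same index levels, so pandas aligns them automatically and puts the right value in each row. Resetting the index then turns them back into ordinary columns.

Something like this: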

In [46]: df
Out[46]:
   season  match_id  inning  sum_total_runs  sum_total_wickets  over/ball
0    2008        60       1              61                  0        5.1
1    2008        60       1              61                  1        5.2
2    2008        60       1              61                  1        5.3
3    2008        60       1              61                  1        5.4
4    2008        60       1              61                  1        5.5
5    2019     11415       2             152                  5       19.2
6    2019     11415       2             154                  5       19.3
7    2019     11415       2             155                  6       19.4
8    2019     11415       2             157                  6       19.5
9    2019     11415       2             157                  7       19.6

In [47]: df.set_index(['season', 'match_id', 'inning']).assign(innings_score=df.groupby(['season', 'match_id', 'inning'])['sum_total_runs'].max()).reset_index()
Out[47]:
   season  match_id  inning  sum_total_runs  sum_total_wickets  over/ball  innings_score
0    2008        60       1              61                  0        5.1             61
1    2008        60       1              61                  1        5.2             61
2    2008        60       1              61                  1        5.3             61
3    2008        60       1              61                  1        5.4             61
4    2008        60       1              61                  1        5.5             61
5    2019     11415       2             152                  5       19.2            157
6    2019     11415       2             154                  5       19.3            157
7    2019     11415       2             155                  6       19.4            157
8    2019     11415       2             157                  6       19.5            157
9    2019     11415       2             157                  7       19.6            157
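For comparison, the same column can be filled without touching the index by using groupby(...).transform("max"). This is a different technique from the set_index/assign approach above and is not part of the original answer. It is a minimal sketch that rebuilds the ten sample rows shown above, so the first group's maximum is 61 here rather than the 222 found in the full data.

```python
import pandas as pd

# Sample rows copied from the example above (a small subset of the real data).
df = pd.DataFrame({
    "season": [2008] * 5 + [2019] * 5,
    "match_id": [60] * 5 + [11415] * 5,
    "inning": [1] * 5 + [2] * 5,
    "sum_total_runs": [61, 61, 61, 61, 61, 152, 154, 155, 157, 157],
    "sum_total_wickets": [0, 1, 1, 1, 1, 5, 5, 6, 6, 7],
    "over/ball": [5.1, 5.2, 5.3, 5.4, 5.5, 19.2, 19.3, 19.4, 19.5, 19.6],
})

keys = ["season", "match_id", "inning"]

# transform("max") returns one value per original row (the maximum of that
# row's group), so the result can be assigned directly as a column.
df["innings_score"] = df.groupby(keys)["sum_total_runs"].transform("max")

print(df)
```

Because transform keeps the original row index, no set_index or reset_index step is needed, and running the cell a second time does not change which columns exist.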

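【Comments】:

None of ['season', 'match_id', 'inning'] are in the columns
Are you getting this error? From your question, they appear to be columns. Make sure you are running this on the original DataFrame, not on the result of the groupby.
Yes, they are in my df, and I ran it on the original df.
Can you update the question with the code you ran and the traceback you see? If the DataFrame looks like the one in your question, I don't see how this could fail; if those columns did not exist, the groupby in your question would fail too.
Thanks, I re-ran the whole file and it works fine.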

【Solution 2】:

I would use assign. Starting with a simple example:

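import pandas as pd

# Small example frame: two grouping columns and one value column
dt = pd.DataFrame({"name1":["A", "A", "B", "B", "C", "C"], "name2":["C", "C", "C", "D", "D", "D"], "value":[1, 2, 3, 4, 5, 6]})
grouping_variables = ["name1", "name2"]
# Move the grouping columns into the index so the groupby result can align on it
dt = dt.set_index(grouping_variables)
# The per-group maximum is matched to each row through the shared index
dt = dt.assign(new_column=dt.groupby(grouping_variables)["value"].max())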

As you can see, you set grouping_variables as the index before running the assignment.

If you don't want to keep the DataFrame indexed by grouping_variables, you can always reset the index:

dt.reset_index()
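Two details of this approach are easy to trip over, so here is a small sketch of both. reset_index returns a new DataFrame rather than modifying dt in place, so the result has to be kept. Calling set_index a second time on a frame whose grouping columns are already in the index raises a KeyError, which appears to be what happened in the comments. The names indexed, result, and second_call_failed are illustrative and not from the original answer.

```python
import pandas as pd

dt = pd.DataFrame({
    "name1": ["A", "A", "B", "B", "C", "C"],
    "name2": ["C", "C", "C", "D", "D", "D"],
    "value": [1, 2, 3, 4, 5, 6],
})
grouping_variables = ["name1", "name2"]

# Same steps as the answer, kept in a separate variable so dt stays untouched.
indexed = dt.set_index(grouping_variables)
indexed = indexed.assign(
    new_column=indexed.groupby(grouping_variables)["value"].max()
)

# reset_index returns a new DataFrame, so keep the result.
result = indexed.reset_index()

# The grouping columns are now index levels, not columns, so a second
# set_index call cannot find them and raises KeyError.
try:
    indexed.set_index(grouping_variables)
    second_call_failed = False
except KeyError as exc:
    second_call_failed = True
    print("KeyError:", exc)

print(result)
```

If a notebook cell containing set_index may be run more than once, starting again from the original frame (as re-running the whole file does) avoids the error.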

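【Comments】:

None of ['season', 'match_id', 'inning'] are in the columns
Thanks, I re-ran the whole file and it works fine.
I had already set these three columns as the index before pasting your code, so it said there were no such columns.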