Python/Pandas - Creating a new Column showing Average of only the Largest value for each group

【Question title】: Python/Pandas - Creating a new Column showing Average of only the Largest value for each group
【Posted】: 2020-05-15 17:53:00
【Question description】:

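I am working with a dataset and trying to create a new column that shows, on every row, the average duration for each ID. The average should be based only on the last row of each ID group, which holds the largest number in that group. An example is below.

My current dataset: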

    ID      Date        DaysInDuration
    NCA   11/19/2019        31                 
    NCA   12/19/2019        62              
    NCA   12/19/2019        92             
    NCA   1/19/2020         120 * Last Row
    DTT   11/19/2019        31                 
    DTT   12/19/2019        62              
    DTT   12/19/2019        92             
    DTT   1/19/2020         100 * Last Row

I am trying to create this:

    ID      Date        DaysInDuration          AverageDurColumn * based only on the last-row number
    NCA   11/19/2019        31                        30
    NCA   12/19/2019        62                        30
    NCA   12/19/2019        92                        30
    NCA   1/19/2020         120 * Last Row            30
    DTT   11/19/2019        31                        25
    DTT   12/19/2019        62                        25
    DTT   12/19/2019        92                        25
    DTT   1/19/2020         100 * Last Row            25

Thanks to anyone who can help!

【Question discussion】:

Tags: python pandas average

【Solution 1】:

Here is a simple answer:

df['answer'] = df.groupby('ID')['DaysInDuration'].transform(lambda x: x.max()/x.count())

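I simply rephrased your question as "How do I take the maximum value per ID and divide it by the number of records that ID has?"

1. Group by ID

2. Take the maximum value for each ID

3. Divide it by the number of records for that ID

4. Use transform to apply the result to every row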

    ID        Date  DaysInDuration  answer
0  NCA  11/19/2019              31      30
1  NCA  12/19/2019              62      30
2  NCA  12/19/2019              92      30
3  NCA   1/19/2020             120      30
4  DTT  11/19/2019              31      25
5  DTT  12/19/2019              62      25
6  DTT  12/19/2019              92      25
7  DTT   1/19/2020             100      25
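
The four steps above can also be written out one at a time, which may make it easier to see what `transform` does in a single call. This is only a sketch: the DataFrame is rebuilt from the sample in the question, and the names `grouped` and `per_id` are illustrative, not part of the original answer.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": ["NCA"] * 4 + ["DTT"] * 4,
    "Date": ["11/19/2019", "12/19/2019", "12/19/2019", "1/19/2020"] * 2,
    "DaysInDuration": [31, 62, 92, 120, 31, 62, 92, 100],
})

# Steps 1-3: one value per ID (largest duration divided by the row count).
grouped = df.groupby("ID")["DaysInDuration"]
per_id = grouped.max() / grouped.count()

# Step 4: broadcast the per-ID value back onto every row of that ID.
df["answer"] = df["ID"].map(per_id)

# The one-line transform from the answer, kept for comparison.
check = df.groupby("ID")["DaysInDuration"].transform(lambda x: x.max() / x.count())
```

Here `per_id` holds one number per ID, and `map` plays the role that `transform` plays in the one-liner: it repeats that number on each row of the group.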

【Discussion】:

Thank you, I like your answer and it worked, but I also get this message: SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead. Any idea why?
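
One common cause of this warning is that `df` was itself created by filtering or slicing another DataFrame, so pandas cannot tell whether the new column is meant for the original or for the subset. The thread does not show how `df` was built, so this is an assumption. The sketch below uses a hypothetical parent table `raw` and a hypothetical filter, and takes an explicit `.copy()` before adding the column.

```python
import pandas as pd

# Hypothetical larger table; `raw` and the filter below are assumptions for illustration.
raw = pd.DataFrame({
    "ID": ["NCA", "NCA", "DTT", "DTT", "XYZ"],
    "DaysInDuration": [31, 120, 31, 100, 5],
})

# An explicit copy makes `df` an independent DataFrame, so adding a column
# to it is unambiguous and leaves `raw` untouched.
df = raw[raw["ID"].isin(["NCA", "DTT"])].copy()

df["answer"] = df.groupby("ID")["DaysInDuration"].transform(lambda x: x.max() / x.count())
```

If the warning comes from somewhere else in the pipeline, the same idea should apply at the point where the subset is first created.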

【Solution 2】:

We can use GroupBy.transform with last and size here:

grp = df.groupby('ID')
last = grp['DaysInDuration'].transform('last')
n = grp['DaysInDuration'].transform('size')

df['AverageDurColumn'] = last / n

    ID        Date  DaysInDuration  AverageDurColumn
0  NCA  11/19/2019              31              30.0
1  NCA  12/19/2019              62              30.0
2  NCA  12/19/2019              92              30.0
3  NCA   1/19/2020             120              30.0
4  DTT  11/19/2019              31              25.0
5  DTT  12/19/2019              62              25.0
6  DTT  12/19/2019              92              25.0
7  DTT   1/19/2020             100              25.0

【Discussion】:

【Solution 3】:

Try:

import numpy as np

df["AverageDurColumn"]=np.where(df["ID"].ne(df["ID"].shift(-1)), df["DaysInDuration"], 0)

df=df.set_index("ID")
df["AverageDurColumn"]=df.groupby("ID")["AverageDurColumn"].mean()
df=df.reset_index()

Output:

    ID        Date  DaysInDuration  AverageDurColumn
0  NCA  11/19/2019              31                30
1  NCA  12/19/2019              62                30
2  NCA  12/19/2019              92                30
3  NCA   1/19/2020             120                30
4  DTT  11/19/2019              31                25
5  DTT  12/19/2019              62                25
6  DTT  12/19/2019              92                25
7  DTT   1/19/2020             100                25
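
To see why averaging works here, it may help to look at the intermediate values of the `np.where` step on the sample data. This is a sketch only; the names `is_last` and `marked` are illustrative and do not appear in the answer.

```python
import numpy as np
import pandas as pd

# Sample data copied from the question (Date omitted, it is not used here).
df = pd.DataFrame({
    "ID": ["NCA"] * 4 + ["DTT"] * 4,
    "DaysInDuration": [31, 62, 92, 120, 31, 62, 92, 100],
})

# True where the next row has a different ID (or there is no next row),
# i.e. on the last row of each consecutive block of IDs.
is_last = df["ID"].ne(df["ID"].shift(-1))

# Keep the duration on those rows and put 0 everywhere else.
marked = np.where(is_last, df["DaysInDuration"], 0)
```

Each group then contains a single non-zero value, so its mean equals the last value divided by the group size. Note that this relies on the rows of each ID being contiguous.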

【Discussion】:

【Solution 4】:

A one-liner solution:

df["AverageDurColumn"]=df.groupby("ID").DaysInDuration.transform(lambda s: s.iloc[-1]/s.size)

【Discussion】:

Thanks @Kantal! Compared with @MattR's answer the numbers seem slightly off, but that may be down to how I implemented it on my dataset.
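
One possible reason for the difference is that `s.iloc[-1]` takes whatever value is on the last row of each group, while `x.max()` takes the largest value. The two agree only when each group is sorted so that the largest value comes last. The sketch below uses hypothetical, deliberately unsorted data to show the gap; all variable names are illustrative.

```python
import pandas as pd

# Hypothetical data: the NCA rows are not sorted, so the last row is not the maximum.
df = pd.DataFrame({
    "ID": ["NCA", "NCA", "NCA", "DTT", "DTT"],
    "DaysInDuration": [31, 120, 62, 31, 100],
})

grp = df.groupby("ID")["DaysInDuration"]
by_max = grp.transform(lambda x: x.max() / x.count())   # largest value per ID
by_last = grp.transform(lambda s: s.iloc[-1] / s.size)  # value on the last row per ID

# Sorting within each ID first makes the last row the largest one again.
sorted_df = df.sort_values(["ID", "DaysInDuration"])
by_last_sorted = (
    sorted_df.groupby("ID")["DaysInDuration"]
    .transform(lambda s: s.iloc[-1] / s.size)
    .sort_index()  # restore the original row order
)
```

A second possible source of difference is that `count()` skips missing values whereas `size` counts every row, so the two denominators diverge if `DaysInDuration` contains NaN.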

【Solution 5】:

You can use groupby, apply and merge:

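new_df = df.merge(
  df
  .groupby(['ID'])
  # largest value per ID divided by that ID's row count
  .apply(lambda x: x['DaysInDuration'].max() / len(x['DaysInDuration']))
  # turn the per-ID Series into a two-column frame with a named result column
  .reset_index(name='AverageDurColumn'),
  how='outer',
  on='ID',
)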
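
The aggregate-then-merge idea can also be sketched without `apply`, using named aggregation. This is a variant, not the answer's own code: the names `per_id`, `max_days` and `n_rows` are illustrative, and a left merge is used here (an assumption) so that the original row order is kept.

```python
import pandas as pd

# Sample data copied from the question (Date omitted, it is not used here).
df = pd.DataFrame({
    "ID": ["NCA"] * 4 + ["DTT"] * 4,
    "DaysInDuration": [31, 62, 92, 120, 31, 62, 92, 100],
})

# One row per ID with the largest duration and the number of rows.
per_id = (
    df.groupby("ID")
    .agg(max_days=("DaysInDuration", "max"), n_rows=("DaysInDuration", "size"))
    .reset_index()
)
per_id["AverageDurColumn"] = per_id["max_days"] / per_id["n_rows"]

# Attach the per-ID value to every original row.
new_df = df.merge(per_id[["ID", "AverageDurColumn"]], how="left", on="ID")
```

Keeping `per_id` as a separate table can be handy when the per-group numbers are needed elsewhere, for example in a summary report.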

【Discussion】: