How to get the average for each date by considering the current date and all previous dates' data

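[Question title]: How to get average for each date by considering current date and previous all date data
[Posted]: 2020-12-22 14:23:55
[Question]:

I have the following dataframe, and I want to compute a cumulative average for each date, taking into account the data for the current date and all previous dates.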

import pandas as pd

df = pd.DataFrame({'Items':['Item1', 'Item2', 'Item1', 'Item2', 'Item1', 'Item2', 'Item1', 'Item2', 'Item1', 'Item2', 'Item1'],
         'Variable': ['V1', 'V2', 'V1', 'V2', 'V1', 'V2', 'V1', 'V2', 'V1', 'V2', 'V1'],
         'Date': ['2020-12-16', '2020-12-16', '2020-12-16', '2020-12-16', '2020-12-17', '2020-12-17', '2020-12-17', '2020-12-17', '2020-12-18', '2020-12-18', '2020-12-18'],
         'Value': [5, 2, 5, 1, 1, 1, 1, 2, 1, 1, 1]})

df = df.sort_values(['Date'], ascending=[True])

But the script below does not help:

df.groupby(['Items', 'Variable', 'Date'])['Value'].expanding().mean().reset_index(name='Value')

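I need the result as follows:

In MS Excel, we can find the average for the latest date -2018 by selecting all previous rows, as shown below:

As above, I want to compute this for all dates.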

[Comments]:

Tags: python python-3.x pandas dataframe pandas-groupby

[Solution 1]:

Try:

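import numpy as np

df = df.sort_values(['Items', 'Date', 'Variable'], ascending=[True, True, True])

# Group on the sorted frame itself so the results keep df's index labels
# and align row by row when assigned back.
x = df.groupby(['Items', 'Variable'])['Value']
index = x.cumcount() + 1            # 1-based position of each row within its group
df['Value'] = x.cumsum() / index    # running sum / running count = cumulative mean

# Keep only the last row of each (Items, Variable, Date): the mean up to and including that date.
df1 = df[np.where(df[['Items', 'Variable', 'Date']].duplicated(keep='last'), False, True)].reset_index(drop=True)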

df1:

   Items Variable        Date     Value
0  Item1       V1  2020-12-16  5.000000
1  Item1       V1  2020-12-17  3.000000
2  Item1       V1  2020-12-18  2.333333
3  Item2       V2  2020-12-16  1.500000
4  Item2       V2  2020-12-17  1.500000
5  Item2       V2  2020-12-18  1.400000

Edit:

Instead of df[np.where(df[['Items', 'Variable', 'Date']].duplicated(keep='last'), False, True)].reset_index(drop=True)

use df.drop_duplicates(subset=['Items', 'Variable', 'Date'], keep='last').reset_index(drop=True)
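
Putting the edit together with the steps above, the whole approach could look roughly like the sketch below. It rebuilds the question's data so it runs on its own. The column name `CumMean` and the variable names `grouped` and `result` are illustrative choices, not part of the original answer.

```python
import pandas as pd

# Same data as in the question, written compactly.
df = pd.DataFrame({
    'Items': ['Item1', 'Item2'] * 5 + ['Item1'],
    'Variable': ['V1', 'V2'] * 5 + ['V1'],
    'Date': ['2020-12-16'] * 4 + ['2020-12-17'] * 4 + ['2020-12-18'] * 3,
    'Value': [5, 2, 5, 1, 1, 1, 1, 2, 1, 1, 1],
})

# Sort so that, inside each (Items, Variable) group, rows run in date order.
df = df.sort_values(['Items', 'Variable', 'Date'])

# Cumulative mean per group = running sum / running row count.
grouped = df.groupby(['Items', 'Variable'])['Value']
df['CumMean'] = grouped.cumsum() / (grouped.cumcount() + 1)

# The last row of each date carries the mean of that date plus all earlier dates.
result = (df.drop_duplicates(subset=['Items', 'Variable', 'Date'], keep='last')
            .drop(columns='Value')
            .reset_index(drop=True))
print(result)
```

Grouping on `df` directly means the running sum and count share `df`'s index, so the new column lines up with the right rows without any `.values` conversion.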

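[Comments]:

Because of (x.index+1), I am not getting the exact result.
@Pygirl Thank you very much. This works. Merry Christmas and Happy New Year in advance. Thanks :)
@Pygirl Nice idea to use cumsum and cumcount + 1. I think you can use drop_duplicates instead of np.where + duplicated, and there is no need to specify the ascending parameter, since its default value is True.
@Pygirl If you like, I can edit the answer :)
I have updated my answer. I was thinking of using drop_duplicate but could not get the syntax right. Your comment made it clear how to go from np.where + duplicated --> drop_duplicate. Thanks @ShubhamSharma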

[Solution 2]:

From what I can see in your Excel sheet, this is what you are trying to do:

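df = df.sort_values(['Variable','Date'], ascending=[True,True])
# rolling(6, 1) means window=6, min_periods=1: it equals a cumulative mean only while each group has at most 6 rows
df['cummean'] = df.groupby(['Variable'])['Value'].transform(lambda x: x.rolling(6,1).mean())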

    Items Variable        Date  Value   cummean
0   Item1       V1  2020-12-16      5  5.000000
2   Item1       V1  2020-12-16      1  3.000000
4   Item1       V1  2020-12-17      1  2.333333
6   Item1       V1  2020-12-17      1  2.000000
8   Item1       V1  2020-12-18      1  1.800000
10  Item1       V1  2020-12-18      1  1.666667
1   Item2       V2  2020-12-16      5  5.000000
3   Item2       V2  2020-12-16      1  3.000000
5   Item2       V2  2020-12-17      1  2.333333
7   Item2       V2  2020-12-17      2  2.250000
9   Item2       V2  2020-12-18      1  2.000000
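
A fixed window of 6 only behaves like a cumulative mean while a group has no more than 6 rows. The sketch below uses a made-up 8-value series (not data from the question) to show where `rolling(6, min_periods=1)` and `expanding()` start to differ.

```python
import pandas as pd

# Hypothetical series with more rows than the rolling window of 6.
s = pd.Series([5, 1, 1, 1, 1, 1, 1, 1])

rolling_mean = s.rolling(6, min_periods=1).mean()   # mean of at most the last 6 values
expanding_mean = s.expanding().mean()               # mean of all values seen so far

comparison = pd.DataFrame({'rolling_6': rolling_mean, 'expanding': expanding_mean})
print(comparison)
```

The two columns agree for the first six rows. From the seventh row on, the rolling window no longer includes the leading 5, so it stops being a cumulative mean. In the answer's `transform`, swapping `x.rolling(6,1).mean()` for `x.expanding().mean()` would remove the dependence on group size.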

[Comments]:

But it gives duplicates.
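
One way to avoid the duplicate rows, sketched here as an alternative rather than as either answerer's method, is to aggregate to one row per date first and then accumulate. The running sum divided by the running count gives the mean over the current and all earlier dates. The names `daily`, `running` and `daily_mean` are illustrative.

```python
import pandas as pd

# Same data as in the question, written compactly.
df = pd.DataFrame({
    'Items': ['Item1', 'Item2'] * 5 + ['Item1'],
    'Variable': ['V1', 'V2'] * 5 + ['V1'],
    'Date': ['2020-12-16'] * 4 + ['2020-12-17'] * 4 + ['2020-12-18'] * 3,
    'Value': [5, 2, 5, 1, 1, 1, 1, 2, 1, 1, 1],
})

# One row per (Items, Variable, Date) holding that date's sum and row count.
daily = (df.groupby(['Items', 'Variable', 'Date'])['Value']
           .agg(['sum', 'count'])
           .sort_index())

# Accumulate sums and counts over dates within each (Items, Variable) group.
running = daily.groupby(level=['Items', 'Variable']).cumsum()

# Cumulative mean up to and including each date, one row per date.
daily_mean = (running['sum'] / running['count']).reset_index(name='Value')
print(daily_mean)
```

Because the frame is reduced to one row per date before accumulating, no de-duplication step is needed afterwards.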