Pandas: get mean value within relative time range by another value (answers)

【Question title】: Pandas get mean value within relative time range by another value
【Posted】: 2022-01-11 08:05:57
【Question】:

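I am working with a DataFrame that has a DatetimeIndex and two additional columns, A and B, and I am trying to produce an output DataFrame that answers a question like:

For each value of A, determine the mean of B over the 6 to 12 months after the earliest occurrence of that A.

I have been working with pd.Grouper and understand how to group a DatetimeIndex into buckets (for example, df.groupby(pd.Grouper(freq='M')).mean()), but it is not clear to me how to compute the mean over a period measured from the earliest observation of each value of A in the dataset.

The input DataFrame looks like this: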

import pandas as pd

data = {
    'A': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
    'B': [10, 32, 12, 13, 24, 32, 12, 72, 90],
    'created_on': [
        '2018-01-31', 
        '2019-02-25', 
        '2018-02-12', 
        '2019-05-31', 
        '2021-03-12', 
        '2020-04-23', 
        '2016-01-11', 
        '2016-05-02', 
        '2018-12-31',
    ]
}

df = pd.DataFrame(data)
df = df.set_index(pd.to_datetime(df['created_on']))
df.drop(['created_on'], axis=1, inplace=True)

This produces a DataFrame like the following:


+------------+---+----+
| created_on | A | B  |
+------------+---+----+
| 2018-01-31 | x | 10 |
| 2019-02-25 | x | 32 |
| 2019-05-31 | x | 13 |
| 2021-03-12 | y | 24 |
| 2016-05-02 | y | 72 |
| ...        | . | .. |
+------------+---+----+

The goal is an output with the following shape:

+---+----------------------------------------------+
| A | avg_B_6_12_months_after_earliest_observation |
+---+----------------------------------------------+
| x |                                         12.2 |
| y |                                         18.1 |
+---+----------------------------------------------+

The values in the avg_B_6_12_months_after_earliest_observation column above are only examples; they are not derived from the sample input DataFrame.

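【Comments】:

Please add an example of how the mean for x would be calculated.
Can you elaborate on "within 6-12 months for each A"? I don't understand how you get 12.2 and 18.1 for x and y.
@Chris Apologies, I have updated the question to clarify. The numbers I added were only example values; they are not related to the sample input.
OK, but I am still confused by the logic. Do you want the mean for each item in A, excluding its first occurrence, and only for rows whose month is between 6 and 12?
@Chris For each A in the data, I am looking for the mean of B between the 6th and the 12th month after that A was first observed in the data. If each A is a customer, it would read something like "customer A spent an average of B across all orders placed 6-12 months after their first order".

Tags: python pandas dataframe time-series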

【Solution 1】:

One idea is to use groupby.transform with min on the index values so that every row carries the date of the first occurrence of its element of A. You can then compare the index with that first-occurrence date plus 6 or 12 months. Use the comparison in loc to select the wanted rows, then groupby and mean.

import pandas as pd

# working dummy example
df = pd.DataFrame(
    {'A': ['x', 'x', 'x', 'x', 'y', 'y', 'y', 'y', 'y'],
     'B': [10, 32, 12, 13, 24, 32, 12, 72, 90],},
    index = pd.to_datetime(
        ['2018-01-31', '2018-02-25', '2018-08-12', 
         '2018-10-31', '2021-03-12', '2016-10-23', 
         '2016-01-11', '2016-05-02', '2016-12-31'])
)
print(df)

# helper series with aligned index and the earliest date per A
s = df.index.to_series().groupby(df['A'].to_numpy()).transform('min')
print(s)
# 2018-01-31   2018-01-31
# 2018-02-25   2018-01-31 # the first date of x is aligned with this row
# 2018-08-12   2018-01-31
# 2018-10-31   2018-01-31
# 2021-03-12   2016-01-11
# 2016-10-23   2016-01-11
# 2016-01-11   2016-01-11
# 2016-05-02   2016-01-11
# 2016-12-31   2016-01-11
# dtype: datetime64[ns]

Now you can get the result:

res = (
    # select rows dated 6 to 12 months after the first occurrence
    df.loc[(s.index>=s+pd.DateOffset(months=6)) 
          & (s.index<=s+pd.DateOffset(months=12))]
       # groupby and mean
      .groupby('A')['B'].mean()
      # cosmetic to fit expected output
      .rename('avg_B_6_12_months_after_earliest_observation')
      .reset_index()
)
print(res)
#    A  avg_B_6_12_months_after_earliest_observation
# 0  x                                          12.5
# 1  y                                          61.0
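The same steps can be wrapped in a small function. The following is only a sketch under stated assumptions: the helper name `mean_in_relative_window` and its parameters are hypothetical, and it takes the earliest date per group directly from the index. It also reindexes the result so that groups with no rows inside the window appear as NaN instead of disappearing, which is what happens with the question's original data.

```python
import pandas as pd


def mean_in_relative_window(df, key="A", value="B", start_months=6, end_months=12):
    """Mean of `value` per `key` over rows dated between `start_months` and
    `end_months` (inclusive) after that key's earliest row.
    Hypothetical helper, not part of pandas."""
    dates = df.index.to_series()
    # earliest date per key, repeated on every row of that key
    first_seen = dates.groupby(df[key].to_numpy()).transform("min")
    lower = first_seen + pd.DateOffset(months=start_months)
    upper = first_seen + pd.DateOffset(months=end_months)
    in_window = ((dates >= lower) & (dates <= upper)).to_numpy()
    means = df.loc[in_window].groupby(key)[value].mean()
    # keys without any row in the window would otherwise vanish; keep them as NaN
    all_keys = pd.Index(df[key].unique(), name=key)
    return means.reindex(all_keys)


# dummy data from this answer
dummy = pd.DataFrame(
    {"A": list("xxxxyyyyy"), "B": [10, 32, 12, 13, 24, 32, 12, 72, 90]},
    index=pd.to_datetime(
        ["2018-01-31", "2018-02-25", "2018-08-12",
         "2018-10-31", "2021-03-12", "2016-10-23",
         "2016-01-11", "2016-05-02", "2016-12-31"]),
)
dummy_result = mean_in_relative_window(dummy)
print(dummy_result)

# data from the question: no row falls 6-12 months after a first occurrence
question_df = pd.DataFrame(
    {"A": list("xxxxyyyyy"), "B": [10, 32, 12, 13, 24, 32, 12, 72, 90]},
    index=pd.to_datetime(
        ["2018-01-31", "2019-02-25", "2018-02-12",
         "2019-05-31", "2021-03-12", "2020-04-23",
         "2016-01-11", "2016-05-02", "2018-12-31"]),
)
question_result = mean_in_relative_window(question_df)
print(question_result)
```

On the dummy data this gives 12.5 for x and 61.0 for y. On the question's data both groups come out as NaN, because neither x nor y has an observation between 6 and 12 months after its first date.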

【Comments】:

【Solution 2】:

IIUC, you can define a custom function and apply it with pandas.DataFrame.groupby:

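# df is the DataFrame built in the question
def filter_mean(data):
    # drop the first observation of the group (the group is sorted by date)
    data = data.iloc[1:]
    # keep the dates whose calendar month is between 6 and 12, inclusive
    ind = data.index[data.index.month.to_series().between(6, 12)]
    return data.loc[ind, "B"].mean()

new_df = df.sort_index().groupby("A", as_index=False).apply(filter_mean)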
print(new_df)

Output:

   A   NaN
0  x   NaN
1  y  90.0

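Logic:

data.iloc[1:]: excludes the first observation from the calculation.
data.index[data.index.month.to_series().between(6, 12)]: keeps the index entries whose month is between 6 and 12 (inclusive).
df.sort_index().groupby: the data must be sorted by its index so that the excluded first observation really is the chronologically first one.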
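The filtering steps above can be checked in isolation. This is a minimal sketch using only customer y's rows from the question's data; the variable names are illustrative, and the mask is built with plain comparisons on index.month instead of to_series().between, which selects the same rows here. Note that the filter acts on the calendar month of each date, not on the time elapsed since the first observation.

```python
import pandas as pd

# rows of customer y from the question, sorted chronologically
y = pd.DataFrame(
    {"B": [24, 32, 12, 72, 90]},
    index=pd.to_datetime(
        ["2021-03-12", "2020-04-23", "2016-01-11", "2016-05-02", "2018-12-31"]),
).sort_index()

rest = y.iloc[1:]                       # drop the earliest observation (2016-01-11)
months = rest.index.month               # calendar month of each remaining row
mask = (months >= 6) & (months <= 12)   # June to December, inclusive
selected = rest.loc[mask, "B"]

print(list(months))
print(selected.mean())
```

Only the 2018-12-31 row passes the filter, so the mean for y is 90.0.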

Notes (based on the sample data):

Customer x made no purchases between June and December:

            A   B
created_on       
2018-01-31  x  10
2018-02-12  x  12
2019-02-25  x  32
2019-05-31  x  13

Customer y made only one purchase between June and December, on 2018-12-31:

            A   B
created_on       
2016-01-11  y  12
2016-05-02  y  72
2018-12-31  y  90 <<<
2020-04-23  y  32
2021-03-12  y  24

【Comments】: