Calculate running average based on date range: answers

【Question title】: Calculate running average based on date range
【Posted】: 2021-12-29 05:16:29
【Question description】:

I have a dataset containing a client ID, the date on which the client ordered an item, and the invoice amount. A reproducible example follows:

client_id_ex<-c("0001","0001","0001","0001","0002","0002","0002","0002","0002","0002","0002")
order_date_ex<-as.Date(c("12-05-2000","02-01-2001","11-11-2020","03-05-2021","12-05-2000","16-05-2000","12-06-2000","13-08-2000","19-05-2004","12-09-2007","08-12-2008"),format="%d-%m-%Y")
invoice_ex<-c(450,100,200,330,543,665,334,753,234,541,1000)
df<-data.frame(client_id_ex,order_date_ex,invoice_ex)

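I want to calculate a moving average of the invoices separately for each client, where each order's average covers only that client's orders placed no more than 5 years before it.

The result should look like this: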

client_id_ex   order_date_ex   invoice_ex   avg_invoice_5
1              12.05.2000      450          450
1              02.01.2001      100          275
1              11.11.2020      200          200
1              03.05.2021      330          265
2              12.05.2000      543          543
2              16.05.2000      665          604
2              12.06.2000      334          514
2              13.08.2000      753          574
2              19.05.2004      234          506
2              12.09.2007      541          388
2              08.12.2008      999          591

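Does anyone know how to do this? I tried the approach from "Calculate average based on date range in R", but since I need something more like a moving average, computed separately for each client, I did not get much out of that example.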

【Comments on the question】:

Tags: r date moving-average

【Solution 1】:

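Here is one approach using the tidyverse. It uses purrr::map_dbl to compute, for each client and each order date, the mean of the invoices dated between five years earlier (5 * 365.25 days) and that date.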

library(tidyverse)

df %>%
    group_by(client_id_ex) %>% 
    mutate(roll_mean = map_dbl(order_date_ex, 
                               ~mean(invoice_ex[(order_date_ex >= (. - 5 * 365.25)) & 
                                                  (order_date_ex <= .)])))
# A tibble: 11 x 4
# Groups:   client_id_ex [2]
   client_id_ex order_date_ex invoice_ex roll_mean
   <chr>        <date>             <dbl>     <dbl>
 1 0001         2000-05-12           450      450 
 2 0001         2001-01-02           100      275 
 3 0001         2020-11-11           200      200 
 4 0001         2021-05-03           330      265 
 5 0002         2000-05-12           543      543 
 6 0002         2000-05-16           665      604 
 7 0002         2000-06-12           334      514 
 8 0002         2000-08-13           753      574.
 9 0002         2004-05-19           234      506.
10 0002         2007-09-12           541      388.
11 0002         2008-12-08          1000      592.
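
The window above approximates five years as 5 * 365.25 days. As a minimal sketch, assuming exact calendar years are wanted instead, the same calculation can be written in base R with seq() on a Date stepping back by "-5 years". The helper names five_years_before and window_mean are hypothetical and exist only for this illustration; the data is the example from the question.

```r
# Rebuild the example data from the question.
client_id_ex <- c("0001", "0001", "0001", "0001",
                  "0002", "0002", "0002", "0002", "0002", "0002", "0002")
order_date_ex <- as.Date(c("12-05-2000", "02-01-2001", "11-11-2020", "03-05-2021",
                           "12-05-2000", "16-05-2000", "12-06-2000", "13-08-2000",
                           "19-05-2004", "12-09-2007", "08-12-2008"),
                         format = "%d-%m-%Y")
invoice_ex <- c(450, 100, 200, 330, 543, 665, 334, 753, 234, 541, 1000)
df <- data.frame(client_id_ex, order_date_ex, invoice_ex)

# Hypothetical helper: the date five calendar years before `d`.
five_years_before <- function(d) seq(d, by = "-5 years", length.out = 2)[2]

# Hypothetical helper: for each order of one client, the mean of the amounts
# dated from five calendar years earlier up to and including that order.
window_mean <- function(dates, amounts) {
  vapply(seq_along(dates), function(i) {
    start <- five_years_before(dates[i])
    mean(amounts[dates >= start & dates <= dates[i]])
  }, numeric(1))
}

# Apply the helper client by client.
df$roll_mean <- NA_real_
for (id in unique(df$client_id_ex)) {
  rows <- df$client_id_ex == id
  df$roll_mean[rows] <- window_mean(df$order_date_ex[rows], df$invoice_ex[rows])
}
print(df)
```

On this data the calendar-year window and the 5 * 365.25-day window select the same orders, so the two versions should agree; they could differ only for orders that fall within a day or so of the window boundary.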

【Discussion】:

Thank you very much!

【Solution 2】:

I think you are after a cumulative mean rather than a rolling mean.

Here is one option:

df %>%
    group_by(client_id_ex) %>%
    mutate(grp = cumsum(c(TRUE, (diff(order_date_ex) > 5 * 365)))) %>%
    group_by(client_id_ex, grp) %>%
    mutate(avg_invoice_5 = cummean(invoice_ex)) %>%
    ungroup() %>%
    select(-grp)
## A tibble: 11 x 4
#  client_id_ex order_date_ex invoice_ex avg_invoice_5
#  <chr>        <date>             <dbl>         <dbl>
# 1 0001         2000-05-12           450          450 
# 2 0001         2001-01-02           100          275 
# 3 0001         2020-11-11           200          200 
# 4 0001         2021-05-03           330          265 
# 5 0002         2000-05-12           543          543 
# 6 0002         2000-05-16           665          604 
# 7 0002         2000-06-12           334          514 
# 8 0002         2000-08-13           753          574.
# 9 0002         2004-05-19           234          506.
#10 0002         2007-09-12           541          512.
#11 0002         2008-12-08          1000          581.

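I admit I don't understand (and can't reproduce) the last two rows of your expected output. I think that is an error? For client_id_ex = 0002, every invoice date is within 5 years of the previous one.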
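
One possible explanation for the mismatch, sketched in base R under the assumption that the question means a window covering the 5 years before each order: the grouping on consecutive gaps never splits client 0002, so the cumulative mean keeps averaging in the orders from 2000, whereas a 5-year window drops them for the 2007 and 2008 orders. The column names cum_mean, win_mean and n_in_window are illustrative only.

```r
# Orders of client 0002 from the question.
dates <- as.Date(c("2000-05-12", "2000-05-16", "2000-06-12", "2000-08-13",
                   "2004-05-19", "2007-09-12", "2008-12-08"))
amounts <- c(543, 665, 334, 753, 234, 541, 1000)

# Cumulative mean: every earlier order counts, however old it is.
cum_mean <- cumsum(amounts) / seq_along(amounts)

# Windowed mean: only orders dated within 5 * 365.25 days before each order
# (the order itself included).
win_mean <- sapply(seq_along(dates), function(i) {
  mean(amounts[dates >= dates[i] - 5 * 365.25 & dates <= dates[i]])
})

# How many orders fall inside each window.
n_in_window <- sapply(seq_along(dates), function(i) {
  sum(dates >= dates[i] - 5 * 365.25 & dates <= dates[i])
})

print(data.frame(dates, amounts, cum_mean, win_mean, n_in_window))
```

The two columns agree for the first five orders and diverge for the last two, where the window holds only 2 and 3 orders: the windowed means are 387.5 and about 591.67, against cumulative means of about 511.67 and 581.43.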

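【Discussion】:

I don't understand the difference in the last two rows either, given that I also used a cumulative mean, although I don't actually know how it differs from a simple rolling mean.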