Calculate cumulative product based on date by category

【Question title】: Calculate cumulative product based on date by category
【Posted】: 2013-05-06 21:09:27
【Question description】:

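I would like to add a new column to my data.table that contains the cumulative product of Data1 based on Date. The cumulative product should be calculated per category (Cat) and should start from the latest available Date.

Sample data: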

     library(data.table)
     DF = data.frame(Cat=rep(c("A","B"),each=4), Date=rep(c("01-08-2013","01-07-2013","01-04-2013","01-03-2013"),2), Data1=c(1:8))
     DF$Date = as.Date(DF$Date , "%m-%d-%Y")
     DT = data.table(DF)
     DT[ , Data1_cum:=NA_real_]
     DT

        Cat      Date Data1 Data1_cum
     1:  A 2013-01-08     1    NA
     2:  A 2013-01-07     2    NA
     3:  A 2013-01-04     3    NA
     4:  A 2013-01-03     4    NA
     5:  B 2013-01-08     5    NA
     6:  B 2013-01-07     6    NA
     7:  B 2013-01-04     7    NA
     8:  B 2013-01-03     8    NA

The result should look like this:

        Cat      Date Data1 Data1_cum
     1:  A 2013-01-08     1    1
     2:  A 2013-01-07     2    2
     3:  A 2013-01-04     3    6
     4:  A 2013-01-03     4    24
     5:  B 2013-01-08     5    5
     6:  B 2013-01-07     6    30
     7:  B 2013-01-04     7    210
     8:  B 2013-01-03     8    1680

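I found that I can do something similar with cumprod(), but I don't know how to handle the categories. NAs in Data1 should be ignored, i.e. treated as 1. The real dataset has around 8 million rows and 1000 categories.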

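【Comments】:

You say there are 8M entries with 1000 categories. That means roughly 8000 entries per category. Even if the minimum value is 2, the cumulative product would reach 2^8000, no? Wouldn't most of your values be Inf?
Yes, but fortunately the data consists mainly of NAs, and most of the numbers are smaller than 1.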
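
The overflow concern raised in this exchange can be checked directly. The following is a minimal base R sketch with made-up vectors (`big` and `small` are hypothetical names, not from the question); it only illustrates that doubles overflow to `Inf` near 1.8e308 and that products of values below 1 underflow to 0 instead.

```r
# Largest finite double on this machine (about 1.8e308)
.Machine$double.xmax

# 2^8000 is far beyond that limit
is.infinite(2^8000)

# Hypothetical data: 8000 values, all equal to 2
big <- cumprod(rep(2, 8000))
sum(is.infinite(big))      # how many entries have overflowed

# Values below 1 shrink toward 0 rather than overflowing
small <- cumprod(rep(0.5, 8000))
tail(small, 1)
```

If the products do drift toward 0 or Inf in the real data, summing logarithms and exponentiating at the end is a common workaround, though whether it is needed here depends on the actual values.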

Tags: r data.table

【Solution 1】:

If the only issue is the ordering:

DT[order(Date, decreasing=TRUE), Data1_cum := cumprod(Data1), by=Cat]
DT
   Cat       Date Data1 Data1_cum
1:   A 2013-01-08     1         1
2:   A 2013-01-07     2         2
3:   A 2013-01-04     3         6
4:   A 2013-01-03     4        24
5:   B 2013-01-08     5         5
6:   B 2013-01-07     6        30
7:   B 2013-01-04     7       210
8:   B 2013-01-03     8      1680
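
As a sanity check, here is a minimal sketch showing that putting `order()` in `i` makes the result independent of the physical row order. It rebuilds the sample data from the question; the name `shuffled` and the seed are arbitrary choices for illustration.

```r
library(data.table)

# Rebuild the sample data from the question
DT <- data.table(
  Cat   = rep(c("A", "B"), each = 4),
  Date  = as.Date(rep(c("01-08-2013", "01-07-2013", "01-04-2013", "01-03-2013"), 2),
                  "%m-%d-%Y"),
  Data1 = 1:8
)

# Shuffle the rows so the physical order no longer matches the date order
set.seed(42)
shuffled <- DT[sample(nrow(DT))]

# Rows are fed to cumprod() newest-first within each Cat,
# and the results are written back to the matching rows
shuffled[order(Date, decreasing = TRUE), Data1_cum := cumprod(Data1), by = Cat]

# Sort only for display
setorder(shuffled, Cat, -Date)
shuffled
```

After sorting, `Data1_cum` should match the expected table in the question, even though the rows were shuffled before the assignment.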

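However, if you have NAs to deal with, there are a few extra steps:

Note: if you shuffle the order of the rows, your results may differ. Be careful with how you implement the order(.) call.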

  ## Let's add some NA values
  DT <- rbind(DT, DT)
  DT[c(2, 6, 11, 15), Data1 := NA]

  # shuffle the rows, to make sure this is right
  set.seed(1)
  DT <- DT[sample(nrow(DT))]

Assigning the cumulative product:

Leaving the NAs as NA

## If you want to leave the NA's as NA's in the cum prod, use: 
DT[ , Data1_cum := NA_real_ ]
DT[ intersect(order(Date, decreasing=TRUE), which(!is.na(Data1))) 
      , Data1_cum := cumprod(Data1)
      , by=Cat]

# View the data, orderly
DT[order(Date, decreasing=TRUE)][order(Cat)]

     Cat       Date Data1 Data1_cum
  1:   A 2013-01-08     1         1
  2:   A 2013-01-08     1         1
  3:   A 2013-01-07     2         2
  4:   A 2013-01-07    NA        NA  <~~~~~~~  NA rows stay NA in Data1_cum
  5:   A 2013-01-04     3         6
  6:   A 2013-01-04    NA        NA  <~~~~~~~  NA rows stay NA in Data1_cum
  7:   A 2013-01-03     4        24
  8:   A 2013-01-03     4        96
  9:   B 2013-01-08     5         5
 10:   B 2013-01-08     5        25
 11:   B 2013-01-07     6       150
 12:   B 2013-01-07    NA        NA  <~~~~~~~  NA rows stay NA in Data1_cum
 13:   B 2013-01-04     7      1050
 14:   B 2013-01-04    NA        NA  <~~~~~~~  NA rows stay NA in Data1_cum
 15:   B 2013-01-03     8      8400
 16:   B 2013-01-03     8     67200

Treating the NAs as 1 (NA rows take the value of the previous row)

## If instead you want to treat the NA's as 1, use: 
# Rows reach j already sorted by the order() in i, so no second order() is needed
DT[order(Date, decreasing=TRUE), Data1_cum := {Data1[is.na(Data1)] <- 1;  cumprod(Data1)}, by=Cat]

# View the data, orderly
DT[order(Date, decreasing=TRUE)][order(Cat)]

    Cat       Date Data1 Data1_cum
 1:   A 2013-01-08     1         1
 2:   A 2013-01-08     1         1
 3:   A 2013-01-07     2         2
 4:   A 2013-01-07    NA         2   <~~~~~~~ Rows with NA took on values of the previous Row
 5:   A 2013-01-04     3         6
 6:   A 2013-01-04    NA         6   <~~~~~~~ Rows with NA took on values of the previous Row
 7:   A 2013-01-03     4        24
 8:   A 2013-01-03     4        96
 9:   B 2013-01-08     5         5
10:   B 2013-01-08     5        25
11:   B 2013-01-07     6       150
12:   B 2013-01-07    NA       150   <~~~~~~~ Rows with NA took on values of the previous Row
13:   B 2013-01-04     7      1050
14:   B 2013-01-04    NA      1050   <~~~~~~~ Rows with NA took on values of the previous Row
15:   B 2013-01-03     8      8400
16:   B 2013-01-03     8     67200
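
The same "treat NA as 1" idea can also be written without reassigning `Data1` inside `j`. The sketch below assumes a data.table version that provides `fcoalesce()` (1.12.4 or later) and uses a small made-up table rather than the shuffled data above.

```r
library(data.table)

# Small illustrative table (values are made up)
DT <- data.table(
  Cat   = c("A", "A", "A", "B", "B"),
  Date  = as.Date(c("2013-01-03", "2013-01-08", "2013-01-07",
                    "2013-01-08", "2013-01-07")),
  Data1 = c(4, 1, NA, 5, NA)
)

# fcoalesce() swaps NA for 1 only inside the expression;
# the Data1 column itself keeps its NAs
DT[order(Date, decreasing = TRUE),
   Data1_cum := cumprod(fcoalesce(Data1, 1)),
   by = Cat]

# Sort only for display
setorder(DT, Cat, -Date)
DT
```

Note that `fcoalesce()` expects its arguments to have the same type, so an integer `Data1` would need `1L` (or an `as.numeric()` conversion) instead of `1`.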

Alternatively, if you already have the cumulative product and just want to get rid of the NAs, you can do the following:

# fix the NA's with the previous value
# (a leading NA becomes 0, and only the first NA of a consecutive run is filled)
DT[order(Date, decreasing=TRUE),
      Data1_cum := {tmp <- c(0, head(Data1_cum, -1));
      Data1_cum[is.na(Data1_cum)] <- tmp[is.na(Data1_cum)];
      Data1_cum }
      , by=Cat ]
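
The shift-by-one approach fills an NA from the row directly before it, so a run of several consecutive NAs is only partly filled. As a possible alternative, the sketch below uses `nafill(type = "locf")`, assuming data.table 1.12.4 or later; the table is made up for illustration.

```r
library(data.table)

# Made-up cumulative products with gaps, already in newest-first order
DT <- data.table(
  Cat       = rep(c("A", "B"), each = 4),
  Date      = rep(as.Date(c("2013-01-08", "2013-01-07",
                            "2013-01-04", "2013-01-03")), 2),
  Data1_cum = c(1, NA, NA, 6, NA, 5, NA, 30)
)

# Carry the last observed value forward within each Cat, newest date first
DT[order(Date, decreasing = TRUE),
   Data1_cum := nafill(Data1_cum, type = "locf"),
   by = Cat]

DT
```

A leading NA (the first row of category B here) has nothing to carry forward and stays NA; if it should count as 1, a second `nafill(..., type = "const", fill = 1)` pass is one option.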

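【Discussion】:

Thanks Ricardo and Simon. I don't want NAs in the cum column, but I'm not sure yet which solution to choose for handling the NAs. Since Data1 is used several times in a similar way, I think replacing them up front might be more efficient.
@Cake, replacing them is fine. The issue is the ordering. With the toy sample data, where the column is already sorted, the results differ from those you get when the column is unordered.
Unfortunately, setting the key to date will not solve this, since it would sort in the reverse order.
Isn't the answer just: DT[is.na(Data1), Data1 := 1L][order(Date, decreasing=TRUE), Data1_cum := cumprod(Data1), by=Cat] or am I missing something?
@Arun, if you are willing to change the original data, then yes. I was working under the impression that the original data should be preserved.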
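
One way to reconcile the last two comments is to do the in-place NA replacement on a copy. This is only a sketch with a small made-up table; `work` is a hypothetical name, and it relies on `copy()` producing an independent table whose rows stay aligned with the original.

```r
library(data.table)

# Small illustrative table (values are made up)
DT <- data.table(
  Cat   = c("A", "A", "B", "B"),
  Date  = as.Date(c("2013-01-08", "2013-01-07", "2013-01-08", "2013-01-07")),
  Data1 = c(2L, NA, 3L, 4L)
)

# Work on a deep copy so := does not touch DT
work <- copy(DT)
work[is.na(Data1), Data1 := 1L]
work[order(Date, decreasing = TRUE), Data1_cum := cumprod(Data1), by = Cat]

# Bring back only the result; rows of work and DT are still in the same order
DT[, Data1_cum := work$Data1_cum]
DT
```

The trade-off is memory: with about 8 million rows the copy doubles the footprint for a moment, so replacing the NAs inside `j` may be preferable if that matters.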