R: using data.table to calculate a column dependent on previous rows

【Question title】: R using data.table to calculate a column dependent on previous rows
【Posted】: 2018-09-30 11:59:30
【Question】:

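I have multiple products, each with associated daily sales. I want to forecast the expected daily sales of these products based on each product's running cumulative sales and the total sales I expect over a period of time.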

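The first table ("key") contains the expected total sales for each product, along with how much I forecast will sell each day based on how much has already sold (i.e., if my cumulative sales for product A are 650, I have sold 43% of the 1500 total and therefore expect to sell 75 the next day, because 40% <= 43% < 60%).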
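The bracket lookup described above can be sketched in base R with `findInterval`. The vectors below copy product A's rows from `key` so the snippet runs on its own; the names `pct_breaks`, `forecast_by_bracket`, and `cum_sales` are illustrative and not part of the original post.

```r
# Product A's rows of `key`, copied here so the snippet is self-contained
total_sales <- 1500
pct_breaks <- c(0, 0.2, 0.4, 0.6, 0.8, 1)   # bracket boundaries as shares of total_sales
forecast_by_bracket <- c(125, 100, 75, 50, 25)

cum_sales <- 650
pct_sold <- cum_sales / total_sales          # about 0.433

# findInterval returns the index of the bracket whose lower bound is <= pct_sold
bracket <- findInterval(pct_sold, pct_breaks)
next_day_forecast <- forecast_by_bracket[bracket]
```

Here `bracket` is 3 (the 40% to 60% bracket), so `next_day_forecast` is 75, matching the example in the question.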

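I want to update the cumulative sales of each product in the second table ("data") using the forecast sales. The forecast amount depends on the previous period's cumulative sales, which means I cannot compute each column independently, so I think I need to use a loop.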

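However, my database has over 500,000 rows, and my best attempt with a for loop is too slow to be practical. Thoughts? I think an Rcpp implementation might be a potential solution, but I have not used that package or C++ before. The desired final answer is shown below ("final").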

library(data.table)
key <- data.table(Product = c(rep("A",5), rep("B",5)), TotalSales = 
c(rep(1500,5),rep(750,5)), Percent = rep(seq(0.2, 1, 0.2),2), Forecast = 
c(seq(125, 25, -25), seq(75, 15, -15)))

data <- data.table(Date = rep(seq(1, 9, 1), 2), Product=rep(c("A", "B"), 
each=9L), Time = rep(c(rep("Past",4), rep("Future",5)),2), Sales = c(190, 
165, 133, 120, 0, 0, 0, 0, 0, 72, 58, 63, 51, 0, 0, 0, 0, 0))

final <- data.table(data, Cum = c(190, 355, 488, 608, 683, 758, 833, 908, 
958, 72, 130, 193, 244, 304, 349, 394, 439, 484), Percent.Actual = c(0.13, 
0.24, 0.33, 0.41, 0.46, 0.51, 0.56, 0.61, 0.64, 0.10, 0.17, 0.26, 0.33, 
0.41, 0.47, 0.53, 0.59, 0.65), Forecast = c(0, 0, 0, 0, 75, 75, 75, 75, 50, 
0, 0, 0, 0, 60, 45, 45, 45, 45))

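【Comments】:

Why do the Cum values restart from row 10?
Please edit to avoid the "wall of text" impression.
I think your data and final tables are missing the Product column that would answer @MKR's question.
The "final" table is merged onto the previously built "data" table, which has a Product column. The "Cum" column resets for each product.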
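As a small illustration of the per-product reset mentioned in the last comment, a grouped running total can be sketched in base R with `ave`. The vectors hold only the "Past" rows of `data`; the names `sales`, `product`, and `cum` are illustrative.

```r
# Past sales for products A and B, taken from the `data` table in the question
sales <- c(190, 165, 133, 120, 72, 58, 63, 51)
product <- rep(c("A", "B"), each = 4)

# cumsum is applied within each product, so the running total restarts at B
cum <- ave(sales, product, FUN = cumsum)
```

The result is 190, 355, 488, 608 for A and 72, 130, 193, 244 for B, the same values as the first four rows of each product in `final`.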

Tags: r performance for-loop data.table rcpp

【Solution 1】:

Not sure whether this will really help with your actual dataset, given its size.

library(data.table)

#convert key into a list for fast lookup
keyLs <- lapply(split(key, by="Product"), 
    function(x) list(TotalSales=x[,TotalSales[1L]], 
                     Percent=x[,Percent], 
                     Forecast=x[,Forecast]))

#for each product, use Reduce to roll cumulative sales forward after finding the forecasted sales
futureSales <- data[, {
        byChar <- as.character(.BY)
        list(Date=Date[Time=="Future"], 
            Cum=Reduce(function(x, y) {
                pct <- x / keyLs[[byChar]]$TotalSales
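                # all.inside=TRUE keeps the index within 1..length(Forecast); without it pct == 1 indexes past the end and yields NA
                res <- x + keyLs[[byChar]]$Forecast[findInterval(pct, c(0, keyLs[[byChar]]$Percent), all.inside=TRUE)]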
                if (res >= keyLs[[byChar]]$TotalSales) return(keyLs[[byChar]]$TotalSales)
                res
            },
            x=rep(0L, sum(Time=="Future")),
            init=sum(Sales[Time=="Past"]),
            accumulate=TRUE)[-1])
    },
    by=.(Product)]
futureSales 

#calculate other sales stats
futureSales[data, on=.(Date, Product)][,
    Cum := ifelse(is.na(Cum), cumsum(Sales), Cum),
    by=.(Product)][,
        ':=' (
            Percent.Actual = Cum / keyLs[[as.character(.BY)]]$TotalSales,
            Forecast = ifelse(Sales > 0, 0, c(0, diff(Cum)))
        ), by=.(Product)][]
#     Product Date Cum   Time Sales Percent.Actual Forecast
#  1:       A    1 190   Past   190      0.1266667        0
#  2:       A    2 355   Past   165      0.2366667        0
#  3:       A    3 488   Past   133      0.3253333        0
#  4:       A    4 608   Past   120      0.4053333        0
#  5:       A    5 683 Future     0      0.4553333       75
#  6:       A    6 758 Future     0      0.5053333       75
#  7:       A    7 833 Future     0      0.5553333       75
#  8:       A    8 908 Future     0      0.6053333       75
#  9:       A    9 958 Future     0      0.6386667       50
# 10:       B    1  72   Past    72      0.0960000        0
# 11:       B    2 130   Past    58      0.1733333        0
# 12:       B    3 193   Past    63      0.2573333        0
# 13:       B    4 244   Past    51      0.3253333        0
# 14:       B    5 304 Future     0      0.4053333       60
# 15:       B    6 349 Future     0      0.4653333       45
# 16:       B    7 394 Future     0      0.5253333       45
# 17:       B    8 439 Future     0      0.5853333       45
# 18:       B    9 484 Future     0      0.6453333       45

You may also want to consider running the calculation in parallel by product.
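One way to run the work per product in parallel is sketched below with base R's `parallel` package. This is only an illustration: `per_product` is a hypothetical stand-in for the real forecast step, the toy `sales` data frame replaces the full `data` table, and the core count is an arbitrary assumption. Whether it actually beats the single-threaded version depends on how many products there are and how expensive each one is.

```r
library(parallel)

# Hypothetical stand-in for the per-product work; a real version would run the forecast
per_product <- function(df) {
  df$Cum <- cumsum(df$Sales)
  df
}

sales <- data.frame(
  Product = rep(c("A", "B"), each = 4),
  Sales = c(190, 165, 133, 120, 72, 58, 63, 51)
)

# mclapply relies on forking, which is unavailable on Windows, so use one core there
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One list element per product, each processed independently
pieces <- mclapply(split(sales, sales$Product), per_product, mc.cores = n_cores)
result <- do.call(rbind, pieces)
```

Because each product's forecast depends only on that product's own rows, the pieces can be computed independently and bound back together at the end.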

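【Comments】:

This answer is a significant speed improvement, thank you! However, I would benefit a lot if it could be faster still. Also, when applying the modified script to my full dataset, I ran into a peculiar problem: once Cum reaches the TotalSales limit, the next day's Forecast is a negative number equal to the total forecast sales of all previous Future time steps. Thoughts?
How would you handle cumulative sales for A exceeding 1500? It is not mentioned in the OP.
Good question. Let's assume that, as a small company selling unique items, we can only sell up to the forecast maximum before the available product runs out. So once we have sold 1500 units of product A, the forecast becomes 0.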
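The sell-out rule described in the last comment can be sketched as a plain base R loop for a single product. This is not the answer's code: `forecast_capped` and its argument names are hypothetical. It assumes the forecast on any day is limited to the remaining stock, so it drops to 0 once `total_sales` is reached and never goes negative.

```r
# Hypothetical helper: roll the forecast forward one day at a time for one product
forecast_capped <- function(start_cum, n_days, total_sales, pct_upper, forecast) {
  cum <- numeric(n_days)
  fc <- numeric(n_days)
  prev <- start_cum
  for (i in seq_len(n_days)) {
    # Bracket of the share sold so far; all.inside = TRUE keeps the index
    # within 1..length(forecast) even when the share reaches 100%
    idx <- findInterval(prev / total_sales, c(0, pct_upper), all.inside = TRUE)
    # Never forecast more than the remaining stock, and never a negative amount
    fc[i] <- max(0, min(forecast[idx], total_sales - prev))
    prev <- prev + fc[i]
    cum[i] <- prev
  }
  data.frame(Forecast = fc, Cum = cum)
}

# Product A's key values
pct_upper <- seq(0.2, 1, 0.2)
forecast <- seq(125, 25, -25)

# Starting from the question's past total of 608, for the five Future days
res_sample <- forecast_capped(608, 5, 1500, pct_upper, forecast)

# Starting close to the 1500 limit, to exercise the cap
res_cap <- forecast_capped(1450, 4, 1500, pct_upper, forecast)
```

With the question's data, `res_sample$Cum` is 683, 758, 833, 908, 958, matching `final`. Near the limit, `res_cap$Forecast` is 25, 25, 0, 0 and `res_cap$Cum` stays at 1500 once it is reached.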