Allocate along members of a group in data.table

【Question title】: Allocate along members of a group in data.table
【Posted】: 2016-04-19 05:56:59
【Question】:

I have a demand table that looks like this:

library(data.table)
set.seed(1)
DTd <- data.table(loc="L1", product="P1", cust=c("C1","C2","C3"), period=c("per1","per2","per3","per4"), qty=runif(12,min=0,max=100), key=c("loc","product","cust","period"))
DTd[]
#   loc product cust period      qty
#1:  L1      P1   C1   per1 12.97134
#2:  L1      P1   C1   per2 65.37663
#3:  L1      P1   C1   per3 34.21633
#4:  L1      P1   C1   per4 24.23550
#5:  L1      P1   C2   per1 85.68853
#6:  L1      P1   C2   per2 98.22407
#7:  L1      P1   C2   per3 92.24086
#8:  L1      P1   C2   per4 70.62672
#9:  L1      P1   C3   per1 62.12432
#10:  L1      P1   C3   per2 84.08788
#11:  L1      P1   C3   per3 82.67184
#12:  L1      P1   C3   per4 53.63538

And a supply table that looks like this:

DTs <- data.table(loc="L1", product="P1", period=c("per1","per2","per3","per4"), qty=runif(4,min=0,max=200), key=c("loc","product","period"))
DTs[]
#   loc product period       qty
#1:  L1      P1   per1   9.23293
#2:  L1      P1   per2  74.03622
#3:  L1      P1   per3 133.54770
#4:  L1      P1   per4 123.43913

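I need to allocate the supply to the corresponding demand in priority order and add an "alloc" column to the demand table. For the purposes of this example, assume the priority is smallest demand first.

This is the result I am looking for: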

#   loc product cust period      qty     alloc
#1:  L1      P1   C1   per1 12.97134  9.232930
#2:  L1      P1   C1   per2 65.37663 65.376625
#3:  L1      P1   C1   per3 34.21633 34.216329
#4:  L1      P1   C1   per4 24.23550 24.235499
#5:  L1      P1   C2   per1 85.68853  0.000000
#6:  L1      P1   C2   per2 98.22407  0.000000
#7:  L1      P1   C2   per3 92.24086 16.659531
#8:  L1      P1   C2   per4 70.62672 45.568249
#9:  L1      P1   C3   per1 62.12432  0.000000
#10:  L1      P1   C3   per2 84.08788  8.659591
#11:  L1      P1   C3   per3 82.67184 82.671841
#12:  L1      P1   C3   per4 53.63538 53.635379
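
To make the smallest-demand-first rule concrete, here is a minimal base R sketch that reproduces the per2 rows of the table above from the per2 figures in the two input tables. The variable names (`supply`, `demand`, `before`, `alloc`) are illustrative only and are not part of the original post.

```r
# Period per2, figures copied from the demand and supply tables above.
supply <- 74.03622
demand <- c(C1 = 65.37663, C2 = 98.22407, C3 = 84.08788)

# Smallest demand first: serve customers in ascending order of demand.
sorted <- demand[order(demand)]

# Supply still available just before each customer is served, floored at zero.
before <- pmax(supply - c(0, cumsum(sorted)[-length(sorted)]), 0)

# Each customer receives the smaller of its demand and what is left.
alloc <- setNames(pmin(unname(sorted), before), names(sorted))

# Back to the original customer order: C1 is filled in full,
# C3 gets the remainder, and C2 gets nothing.
alloc <- alloc[names(demand)]
print(round(alloc, 5))
```

The allocations sum to the period's supply, which is a quick sanity check when demand exceeds supply.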

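I can't see a way to do this efficiently using data.table's features. I seem to be reduced to looping over the rows and updating them one at a time with set. This is the code I used for this example: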

#set key on demand to match supply and order by the qty (for prioritising)
setkey(DTd, loc, product, period, qty)
#add a column for the allocated quantity
DTd[,alloc:=0]
#loop through the rows of the supply, using the row number
for (s in DTs[, .I]) {
    key <- DTs[s, .(loc, product, period)]
    suppqty <- DTs[s, qty]
    #loop through the corresponding demand and return the row number
    for (d in DTd[key, which=TRUE]) {
        if (suppqty == 0) break
        #determine the quantity to allocate from the demand row
        allocqty <- DTd[d, ifelse(qty < suppqty, qty, suppqty)]
        #update the alloc qty on this row
        set(DTd, i = d, j = "alloc", value = allocqty)
        #reduce the amount outstanding
        suppqty <- suppqty - allocqty
    }
}
#restore the original keys
setkey(DTd, loc, product, cust, period)

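Any suggestions for a better way to do any part of this would be much appreciated. (In reality the tables are very large and the priority rules can be quite complex, but in that case I would run one pass first to determine the priority and then use it in the allocation pass.)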

【Comments】:

FYI, ifelse(qty < suppqty, qty, suppqty) can be computed as pmin(suppqty, qty).
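
A quick way to check that equivalence is to run both expressions on a small vector. The values below are hypothetical and chosen to cover demand below, above, and equal to the remaining supply.

```r
# Hypothetical values: three demand quantities and one remaining supply figure.
qty <- c(5, 20, 12)
suppqty <- 12

# Element-wise conditional, as written in the question's loop.
via_ifelse <- ifelse(qty < suppqty, qty, suppqty)

# Parallel minimum; suppqty is recycled across qty.
via_pmin <- pmin(suppqty, qty)

print(via_ifelse)
print(via_pmin)
print(isTRUE(all.equal(via_ifelse, via_pmin)))
```

Both give 5, 12, 12 here. pmin states the intent (take the smaller of the two) directly and avoids building an intermediate logical vector.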

Tags: r data.table

【Answer 1】:

You can do this:

setnames(DTs, "qty", "suppqty")
setnames(DTd, "qty", "demqty")
setorder(DTd, loc, product, period, demqty) # put your priority column last here

DTd[DTs, alloc := {
  resid_supply = shift(pmax(suppqty - cumsum(demqty), 0), fill=suppqty[1L])
  pmin(demqty, resid_supply)
}, by=.EACHI, on=c("loc", "product", "period")]

which gives

    loc product cust period    demqty     alloc
 1:  L1      P1   C2   per1 20.168193 20.168193
 2:  L1      P1   C1   per1 26.550866 26.550866
 3:  L1      P1   C3   per1 62.911404 62.911404
 4:  L1      P1   C1   per2  6.178627  6.178627
 5:  L1      P1   C2   per2 37.212390 37.212390
 6:  L1      P1   C3   per2 89.838968 33.429727
 7:  L1      P1   C2   per3 20.597457 20.597457
 8:  L1      P1   C3   per3 57.285336 57.285336
 9:  L1      P1   C1   per3 94.467527 76.085490
10:  L1      P1   C3   per4 17.655675 17.655675
11:  L1      P1   C2   per4 66.079779 66.079779
12:  L1      P1   C1   per4 90.820779 15.804394
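
To see what the expression inside the braces does for a single period, it may help to evaluate it step by step outside the join. The sketch below uses hypothetical numbers (100 units of supply and three demands already sorted by priority); `left_after` is an illustrative intermediate name that does not appear in the answer.

```r
library(data.table)

# Hypothetical single period: one supply figure, demands in priority order.
suppqty <- 100
demqty <- c(30, 50, 40)

# Supply left AFTER each demand row has been served, floored at zero.
left_after <- pmax(suppqty - cumsum(demqty), 0)        # 70 20 0

# Lag by one row so that each row sees the supply left BEFORE it is served;
# the first row sees the full supply.
resid_supply <- shift(left_after, fill = suppqty[1L])  # 100 70 20

# Allocate the smaller of the demand and the residual supply.
alloc <- pmin(demqty, resid_supply)                    # 30 50 20
print(alloc)
```

With by=.EACHI, the same computation runs once per row of the supply table, over the demand rows that match that row.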

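These days you usually don't need to set keys before merging, as Arun, one of the package authors, notes in this SO post:

In most cases, therefore, there is no longer any need to set keys. We recommend using on= wherever possible, unless setting a key gives a dramatic performance improvement that you'd like to exploit.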
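
As a small illustration of a join with on= and no key, here is a sketch using two hypothetical tables (`demand` and `supply` with made-up values). It copies the matching supply quantity onto each demand row.

```r
library(data.table)

# Hypothetical tables; neither has a key set.
demand <- data.table(period = c("per1", "per1", "per2"),
                     cust   = c("C1", "C2", "C1"),
                     demqty = c(10, 20, 30))
supply <- data.table(period  = c("per1", "per2"),
                     suppqty = c(25, 15))

# Ad hoc update join: match on period and add supply's suppqty to demand by reference.
# The i. prefix refers to the column from the table in the i position (supply).
demand[supply, suppqty := i.suppqty, on = "period"]

print(demand)
print(key(demand))  # NULL: the join did not require or set a key
```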

For a similar calculation (purchasing at the lowest price first), you can see my other answer.
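
The question mentions that real priority rules may be more complex and would be computed in a separate pass. One way that could plug into the approach above is sketched below. The `tier` column, the rule itself (tier first, then smallest demand), and the `priority` column are all hypothetical and are not from the original post.

```r
library(data.table)

# Hypothetical demand and supply for a single period.
dem <- data.table(period = "per1",
                  cust   = c("C1", "C2", "C3"),
                  tier   = c(2L, 1L, 1L),   # hypothetical: 1 = served first
                  demqty = c(10, 40, 30))
sup <- data.table(period = "per1", suppqty = 50)

# Pass 1: derive a priority rank within each period.
# Assumed rule: lower tier first, then smallest demand first.
setorder(dem, period, tier, demqty)
dem[, priority := seq_len(.N), by = period]

# Pass 2: allocate in priority order (priority column last in setorder).
setorder(dem, period, priority)
dem[sup, alloc := {
  resid_supply = shift(pmax(suppqty - cumsum(demqty), 0), fill = suppqty[1L])
  pmin(demqty, resid_supply)
}, by = .EACHI, on = "period"]

print(dem)
```

Keeping the priority rule in its own pass means the allocation step does not change when the rule does. Only the column passed last to setorder needs to reflect the new ranking.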

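【Comments】:

Brilliant! Thanks @Frank. This taught me two new things: 1. shift (I had written my own version of it), and 2. using multiple statements inside {} in an assignment. I will post some performance comparisons once I have applied this to the larger tables.
Not an exact comparison, but for a table of 9723 demands by 13 periods (120383 rows in melted format): the for loop with set took 77.3 seconds of user time, and @Frank's construct took 4.64 seconds of user time. More than 16 times faster.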