【Posted】:2021-01-31 17:58:13
【Question】:
I have a simple data.table as follows:
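library(data.table)
# 4 groups of 1000 rows each; the 20-element val pattern is recycled to fill the table
ID = c(rep("A", 1000), rep("B", 1000), rep("C", 1000), rep("D", 1000))
val = c("a", "a", "a", "b", "b", "c", "c","d","d","d","d","e","e","f","f","g","g","g","g","g")
dt = data.table(ID, val)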
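I want to add a new column to this data.table that lags val within each ID group. The lag is by run rather than by row: every row in a run of identical values should get the value of the previous run.
This is the expected output: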
> head(dt, 20)
    ID val val_lag
 1:  A   a    <NA>
 2:  A   a    <NA>
 3:  A   a    <NA>
 4:  A   b       a
 5:  A   b       a
 6:  A   c       b
 7:  A   c       b
 8:  A   d       c
 9:  A   d       c
10:  A   d       c
11:  A   d       c
12:  A   e       d
13:  A   e       d
14:  A   f       e
15:  A   f       e
16:  A   g       f
17:  A   g       f
18:  A   g       f
19:  A   g       f
20:  A   g       f
The solution I am currently using is:
dt[, val_lag := with(rle(val), rep(c(NA, head(values, -1)), lengths)), by = ID]
However, this solution is very slow on my real dataset, which is very large, with millions of rows. Is there a faster way to do this?
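To make clear what the rle-based expression does, here is a minimal sketch of the same steps on one short vector, using base R only. The intermediate names `r` and `prev` are introduced just for illustration and do not appear in the one-liner.

```r
# One group's worth of values, shortened for readability
val <- c("a", "a", "a", "b", "b", "c")

# Run-length encoding: values a, b, c with lengths 3, 2, 1
r <- rle(val)

# Shift the run values by one run: the first run has no predecessor, so NA
prev <- c(NA, head(r$values, -1))

# Expand back to row level by repeating each lagged value over its run length
val_lag <- rep(prev, r$lengths)

print(val_lag)
# [1] NA  NA  NA  "a" "a" "b"
```

Inside `dt[, ..., by = ID]` these steps run once per group, so the lag never crosses an ID boundary.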
Here are the performance results for all the approaches discussed in this thread:
microbenchmark::microbenchmark(
  rles = dt[, val_lag1 := with(rle(val), rep(c(NA, head(values, -1)), lengths)), by = ID],
  chinsoon = dt[, val_lag := shift(val)[nafill(replace(seq.int(.N), rowid(rleid(val)) > 1L, NA_integer_), "locf")], by = ID],
  TiC = dt[, val_lag3 := c(NA,rle(val)$values)[cumsum(c(0,head(val,-1)!=tail(val,-1)))+1], by = ID],
  times = 1000
)
Unit: milliseconds
     expr      min       lq     mean   median       uq      max neval cld
     rles 1.549548 1.781014 2.750187 2.096805 2.743668 46.65326  1000  a
 chinsoon 1.766827 2.060233 3.059109 2.379477 3.077080 67.16040  1000  a
      TiC 1.986808 2.226933 3.472451 2.624236 3.397165 60.67802  1000   b
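Timings are only comparable if the three expressions return the same column. The following is a small sanity check, assuming a reduced two-group table built here for illustration. The column name `val_lag2` for the chinsoon variant is my own choice so that the three results do not overwrite each other.

```r
library(data.table)

# Reduced example: two ID groups, each with the same 20-value pattern
pattern <- c("a", "a", "a", "b", "b", "c", "c", "d", "d", "d", "d",
             "e", "e", "f", "f", "g", "g", "g", "g", "g")
dt <- data.table(ID = rep(c("A", "B"), each = 20), val = rep(pattern, times = 2))

# rle-based version from the question
dt[, val_lag1 := with(rle(val), rep(c(NA, head(values, -1)), lengths)), by = ID]

# chinsoon: keep only the row index of each run's first row, carry it forward,
# and use it to index into the row-wise lag
dt[, val_lag2 := shift(val)[nafill(replace(seq.int(.N), rowid(rleid(val)) > 1L, NA_integer_), "locf")], by = ID]

# TiC: count value changes cumulatively to get a run number for each row
dt[, val_lag3 := c(NA, rle(val)$values)[cumsum(c(0, head(val, -1) != tail(val, -1))) + 1], by = ID]

# All three columns should be identical
stopifnot(identical(dt$val_lag1, dt$val_lag2),
          identical(dt$val_lag1, dt$val_lag3))
```

If either check fails on real data, the benchmark is comparing different computations and the timings should not be read side by side.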
Thanks!
【Comments】:
Tags: r data.table