【Question title】: data.table equivalent of tidyr::complete with group_by
【Posted】: 2018-04-16 03:46:37
【Question】:

I have the following data frame:

library(tidyverse)
df <- tibble(  # tibble() replaces the deprecated data_frame()
  id = c(1, 1, 2, 2), 
  date1 = as.Date(c("2013-01-01", "2013-02-01", "2015-04-01", "2015-05-01")), 
  date2 = as.Date(c("2012-12-09", "2012-12-09", "2015-03-10", "2015-03-10"))
)

# A tibble: 4 x 3
     id      date1      date2
  <dbl>     <date>     <date>
1     1 2013-01-01 2012-12-09
2     1 2013-02-01 2012-12-09
3     2 2015-04-01 2015-03-10
4     2 2015-05-01 2015-03-10

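I want to complete this data frame so that each id gets one additional date1 value, computed as the month following its last date1. There is also a date2 value, which is constant within each id. With tidyr::complete, this can be done as follows: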

df %>% 
  group_by(id) %>% 
  complete(date1 = seq.Date(from = min(date1), length.out = 3, by = "month"), date2 = date2[1])

# A tibble: 6 x 3
# Groups:   id [2]
     id      date1      date2
  <dbl>     <date>     <date>
1     1 2013-01-01 2012-12-09
2     1 2013-02-01 2012-12-09
3     1 2013-03-01 2012-12-09
4     2 2015-04-01 2015-03-10
5     2 2015-05-01 2015-03-10
6     2 2015-06-01 2015-03-10

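Because my original data has about 150K groups, the tidyr solution takes more than an hour to finish. I assume data.table would be faster. Can the same thing be done with data.table?

A similar question was asked in "data.table equivalent of tidyr::complete()", but without the group_by clause.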

Tags: r data.table


【Answer 1】:

Based on some initial benchmarking, the data.table approach appears to be considerably faster:

library(data.table)
setDT(df)[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id]

Benchmark

df <- tibble(  # tibble() replaces the deprecated data_frame()
  id = rep(1:3000, each = 2),
  date1 = rep(as.Date(c("2013-01-01", "2013-02-01", "2015-04-01", "2015-05-01")),
              length.out = 6000),
  date2 = rep(as.Date(c("2012-12-09", "2012-12-09", "2015-03-10", "2015-03-10")),
              length.out = 6000)
)

system.time({
df %>% 
  group_by(id) %>% 
  complete(date1 = seq.Date(from = min(date1), 
          length.out = 3, by = "month"), date2 = date2[1])
})
#user  system elapsed 
#64.05   21.27   86.05 

system.time({
setDT(df)[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id]
})
#user  system elapsed 
#  0.14    0.00    0.14 
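
The benchmark above uses 3,000 groups, while the question mentions roughly 150K. The following is a minimal sketch for timing the same data.table expression at that scale. The names `n_groups`, `big`, and `res` are made up for illustration, and the synthetic data is an assumption that only mimics the shape of the question's data. No timing is quoted here because it depends on the machine.

```r
library(data.table)

# Hypothetical synthetic data: 150K groups with two monthly rows each
n_groups <- 150000L
big <- data.table(
  id    = rep(seq_len(n_groups), each = 2L),
  date1 = rep(as.Date(c("2013-01-01", "2013-02-01")), times = n_groups),
  date2 = as.Date("2012-12-09")  # length-1 value is recycled to all rows
)

# Same per-group expression as in the answer, timed on the larger input
timing <- system.time(
  res <- big[, .(date1 = seq(min(date1), length.out = 3, by = "month"),
                 date2 = date2[1]),
             by = id]
)
print(timing)

# Each group should now have exactly three rows
stopifnot(nrow(res) == 3L * n_groups)
```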

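【Comments】:

  • Is the data.table code equivalent? It seems you would take an object like your result and then left join it to the main table to "complete" it..?
  • @Frank That is a different way of doing it, but system.time({ setDT(df)[df[, .(date1 = seq(min(date1), length.out = 3, by = 'month'), date2 = date2[1]), id], on = .(id, date1, date2)] }) gives user 0.31, system 0.03, elapsed 0.59, which is still much less time for me.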
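
The join-based variant discussed in the comments can be written out step by step. This is a sketch under assumptions, not the answerer's exact code: the `value` column is hypothetical and is added only to show that a join keeps columns the plain `seq()` approach would drop, filling them with NA for the newly created rows.

```r
library(data.table)

# Toy data from the question, plus a hypothetical extra column `value`
dt <- data.table(
  id    = c(1, 1, 2, 2),
  date1 = as.Date(c("2013-01-01", "2013-02-01", "2015-04-01", "2015-05-01")),
  date2 = as.Date(c("2012-12-09", "2012-12-09", "2015-03-10", "2015-03-10")),
  value = c(10, 20, 30, 40)
)

# Step 1: build the full set of (id, date1, date2) combinations per group
grid <- dt[, .(date1 = seq(min(date1), length.out = 3, by = "month"),
               date2 = date2[1]),
           by = id]

# Step 2: look up each grid row in dt; grid rows without a match in dt
# are kept, with NA in the non-key columns
completed <- dt[grid, on = .(id, date1, date2)]
print(completed)
```

If the table has no columns beyond id, date1, and date2, the join adds nothing and the `grid` step alone gives the result shown in the question.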
【Answer 2】:

If you need speed, keep things as lean as possible:

library(data.table)
library(lubridate)

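dt <- as.data.table(df)  # assumes df is the data frame from the question

dt[, .SD
   ][, .(date1 = max(date1)), .(id, date2)               # last date1 per group
   ][, date1Inc := date1 + months(1)                     # the following month
   ][, rbind(dt, .SD[, .(id, date1 = date1Inc, date2)])  # append the new rows
   ][order(id, date1)
   ]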

   id      date1      date2
1:  1 2013-01-01 2012-12-09
2:  1 2013-02-01 2012-12-09
3:  1 2013-03-01 2012-12-09
4:  2 2015-04-01 2015-03-10
5:  2 2015-05-01 2015-03-10
6:  2 2015-06-01 2015-03-10
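
One caveat with `date1 + months(1)`: the dates in the question all fall on the first of the month, so the addition is safe there. For dates near the end of a month, lubridate's plain period arithmetic can yield NA. The sketch below illustrates this with a made-up date `d`, along with lubridate's `%m+%` operator as one possible alternative.

```r
library(lubridate)

d <- as.Date("2013-01-31")  # hypothetical month-end date, not from the question

# Plain period addition: "2013-02-31" does not exist, so the result is NA
plain <- d + months(1)

# %m+% rolls back to the last valid day of the target month instead
rolled <- d %m+% months(1)

print(plain)
print(rolled)
```

Whether rolling back to the month end is the right behaviour depends on the data, so this is an option to consider rather than a drop-in fix.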
