【Question title】: Creating time series with loops and populating them with values
【Posted】: 2019-03-27 15:03:14
【Question】:

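I have a data frame in which each row represents one part of a time series.

I need to build an overall time series that spans several years, for up to several hundred units.

Each row sets the value for a specific period; outside those periods the series has to revert to the unit's maximum value (given by maks).

See the example here:

Code: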

library(tidyr)
library(dplyr)

# My data for 3 units
df <- structure(list(Unit = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 
3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Limit = c(850L, 
655L, 500L, 1000L, 100L, 75L, 0L, 600L, 635L), Max = c(1310L, 
1310L, 1310L, 1300L, 1300L, 1300L, 915L, 915L, 915L), startDate = structure(c(1483250400, 
1430481600, 1546286400, 1421280000, 1498813200, 1546300800, 1420869600, 
1527876000, 1463097600), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
endDate = structure(c(1496275200, 1451520000, 1609459200, 
1426431600, 1527811200, 1577836800, 1433170800, 1546383600, 
1464807600), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA, 
-9L), class = "data.frame")


# Doing a loop to create time series for each row
d <- NULL

for(i in 1:nrow(df)) {
  d <- rbind(d, data.frame(Date = seq.POSIXt(from = df$startDate[i], to = df$endDate[i], by = "hour"), 
                           value = df$Limit[i], 
                           unit = df$Unit[i]))

}

# Spread it out to a nice data frame
d <- spread(d, unit, value = value)

# Left join on a global time series
globalStart <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
globalEnd <- as.POSIXct("2021-12-01 00:00:00", tz = "UTC")

dfResult <- data.frame(Date = seq.POSIXt(from = globalStart, to = globalEnd, by = "hour"))

# Now join it together
dfResult <- left_join(dfResult, d, by = "Date")

# Add values to fill out NA with max
maks <- c(915, 1315, 900)

dfResult[is.na(dfResult[, 2]), 2] <- maks[1]
dfResult[is.na(dfResult[, 3]), 3] <- maks[2]
dfResult[is.na(dfResult[, 4]), 4] <- maks[3]

# Final result
dfResult

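My problem is that this takes about 35 minutes on my data set, and that is with only 58 units. I may need to do this for thousands of units, so I need to speed it up considerably.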

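【Comments】:

  • Have you tried both approaches to see which one is fastest?
  • Yes, I prefer base R ;)
  • data.table would work as well: setDT(df)[ , list(Unit = Unit, Limit = Limit, Max = Max, Date = seq(startDate, endDate, by = "hour" )), by = 1:nrow(df)]
  • Yes, I know there are many ways to do this. I'm just curious which one is fastest.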
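
The data.table one-liner from the comment above can be tried on a small input. This is only a sketch: the table `toy` and its values are hypothetical stand-ins for the asker's `df`, and the `Max` column is left out for brevity.

```r
library(data.table)

# Hypothetical toy input: two short intervals, one per unit
toy <- data.table(
  Unit      = c("A", "B"),
  Limit     = c(10L, 20L),
  startDate = as.POSIXct(c("2019-01-01 00:00:00", "2019-01-01 02:00:00"), tz = "UTC"),
  endDate   = as.POSIXct(c("2019-01-01 02:00:00", "2019-01-01 03:00:00"), tz = "UTC")
)

# Grouping by row number makes every row its own group, so seq() is called
# once per interval; the length-1 Unit and Limit values are recycled to
# match the length of the generated Date vector
long <- toy[, list(Unit = Unit,
                   Limit = Limit,
                   Date = seq(startDate, endDate, by = "hour")),
            by = 1:nrow(toy)]

long
```

The result has one row per hour of each interval, which is the same long format the loop in the question builds before calling spread().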

Tags: r performance for-loop time


【Solution 1】:

One option is to replace your loop with the following code:

# creates date list:
dateList <- lapply(1:nrow(df), function(x) {
  seq.POSIXt(df$startDate[x], df$endDate[x], by = "hour")
  })
llengths <- lengths(dateList) # lengths of date vectors
# combine everything together:
d <- data.frame(Date = do.call("c", dateList),
                value = rep(df$Limit, llengths),
                unit = rep(df$Unit, llengths))

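The main problem with your code is the use of rbind. Every call copies the whole data frame accumulated so far, so the cost grows rapidly with the number of rows. Try to avoid it, especially inside loops.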
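
To check that building the pieces first and combining them once gives the same rows as growing a data frame with rbind, both patterns can be run side by side on a small input. This is a sketch only: `toy`, `grown` and `built` are hypothetical names, and the values are made up rather than taken from the question.

```r
# Hypothetical toy input: two short intervals, one per unit
toy <- data.frame(
  Unit      = c("A", "B"),
  Limit     = c(10L, 20L),
  startDate = as.POSIXct(c("2019-01-01 00:00:00", "2019-01-01 02:00:00"), tz = "UTC"),
  endDate   = as.POSIXct(c("2019-01-01 02:00:00", "2019-01-01 03:00:00"), tz = "UTC")
)

# Slow pattern: the accumulated data frame is copied on every iteration
grown <- NULL
for (i in seq_len(nrow(toy))) {
  grown <- rbind(grown, data.frame(
    Date  = seq(toy$startDate[i], toy$endDate[i], by = "hour"),
    value = toy$Limit[i],
    unit  = toy$Unit[i]
  ))
}

# Faster pattern: generate every date vector first, then build the
# data frame in a single step
dates <- lapply(seq_len(nrow(toy)), function(i) {
  seq(toy$startDate[i], toy$endDate[i], by = "hour")
})
n <- lengths(dates)
built <- data.frame(
  Date  = do.call("c", dates),
  value = rep(toy$Limit, n),
  unit  = rep(toy$Unit, n)
)

# Both approaches should produce the same rows in the same order
same_rows <- nrow(grown) == nrow(built) &&
  all(grown$value == built$value) &&
  all(as.numeric(grown$Date) == as.numeric(built$Date))
same_rows
```

On an input this small the timing difference is negligible; the gap only becomes visible once the accumulated data frame is large, as in the asker's 58-unit case.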

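    【Solution 2】:

    You can try to minimize the number of steps and use data.table.

    Instead of expanding the original data.frame, you can join it to the full timeline with foverlaps() and then spread the result with dcast(). After that, simply update the NA values by reference.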

    library(data.table)
    
    df <- structure(list(Unit = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
                         Limit = c(850L, 655L, 500L, 1000L, 100L, 75L, 0L, 600L, 635L), Max = c(1310L, 1310L, 1310L, 1300L, 1300L, 1300L, 915L, 915L, 915L),
                         startDate = structure(c(1483250400, 1430481600, 1546286400, 1421280000, 1498813200, 1546300800, 1420869600, 1527876000, 1463097600), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
                         endDate = structure(c(1496275200, 1451520000, 1609459200, 1426431600, 1527811200, 1577836800, 1433170800, 1546383600, 1464807600), class = c("POSIXct", "POSIXt"), tzone = "UTC")),
                    row.names = c(NA, -9L), class = "data.frame")
    
    # use data.table with key
    dataset <- as.data.table(df)
    setkey(dataset, "startDate", "endDate")
    
    # start - end for whole timeline
    globalStart <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
    globalEnd <- as.POSIXct("2021-12-01 00:00:00", tz = "UTC")
    
    # data.table of start and end for each period, used with `foverlaps()`
    timeline_dt <- data.table(Dttm = seq.POSIXt(globalStart, globalEnd, "hour"),
                              endDate = seq.POSIXt(globalStart, globalEnd, "hour") + 3599L)
    # not required to be keyed but may as well
    setkey(timeline_dt, "Dttm", "endDate")
    
    # join by period overlapping, retain columns of interest
    dataset_result <- foverlaps(timeline_dt, dataset)[, .(Dttm, Unit, Limit)]
    # same as spread, drop NA column
    dataset_result <- dcast(dataset_result, Dttm ~ Unit, value.var = "Limit")[, -"NA"]
    
    # iterate over maks, and update by reference 
    # i + 1L because we skip first column which is Dttm
    maks <- c(915, 1315, 900)
    for (i in seq_along(maks)) {
      set(x = dataset_result,
          i = which(is.na(dataset_result[[i+1L]])),
          j = i+1L,
          value = maks[i])
    }
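
The same three steps (overlap join, dcast, fill by reference) are easier to follow on a tiny timeline. The sketch below uses hypothetical names and values (`limits`, `timeline`, `caps`) and a five-hour window; it is meant as an illustration of the technique, not as a replacement for the code above.

```r
library(data.table)

# Hypothetical toy intervals: unit A is limited to 10 during hours 01 and 02,
# unit B is limited to 20 during hour 03
limits <- data.table(
  Unit  = c("A", "B"),
  Limit = c(10L, 20L),
  start = as.POSIXct(c("2019-01-01 01:00:00", "2019-01-01 03:00:00"), tz = "UTC"),
  end   = as.POSIXct(c("2019-01-01 02:59:59", "2019-01-01 03:59:59"), tz = "UTC")
)
setkey(limits, start, end)  # foverlaps() needs the second table to be keyed

# One row per hour, each covering [hour_start, hour_start + 3599 seconds]
hours <- seq(as.POSIXct("2019-01-01 00:00:00", tz = "UTC"),
             by = "hour", length.out = 5)
timeline <- data.table(hour_start = hours, hour_end = hours + 3599)

# Overlap join: every hour is kept, hours without a limit get NA
joined <- foverlaps(timeline, limits,
                    by.x = c("hour_start", "hour_end"))[, .(hour_start, Unit, Limit)]

# Long to wide, then keep only the real unit columns
wide <- dcast(joined, hour_start ~ Unit, value.var = "Limit")
wide <- wide[, .(hour_start, A, B)]

# Fill the gaps by reference with each unit's cap (hypothetical values)
caps <- c(A = 100L, B = 200L)
for (u in names(caps)) {
  set(wide, i = which(is.na(wide[[u]])), j = u, value = caps[[u]])
}

wide
```

Selecting the unit columns by name avoids relying on how dcast() labels the column for unmatched hours, and indexing `caps` by name keeps each fill value tied to its unit rather than to a column position.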
    
