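【Posted】: 2019-03-27 15:03:14
【Question】:
I have a data frame in which each row represents one segment of a time series.
I need to build a combined time series that spans several years, for up to several hundred units.
Each row sets the value for a specific period; outside those periods the series has to revert to the unit's maximum value (given by maks).
See the example here:
Code: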
library(tidyr)
library(dplyr)
# My data for 3 units
df <- structure(list(Unit = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L,
3L, 3L), .Label = c("A", "B", "C"), class = "factor"), Limit = c(850L,
655L, 500L, 1000L, 100L, 75L, 0L, 600L, 635L), Max = c(1310L,
1310L, 1310L, 1300L, 1300L, 1300L, 915L, 915L, 915L), startDate = structure(c(1483250400,
1430481600, 1546286400, 1421280000, 1498813200, 1546300800, 1420869600,
1527876000, 1463097600), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
endDate = structure(c(1496275200, 1451520000, 1609459200,
1426431600, 1527811200, 1577836800, 1433170800, 1546383600,
1464807600), class = c("POSIXct", "POSIXt"), tzone = "UTC")), row.names = c(NA,
-9L), class = "data.frame")
# Doing a loop to create time series for each row
d <- NULL
for(i in 1:nrow(df)) {
d <- rbind(d, data.frame(Date = seq.POSIXt(from = df$startDate[i], to = df$endDate[i], by = "hour"),
value = df$Limit[i],
unit = df$Unit[i]))
}
# Spread it out to a nice data frame
d <- spread(d, unit, value = value)
# Left join on a global time series
globalStart <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
globalEnd <- as.POSIXct("2021-12-01 00:00:00", tz = "UTC")
dfResult <- data.frame(Date = seq.POSIXt(from = globalStart, to = globalEnd, by = "hour"))
# Now join it together
dfResult <- left_join(dfResult, d, by = "Date")
# Fill the remaining NAs with the max value, one entry of maks per unit column (A, B, C)
maks <- c(915, 1315, 900)
dfResult[is.na(dfResult[, 2]), 2] <- maks[1]
dfResult[is.na(dfResult[, 3]), 3] <- maks[2]
dfResult[is.na(dfResult[, 4]), 4] <- maks[3]
# Final result
dfResult
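My problem is that this takes about 35 minutes on my data set, and that is with only 58 units. I may need to do this for thousands of units, so I need to speed it up considerably.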
【Comments】:
-
Have you tried both approaches to see which one is fastest?
-
Yes, I prefer base R ;)
-
data.table would work too: setDT(df)[ , list(Unit = Unit, Limit = Limit, Max = Max, Date = seq(startDate, endDate, by = "hour" )), by = 1:nrow(df)]
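The one-liner only expands the rows to hourly records. A rough sketch of carrying it through to the wide result might look like the following. The small `dt` is a hypothetical stand-in for the question's `df`, and the code assumes that periods for the same unit do not overlap, that `Max` is constant within a unit, and that all timestamps fall on whole hours.

```r
library(data.table)

# Hypothetical stand-in for the question's df (same columns, tiny hourly range)
dt <- data.table(
  Unit      = c("A", "A", "B"),
  Limit     = c(850, 655, 100),
  Max       = c(1310, 1310, 1300),
  startDate = as.POSIXct(c("2015-01-01 02:00:00", "2015-01-01 06:00:00",
                           "2015-01-01 01:00:00"), tz = "UTC"),
  endDate   = as.POSIXct(c("2015-01-01 04:00:00", "2015-01-01 07:00:00",
                           "2015-01-01 03:00:00"), tz = "UTC")
)

# Expand every row to hourly records in one grouped call (no rbind in a loop)
long <- dt[, .(Unit, Limit, Date = seq(startDate, endDate, by = "hour")),
           by = seq_len(nrow(dt))]

# One column per unit (assumes at most one value per Date/Unit pair)
wide <- dcast(long, Date ~ Unit, value.var = "Limit")

# Left join onto the global hourly grid
grid <- data.table(Date = seq(as.POSIXct("2015-01-01 00:00:00", tz = "UTC"),
                              as.POSIXct("2015-01-01 09:00:00", tz = "UTC"),
                              by = "hour"))
wide <- merge(grid, wide, by = "Date", all.x = TRUE)

# Fill the gaps with each unit's Max, updating the columns by reference
unitMax <- unique(dt[, .(Unit, Max)])
for (k in seq_len(nrow(unitMax))) {
  u <- unitMax$Unit[k]
  set(wide, i = which(is.na(wide[[u]])), j = u, value = unitMax$Max[k])
}
```

Taking the fill value from the `Max` column rather than a separate `maks` vector is a choice made for this sketch; it keeps the value tied to the unit name instead of to column position.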
-
Yes, I know there are many ways to do this. I'm just curious which one is fastest.
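For comparison, here is a minimal base R sketch that skips the expand, spread and join steps entirely: it preallocates an hours-by-units matrix filled with each unit's max and then overwrites the hour ranges covered by each row. The small `df` here is a hypothetical stand-in for the question's data, and the sketch assumes whole-hour timestamps and a constant `Max` per unit. It has not been benchmarked against the other approaches.

```r
# Hypothetical stand-in for the question's df (same columns, tiny hourly range)
df <- data.frame(
  Unit      = c("A", "A", "B"),
  Limit     = c(850, 655, 100),
  Max       = c(1310, 1310, 1300),
  startDate = as.POSIXct(c("2015-01-01 02:00:00", "2015-01-01 06:00:00",
                           "2015-01-01 01:00:00"), tz = "UTC"),
  endDate   = as.POSIXct(c("2015-01-01 04:00:00", "2015-01-01 07:00:00",
                           "2015-01-01 03:00:00"), tz = "UTC")
)

globalStart <- as.POSIXct("2015-01-01 00:00:00", tz = "UTC")
globalEnd   <- as.POSIXct("2015-01-01 09:00:00", tz = "UTC")
dates <- seq(globalStart, globalEnd, by = "hour")

units   <- unique(as.character(df$Unit))
unitMax <- df$Max[match(units, df$Unit)]  # first Max seen for each unit

# Preallocate: every row of the matrix is the vector of unit maxima
res <- matrix(unitMax, nrow = length(dates), ncol = length(units),
              byrow = TRUE, dimnames = list(NULL, units))

# Row positions of each period on the global hourly grid (1-based)
first <- as.integer(round(as.numeric(difftime(df$startDate, globalStart, units = "hours")))) + 1L
last  <- as.integer(round(as.numeric(difftime(df$endDate,   globalStart, units = "hours")))) + 1L

# Overwrite the covered hours; periods are clipped to the global range
for (i in seq_len(nrow(df))) {
  lo <- max(first[i], 1L)
  hi <- min(last[i], length(dates))
  if (lo <= hi) res[lo:hi, as.character(df$Unit[i])] <- df$Limit[i]
}

dfResult <- data.frame(Date = dates, res, check.names = FALSE)
```

There is still a loop over the rows of `df`, but each iteration is a single indexed assignment into a preallocated matrix, so nothing grows or gets copied as the loop runs.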
Tags: r performance for-loop time