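[Posted]: 2021-11-20 16:09:21
[Question]:
I have a list of 106 tibbles in data_sensor, each of which is a time series. Every tibble has two columns: date and Temperature.
Separately, date_admin holds 106 dates, one per tibble, giving the date at which I want each time series to end.
The code works, but the nested for loops take far too long, because each tibble has close to 10,000 rows on average.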
library(tidyverse)
library(readxl)  # read_xls() and read_xlsx() are not attached by library(tidyverse)
# List holding one data frame per .xls file
files <- dir("C:/User*inals", pattern = "\\.xls$", full.names = TRUE)
data_sensor <- lapply(files, read_xls)
# List holding one data frame per .xlsx file
filesx <- dir("C:/Us******ls", pattern = "\\.xlsx$", full.names = TRUE)
data_generic <- lapply(filesx, read_xlsx)
idxend <- vector()
# date_admin (the 106 cutoff dates printed further down) must already exist here
for (i in seq_along(data_sensor)) {
  for (j in seq_along(data_sensor[[i]][[1]])) {
    if (as.Date(data_sensor[[i]][[1]][[j]]) >= as.Date(date_admin[i])) {
      # Set every row on or after the cutoff date to NA
      data_sensor[[i]][[1]][[j]] <- NA
      data_sensor[[i]][[2]][[j]] <- NA
    }
  }
}
# Drop all rows containing NA (run once, after the marking loop has finished)
for (i in seq_along(data_sensor)) {
  data_sensor[[i]] <- drop_na(data_sensor[[i]])
}
To clarify the structure of my list of tibbles and of the date vector:
> data_sensor[[1]][[1]][[1]]
[1] "2018-08-07 11:00:31 UTC"
> data_sensor[[1]][[2]][[1]]
[1] 6.3
> data_sensor[[2]][[1]][[1]]
[1] "2018-08-08 11:56:05 UTC"
#data_sensor[[index of list]][[column of tibble(date,Temperature)]][[row of tibble]]
> date_admin
[1] "2018-10-07 UTC" "2018-12-29 UTC" "2018-12-13 UTC" "2019-08-09 UTC" "2019-10-10 UTC"
[6] "2019-04-26 UTC" "2018-11-21 UTC" "2018-08-23 UTC" "2019-07-08 UTC" "2019-11-19 UTC"
[11] "2019-11-07 UTC" "2018-09-05 UTC" "2018-09-03 UTC" "2018-09-24 UTC" "2018-10-11 UTC"
[16] "2018-09-25 UTC" "2019-03-29 UTC" "2018-08-20 UTC" "2018-09-17 UTC" "2019-03-30 UTC"
[21] "2018-11-07 UTC" "2019-01-01 UTC" "2018-08-31 UTC" "2019-03-27 UTC" "2019-11-10 UTC"
[26] "2019-04-04 UTC" "2019-10-18 UTC" "2018-09-06 UTC" "2018-09-23 UTC" "2018-09-22 UTC"
[31] "2019-07-22 UTC" "2018-09-04 UTC" "2019-05-17 UTC" "2018-11-05 UTC" "2018-12-09 UTC"
[36] "2018-09-03 UTC" "2019-05-21 UTC" "2019-02-22 UTC" "2018-08-30 UTC" "2019-06-04 UTC"
[41] "2018-09-13 UTC" "2018-10-14 UTC" "2019-11-08 UTC" "2018-08-30 UTC" "2019-04-12 UTC"
[46] "2018-09-24 UTC" "2018-08-22 UTC" "2018-08-30 UTC" "2018-09-07 UTC" "2018-11-11 UTC"
[51] "2018-11-01 UTC" "2018-10-01 UTC" "2018-10-22 UTC" "2018-12-03 UTC" "2019-06-06 UTC"
[56] "2018-09-09 UTC" "2018-09-10 UTC" "2018-09-24 UTC" "2018-10-11 UTC" "2018-11-30 UTC"
[61] "2018-09-20 UTC" "2019-11-20 UTC" "2018-10-11 UTC" "2018-10-09 UTC" "2018-09-27 UTC"
[66] "2019-11-11 UTC" "2018-10-04 UTC" "2018-09-14 UTC" "2019-04-27 UTC" "2018-09-04 UTC"
[71] "2018-09-11 UTC" "2018-08-14 UTC" "2018-09-01 UTC" "2018-10-01 UTC" "2018-09-25 UTC"
[76] "2018-09-28 UTC" "2018-09-29 UTC" "2018-10-11 UTC" "2019-03-26 UTC" "2018-10-26 UTC"
[81] "2018-11-21 UTC" "2018-12-02 UTC" "2018-09-08 UTC" "2019-01-08 UTC" "2018-11-07 UTC"
[86] "2019-02-05 UTC" "2019-01-21 UTC" "2018-09-11 UTC" "2018-12-17 UTC" "2019-01-15 UTC"
[91] "2018-08-28 UTC" "2019-01-08 UTC" "2019-05-14 UTC" "2019-01-21 UTC" "2018-11-12 UTC"
[96] "2018-10-26 UTC" "2019-12-26 UTC" "2020-01-03 UTC" "2020-01-06 UTC" "2020-02-26 UTC"
[101] "2020-02-14 UTC" "2020-01-27 UTC" "2020-01-21 UTC" "2020-03-16 UTC" "2020-02-26 UTC"
[106] "2019-12-31 UTC"
> data_sensor[[1]]
                   date Temperature
1   2018-08-07 11:00:31         6.3
2   2018-08-07 11:10:31        11.4
3   2018-08-07 11:20:31        12.0
4   2018-08-07 11:30:31        13.7
5   2018-08-07 11:40:31        15.6
6   2018-08-07 11:50:31        13.6
7   2018-08-07 12:00:31        12.2
8   2018-08-07 12:10:31        11.2
9   2018-08-07 12:20:31        11.6
...............................
499 2018-08-10 22:00:31         9.7
500 2018-08-10 22:10:31         9.6
 [ reached 'max' / getOption("max.print") -- omitted 8592 rows ]
Cleaning the data with the nested for loops takes several minutes. How can I improve the performance of this code?
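For reference, the trimming described above can be done with one vectorized comparison per tibble instead of one comparison per row. The following is only a minimal base R sketch, not a tested drop-in replacement: the two small data frames and the two cutoff dates are made-up stand-ins for data_sensor and date_admin, and `trim_series` is a hypothetical helper name.

```r
# Toy stand-ins for data_sensor and date_admin (values are illustrative only)
data_sensor <- list(
  data.frame(
    date = as.POSIXct(c("2018-08-07 11:00:31",
                        "2018-10-06 23:50:00",
                        "2018-10-07 00:10:00"), tz = "UTC"),
    Temperature = c(6.3, 11.4, 12.0)
  ),
  data.frame(
    date = as.POSIXct(c("2018-08-08 11:56:05",
                        "2018-12-29 08:00:00"), tz = "UTC"),
    Temperature = c(7.1, 3.2)
  )
)
date_admin <- as.POSIXct(c("2018-10-07", "2018-12-29"), tz = "UTC")

# Hypothetical helper: keep only the rows dated strictly before the cutoff.
# The comparison runs on the whole date column at once, so no inner loop is needed.
trim_series <- function(df, cutoff) {
  df[as.Date(df[[1]]) < as.Date(cutoff), , drop = FALSE]
}

# One pass over the list; indexing date_admin with [i] keeps its date-time class
data_trimmed <- lapply(seq_along(data_sensor), function(i) {
  trim_series(data_sensor[[i]], date_admin[i])
})
```

Because the unwanted rows are removed directly, the intermediate step of writing NA values and then calling drop_na() is not needed in this version.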
I get an error when running the code from the answer:
> data_sensor =
+ tibble(
+ file = paste("file",1:length(date_admin)),
+ date_admin = date_admin
+ ) %>%
+ mutate(data_sensor = map(file, ~data_sensor))
> data_sensor
# A tibble: 106 x 3
file date_admin data_sensor
<chr> <dttm> <list>
1 file 1 2018-10-07 00:00:00 <list [106]>
2 file 2 2018-12-29 00:00:00 <list [106]>
3 file 3 2018-12-13 00:00:00 <list [106]>
Before running this code, the class of my data_sensor was list; afterwards it became:
[1] "tbl_df" "tbl" "data.frame"
The error occurs in this block:
> data_sensor = data_sensor %>%
+ group_by(file) %>%
+ group_modify(~f(.x))
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "list"
> class(data_sensor)
[1] "tbl_df" "tbl" "data.frame"
> data_sensor = data_sensor %>%
+ group_by(file) %>%
+ group_modify(~f(.x))
Error in UseMethod("mutate") :
no applicable method for 'mutate' applied to an object of class "list"
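The printed tibble shows `<list [106]>` in every row, which suggests that `map(file, ~data_sensor)` stored the entire 106-element list in each cell rather than one tibble per row. If the answer's `f()` (not shown here) calls `mutate()` on such a cell, that would be consistent with the error above. The following is a hedged sketch of one way to build the nested tibble so that each row holds a single tibble; `sensor_list`, `nested`, and `trim_one` are hypothetical names, the data are toy values, and `trim_one` is only a stand-in for whatever `f()` actually does.

```r
library(tibble)
library(dplyr)

# Toy stand-in for the original list of tibbles (illustrative values only)
sensor_list <- list(
  tibble(
    date = as.POSIXct(c("2018-08-07 11:00:31",
                        "2018-10-06 23:50:00",
                        "2018-10-07 00:10:00"), tz = "UTC"),
    Temperature = c(6.3, 11.4, 12.0)
  ),
  tibble(
    date = as.POSIXct(c("2018-08-08 11:56:05",
                        "2018-12-29 08:00:00"), tz = "UTC"),
    Temperature = c(7.1, 3.2)
  )
)
date_admin <- as.POSIXct(c("2018-10-07", "2018-12-29"), tz = "UTC")

# Passing the list itself as a column gives one tibble per row,
# instead of repeating the whole list in every row
nested <- tibble(
  file = paste("file", seq_along(date_admin)),
  date_admin = date_admin,
  data_sensor = sensor_list
)

# Hypothetical stand-in for the answer's f(): trim one tibble at its cutoff
trim_one <- function(df, cutoff) {
  filter(df, as.Date(date) < as.Date(cutoff))
}

# Apply the trim row by row, pairing each tibble with its own cutoff date
nested <- nested %>%
  mutate(data_sensor = lapply(seq_along(data_sensor), function(i) {
    trim_one(data_sensor[[i]], date_admin[i])
  }))
```

With this layout, printing `nested` should show a `<tibble>` entry per row in the data_sensor column rather than `<list [106]>`.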
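[Comments]:
- You could try computing each inner loop in parallel with foreach. Here are some examples: stackoverflow.com/questions/69362113/… and r-bloggers.com/2016/07/…
- Thanks @Skaqqs. I'm still working out how to apply this to my case, but I will definitely use your suggestion at some point!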
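As an illustration of the foreach suggestion in the comments, here is a rough sketch that processes each tibble in a separate parallel task. It assumes the foreach and doParallel packages are installed, uses a two-worker cluster chosen arbitrarily, and works on made-up stand-ins for data_sensor and date_admin. For a filter this cheap, the overhead of starting workers and copying data may well outweigh any gain, so it is worth timing against a plain vectorized version first.

```r
library(foreach)
library(doParallel)  # also attaches the parallel package (makeCluster, stopCluster)

# Toy stand-ins for data_sensor and date_admin (values are illustrative only)
data_sensor <- list(
  data.frame(
    date = as.POSIXct(c("2018-08-07 11:00:31",
                        "2018-10-07 00:10:00"), tz = "UTC"),
    Temperature = c(6.3, 12.0)
  ),
  data.frame(
    date = as.POSIXct(c("2018-08-08 11:56:05",
                        "2018-12-29 08:00:00"), tz = "UTC"),
    Temperature = c(7.1, 3.2)
  )
)
date_admin <- as.POSIXct(c("2018-10-07", "2018-12-29"), tz = "UTC")

# Start a small local cluster; the worker count of 2 is an arbitrary choice
cl <- makeCluster(2)
registerDoParallel(cl)

# Each iteration trims one tibble with a vectorized comparison;
# foreach returns the results as a list, in the same order as the input
data_trimmed <- foreach(i = seq_along(data_sensor)) %dopar% {
  df <- data_sensor[[i]]
  df[as.Date(df[[1]]) < as.Date(date_admin[i]), , drop = FALSE]
}

stopCluster(cl)
```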
Tags: r performance time-series nested-loops tibble