Posted: 2021-04-30 10:04:02
Question:
I have a dataset called all.cols2 with water depths collected every 20 minutes at 94 locations over more than 3 years. Here is a preview:
# A tibble: 89,714 x 95
date_time Levee.slope Levee.slope.1 Levee.slope.2 Levee.slope.3
<dttm> <dbl> <dbl> <dbl> <dbl>
1 2015-12-01 15:05:33 -0.821 -0.539 -0.325 -0.0991
2 2015-12-01 15:25:33 -0.830 -0.548 -0.334 -0.108
3 2015-12-01 15:45:33 -0.829 -0.547 -0.333 -0.107
4 2015-12-01 16:05:33 -0.833 -0.551 -0.337 -0.111
5 2015-12-01 16:25:33 -0.829 -0.547 -0.333 -0.107
6 2015-12-01 16:45:33 -0.834 -0.552 -0.338 -0.112
7 2015-12-01 17:05:33 -0.839 -0.557 -0.343 -0.117
8 2015-12-01 17:25:33 -0.835 -0.553 -0.339 -0.113
9 2015-12-01 17:45:33 -0.826 -0.544 -0.330 -0.104
10 2015-12-01 18:05:33 -0.804 -0.522 -0.308 -0.0821
# ... with 89,704 more rows, and 90 more variables: Levee.slope.4 <dbl>,
I am calculating metrics for individual flood events at each location.
So far I have been using the for loop below to calculate these metrics one location at a time, then exporting the results and copying and pasting them into an Excel file, which takes a very long time. Here is the code I have been using:
for (i in 1:length(list.sites[[1]])) {
#Label and subset by site
site <- paste0("WaterLevel_",noquote(list.sites[[1]][i]))
mut_sub <- all.cols2 %>% select(date_time, all_of(site))
# creates binary for positive/negative water level values
mut_sub$VarA <- as.integer(mut_sub[[2]] > 0)
# This code is used to label flood events with unique streak_id
mut_sub <- mut_sub %>% mutate(lagged = lag(VarA))
mut_sub <- mut_sub %>% mutate(start = (VarA != lagged))
mut_sub[1, "start"] <- FALSE
#filter to keep positive water depths (VarA == 1)
mut_sub <- mut_sub %>% mutate(streak_id = cumsum(start)) %>%
filter(VarA == 1)
#calculate mean water depth
ls <- aggregate(mut_sub[,2], by= list(mut_sub$streak_id), FUN = mean, na.rm = TRUE)
names(ls)[2] <- "avg_water_depth"
#calculate max water depth
MAX <- aggregate(mut_sub[,2], by = list(mut_sub$streak_id), FUN = max, na.rm = TRUE)
names(MAX)[2] <- "max_depth"
#getting length (# of observations) of each event
obs <- aggregate(mut_sub[,2], by = list(mut_sub$streak_id), FUN = length)
names(obs)[2] <- "observations"
#calculating number of days per event (duration)
obs <- obs %>%
mutate(duration_days = (((observations-1)*20)/60)/24)
#Time interval:
time <- mut_sub %>% group_by(streak_id) %>% summarise(begin = min(date_time), end = max(date_time))
time <- time %>% rename(Group.1 = streak_id)
#combine data
results1 <- inner_join(ls, MAX)
results2 <- inner_join(results1, obs)
final <- inner_join(results2, time)
#way to label sites
final$site = paste(site, final$Group.1, sep = "_")
}
### ...repeat the above for each survey point, export, and add manually in Excel
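For what it's worth, the per-site logic above can be wrapped in a single function and mapped over the site names. This is only a sketch, under the assumption that the entries of list.sites[[1]] match column names in all.cols2; the event labelling (cumsum over changes in the positive/negative indicator) is equivalent to the lagged/start/streak_id steps, just done in one mutate:

```r
library(dplyr)

# Sketch of a helper computing the per-event summary for one site column.
summarise_site <- function(site_col, data) {
  data %>%
    select(date_time, depth = all_of(site_col)) %>%
    # label flood events: a new streak starts whenever the indicator changes
    mutate(flooded = depth > 0,
           streak_id = cumsum(flooded != lag(flooded, default = first(flooded)))) %>%
    filter(flooded) %>%                      # keep positive water depths only
    group_by(streak_id) %>%
    summarise(avg_water_depth = mean(depth, na.rm = TRUE),
              max_depth       = max(depth, na.rm = TRUE),
              observations    = n(),
              duration_days   = (observations - 1) * 20 / 60 / 24,
              begin           = min(date_time),
              end             = max(date_time),
              .groups = "drop") %>%
    mutate(site = paste(site_col, streak_id, sep = "_"))
}

# One combined data frame for all sites, no export-and-paste step:
# all_events <- bind_rows(lapply(list.sites[[1]], summarise_site, data = all.cols2))
```

Because every call returns a data frame with the same columns, bind_rows() stacks them into one combined table.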
The loop gives output like the following (from one site):
Group.1 avg_water_depth max_depth observations duration_days begin end site
1 0.025245673 0.033995673 4 0.04166667 2016-02-09 2016-02-09 WaterLevel_Levee.slope.1_1
3 0.045995673 0.071995673 8 0.09722222 2016-05-06 2016-05-06 WaterLevel_Levee.slope.1_3
5 0.003995673 0.005995673 2 0.01388889 2016-05-06 2016-05-06 WaterLevel_Levee.slope.1_5
7 0.039370673 0.061995673 8 0.09722222 2016-05-07 2016-05-07 WaterLevel_Levee.slope.1_7
9 0.038785147 0.069995673 19 0.25000000 2016-05-27 2016-05-27 WaterLevel_Levee.slope.1_9
11 0.063817102 0.110995673 28 0.37500000 2016-05-27 2016-05-28 WaterLevel_Levee.slope.1_11
13 0.062817102 0.112995673 28 0.37500000 2016-05-28 2016-05-28 WaterLevel_Levee.slope.1_13
15 0.042495673 0.067995673 18 0.23611111 2016-05-28 2016-05-28 WaterLevel_Levee.slope.1_15
...with the average water depth, maximum water depth, number of observations, duration of the flood event, and beginning and ending date/time for each flood event at each location.
Right now I have to specify i before running the for loop; it does not automatically step through my sites.
My question is: is there a way to have the for loop run through all of the locations at once and store the results in a combined output similar to the table above? Also, is there a way to condense the code in the loop so that I do not have to create so many data frames?
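On the second part (condensing the loop), one common pattern, sketched here on toy stand-in data with the same shape as all.cols2 and under the assumption that every column other than date_time is a depth series, is to reshape the data to long format and let group_by() handle all 94 locations at once:

```r
library(dplyr)
library(tidyr)

# Toy stand-in shaped like all.cols2 (replace with the real tibble):
all.cols2 <- tibble(
  date_time = as.POSIXct("2015-12-01 15:05:33", tz = "UTC") + (0:5) * 1200,
  Levee.slope   = c(-0.82, 0.03, 0.05, -0.10, 0.07, 0.02),
  Levee.slope.1 = c(-0.54, -0.55, 0.01, 0.02, -0.11, 0.04)
)

flood_events <- all.cols2 %>%
  pivot_longer(-date_time, names_to = "site", values_to = "depth") %>%
  arrange(site, date_time) %>%
  group_by(site) %>%
  mutate(flooded = depth > 0,
         streak_id = cumsum(flooded != lag(flooded, default = first(flooded)))) %>%
  filter(flooded) %>%
  group_by(site, streak_id) %>%
  summarise(avg_water_depth = mean(depth, na.rm = TRUE),
            max_depth       = max(depth, na.rm = TRUE),
            observations    = n(),
            duration_days   = (observations - 1) * 20 / 60 / 24,
            begin           = min(date_time),
            end             = max(date_time),
            .groups = "drop")
```

One row per flood event per site comes out in a single table, replacing both the loop and the many intermediate data frames.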
Comments:
- Here is one speedup: instead of 2 if_else calls, use a single all.cols2_sub$VarA <- as.integer(all.cols2_sub$Levee.slope > 0). It is much faster. But I would suggest profiling your code first; see help('Rprof').
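The commenter's vectorized indicator, shown on a toy vector for anyone unsure what it replaces:

```r
# One vectorized comparison builds the whole 0/1 column at once,
# instead of row-wise if_else() calls:
depths <- c(-0.821, 0.034, -0.099, 0.072)
VarA <- as.integer(depths > 0)
VarA
# [1] 0 1 0 1
```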
- Could you try wrapping all of the above in a function and then "parallelizing" it? I am not an expert / not sure when it is most effective, but I have had success with it in the past. rdocumentation.org/packages/parallelize.dynamic/versions/0.9-1/…
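A minimal sketch of the parallel suggestion using base R's parallel package. The per_site() helper here is a hypothetical stand-in for a function that returns one site's event table; note that mclapply() relies on forking and is effectively serial on Windows, where parLapply() is the usual substitute:

```r
library(parallel)

# Hypothetical stand-in: replace with a function returning one site's event table.
per_site <- function(site) data.frame(site = site, n_chars = nchar(site))

site_names <- c("Levee.slope", "Levee.slope.1")
results <- mclapply(site_names, per_site, mc.cores = 2)  # parLapply() on Windows
all_events <- do.call(rbind, results)
```

Each worker handles one site independently, so the results can simply be row-bound at the end, just as in the serial version.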
Tags: r loops dplyr subset data-manipulation