Building a data frame of flood metric calculations with a for loop in R

【Question title】: Building data frame of flood metric calculations with a for loop in R
【Posted】: 2021-04-30 10:04:02
【Question】:

I have a dataset called all.cols2 that holds water depth at 94 locations, recorded every 20 minutes over more than 3 years. Here is a preview:

 # A tibble: 89,714 x 95
   date_time           Levee.slope      Levee.slope.1      Levee.slope.2    Levee.slope.3
   <dttm>                         <dbl>            <dbl>            <dbl>            <dbl>
 1 2015-12-01 15:05:33           -0.821           -0.539           -0.325          -0.0991
 2 2015-12-01 15:25:33           -0.830           -0.548           -0.334          -0.108 
 3 2015-12-01 15:45:33           -0.829           -0.547           -0.333          -0.107 
 4 2015-12-01 16:05:33           -0.833           -0.551           -0.337          -0.111 
 5 2015-12-01 16:25:33           -0.829           -0.547           -0.333          -0.107 
 6 2015-12-01 16:45:33           -0.834           -0.552           -0.338          -0.112 
 7 2015-12-01 17:05:33           -0.839           -0.557           -0.343          -0.117 
 8 2015-12-01 17:25:33           -0.835           -0.553           -0.339          -0.113 
 9 2015-12-01 17:45:33           -0.826           -0.544           -0.330          -0.104 
10 2015-12-01 18:05:33           -0.804           -0.522           -0.308          -0.0821
# ... with 89,704 more rows, and 90 more variables: Levee.slope.4 <dbl>,

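I am calculating metrics for the individual flood events at each location.

Right now I use the for loop below to calculate these metrics one location at a time, export the results, and copy and paste them into an Excel file, which takes a long time. This is the code I have been using: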

for (col in 1:length(list.sites)) {
  #Label and subset by site  
  site <-  paste0("WaterLevel_",noquote(list.sites[[1]][i])) 
  mut_sub <- all.cols2 %>% select("Date",all_of(site))
  
  # creates binary for positive/negative water level values 
  mut_sub$VarA <- as.integer(mut_sub[,2] > 0) 
  
  # This code is used to label flood events with unique streak_id
  mut_sub <- mut_sub %>% mutate(lagged = lag(VarA))
  mut_sub<-  mut_sub%>% mutate(start = (VarA != lagged)) 
  mut_sub[1, "start"] <- FALSE 
  #filter to keep positive water depths (VarA == 1)
  mut_sub <- mut_sub %>% mutate(streak_id = cumsum(start)) %>%
    filter(VarA == 1)
 
  #calculate mean water depth
  ls <- aggregate(mut_sub[,2], by= list(mut_sub$streak_id), FUN = mean, na.rm = TRUE) 
  
  names(ls)[2] <- "avg_water_depth" 
  
  #calculate max water depth
   MAX <- aggregate(mut_sub[,2], by = list(mut_sub$streak_id), FUN = max, na.rm = TRUE)
   
   names(MAX)[2] <- "max_depth"
  
  #getting length (# of observations) of each event
  obs <- aggregate(mut_sub[,2], by = list(mut_sub$streak_id), FUN = length)
  
  names(obs)[2] <- "observations"
  
  #calculating number of days per event (duration)
  obs <- obs %>%
    mutate(duration_days = (((observations-1)*20)/60)/24)
  
  #Time interval: 
  time <- mut_sub %>% group_by(streak_id) %>% summarise(begin = min(date_time), end = max(date_time))
  time <- time %>% rename(Group.1 = streak_id)
  
  #combine data
  results1 <- inner_join(ls, MAX)
  results2 <- inner_join(results1, obs)
  final <- inner_join(results2, time)

 #way to label sites
  final$site = paste(site, final$Group.1, sep = "_")
}

###...repeat above for each survey point, export and add manually in excel

This gives output like the following (from one site):

 Group.1 avg_water_depth   max_depth observations duration_days      begin        end                        site
      1     0.025245673 0.033995673            4    0.04166667 2016-02-09 2016-02-09  WaterLevel_Levee.slope.1_1
      3     0.045995673 0.071995673            8    0.09722222 2016-05-06 2016-05-06  WaterLevel_Levee.slope.1_3
      5     0.003995673 0.005995673            2    0.01388889 2016-05-06 2016-05-06  WaterLevel_Levee.slope.1_5
      7     0.039370673 0.061995673            8    0.09722222 2016-05-07 2016-05-07  WaterLevel_Levee.slope.1_7
      9     0.038785147 0.069995673           19    0.25000000 2016-05-27 2016-05-27  WaterLevel_Levee.slope.1_9
     11     0.063817102 0.110995673           28    0.37500000 2016-05-27 2016-05-28 WaterLevel_Levee.slope.1_11
     13     0.062817102 0.112995673           28    0.37500000 2016-05-28 2016-05-28 WaterLevel_Levee.slope.1_13
     15     0.042495673 0.067995673           18    0.23611111 2016-05-28 2016-05-28 WaterLevel_Levee.slope.1_15

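...so that each flood event at each location has a mean water depth, a maximum water depth, a number of observations, a duration, and a start and end date/time.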
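The duration_days column follows from the 20-minute sampling interval: an event with n observations spans n - 1 intervals. As a quick check, the sketch below recomputes the durations for a few of the observation counts in the table above (the vector is copied by hand from that table):

```r
# Observation counts taken from the sample output above.
observations <- c(4, 8, 2, 19, 28)

# (n - 1) intervals of 20 minutes, converted to hours and then to days.
duration_days <- (observations - 1) * 20 / 60 / 24

round(duration_days, 8)
```

The results agree with the duration_days values listed for those events.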

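Right now I have to set i myself before running the for loop; it does not step through my sites automatically.

My question is: is there a way to make the for loop go through all locations in one run and store the results in a combined output like the table above? Also, is there a way to condense the code inside the loop so that I do not have to create so many data frames?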

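【Comments】:

Here is one speedup: instead of two if_else calls, use a single all.cols2_sub$VarA <- as.integer(all.cols2_sub$Levee.slope > 0). It is much faster. But I suggest you profile your code first, see help('Rprof').
Could you try wrapping all of the above in a function and then "parallelizing" it? I am not an expert and am not sure when this is most effective, but I have had success with it in the past. rdocumentation.org/packages/parallelize.dynamic/versions/0.9-1/…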
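The flag-and-label step discussed in these comments can be tried on a small vector. The following is a minimal base R sketch, assuming a made-up depth series for a single site with no missing values; the names depth, flooded, start, streak_id and event_id are illustrative only.

```r
# Toy depth series for one hypothetical site; the values are made up.
depth <- c(-0.2, 0.1, 0.3, -0.1, -0.3, 0.2, 0.4, 0.1, -0.5)

# 1 where the site is flooded (depth > 0), 0 otherwise.
flooded <- as.integer(depth > 0)

# A new streak starts wherever the flag differs from the previous reading.
# The first reading has no predecessor, so it never counts as a start.
start <- c(FALSE, diff(flooded) != 0)

# Running count of changes: every wet or dry spell gets its own id.
streak_id <- cumsum(start)

# Keep only the flooded readings; these ids identify the flood events.
event_id <- streak_id[flooded == 1]
```

Because wet and dry spells alternate and both consume an id, the flood events end up with every second id, which is consistent with the odd Group.1 values in the sample output of the question.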

Tags: r loops dplyr subset data-manipulation

【Solution 1】:

It is hard to demonstrate without some data, but here is pseudocode using foreach; if you want to speed things up, you can use doParallel.

library(dplyr)    # bind_rows()
library(foreach)  # foreach(), %do%

# list_locations, data and one_result_df are placeholders for your own objects
results <- bind_rows(foreach(location = list_locations) %do% {
  # code handling data for one location
  # ...

  # process each column of this location
  one_location_df <- bind_rows(foreach(i_col = seq_along(data)) %do% {
    # your code handling the data of one column
    # ...

    # the last expression is the value of the iteration; it should be a
    # data frame, even if it has only one row
    one_result_df
  })

  # any additional code
  # ...
  one_location_df
})

Note: if you use doParallel, avoid wrapping one %dopar% inside another %dopar%; it leads to memory leaks and does not help.
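To make the pattern concrete, here is a minimal sketch that applies it to the flood metrics from the question. It assumes a small made-up table (toy) in place of all.cols2, a 20-minute sampling interval, and no missing depths; summarise_site, toy, site_a and site_b are hypothetical names introduced for this illustration and are not part of the original post.

```r
library(dplyr)
library(foreach)

# Toy stand-in for all.cols2: made-up depths for two hypothetical sites.
toy <- tibble(
  date_time = as.POSIXct("2016-05-06 00:00:00", tz = "UTC") + (0:8) * 20 * 60,
  site_a = c(-0.2, 0.1, 0.3, -0.1, -0.3, 0.2, 0.4, 0.1, -0.5),
  site_b = c(0.5, 0.7, -0.1, -0.2, 0.3, -0.4, -0.1, -0.2, -0.3)
)

# Hypothetical helper: one row per flood event for a single site column.
summarise_site <- function(data, site_col) {
  data %>%
    transmute(date_time, depth = .data[[site_col]]) %>%
    mutate(
      flooded = as.integer(depth > 0),
      # TRUE wherever the wet/dry flag changes; the first row never starts a streak
      start = flooded != lag(flooded, default = first(flooded)),
      streak_id = cumsum(start)
    ) %>%
    filter(flooded == 1) %>%
    group_by(streak_id) %>%
    summarise(
      avg_water_depth = mean(depth),
      max_depth = max(depth),
      observations = n(),
      begin = min(date_time),
      end = max(date_time),
      .groups = "drop"
    ) %>%
    mutate(
      # (n - 1) intervals of 20 minutes, expressed in days
      duration_days = (observations - 1) * 20 / 60 / 24,
      site = paste(site_col, streak_id, sep = "_")
    )
}

# Every column except the timestamp is treated as a site.
site_cols <- setdiff(names(toy), "date_time")

# One data frame per site, stacked into a single table.
all_events <- bind_rows(
  foreach(s = site_cols) %do% summarise_site(toy, s)
)
```

Here all_events holds one row per flood event across both sites, so nothing has to be pasted together by hand. To run the sites in parallel, one option is to register a backend with doParallel::registerDoParallel() and replace %do% with %dopar%; in that case passing .packages = "dplyr" to foreach() makes dplyr available on the workers.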

【Comments】: