脚本优化，主要是由于 for 循环 [R]答案

【问题标题】：Script optimization, primarily due to for loops [R]脚本优化，主要是由于 for 循环 [R]
【发布时间】：2020-07-17 08:13:38
【问题描述】：

我以 1000 Hz 的分辨率记录了血压和速度。在此记录期间，我确定了周期（例如 1、2、3）。主要问题是优化。我有多个长度在 10 到 15 分钟之间的录音，这导致 ~1.000.000 行。

            n    time pres mcav period
1           1   7.000   76 43.6      1
2           2   7.001   75 43.6      1     
3           3   7.002   73 43.6      1     
4           4   7.003   74 43.6      1     
5           5   7.004   74 43.5      1     
6           6   7.005   74 43.5      1     
...
898914 909995 916.994   84 60.8      3   
898915 909996 916.995   85 60.7      3   
898916 909997 916.996   85 60.7      3   
898917 909998 916.997   84 60.6      3   
898918 909999 916.998   83 60.4      3   
898919 910000 916.999   84 60.3      3

对于每个周期，我想识别块（3 秒周期）。

temp <- NULL

#For loop for every period
for(i in unique(df$period)){

 #Extract the part of the df which is within the period
 temp_df <- df[df$time >= min(df$time[df$period == i]) & df$time <= max(df$time[df$period == i]),]

 #Insert "n" starting from 1 and count from there.
 temp_df$block <- temp_df$n-min(temp_df$n)+1

 #Divide this consecutive number into 3-second blocks. 
 temp_df$block <- ceiling(temp_df$block/3000)

 #Combine the dataframes for every period into one.
 temp <- rbind(temp,temp_df[,c("n","block")])
}

这个循环实际上很快，但可以优化。应用和自制功能是否可行？

下一部分可能是问题所在。 temp-dataframe 现在将被合并：

df <- merge(df,temp,by="n",all.x=T)

这部分需要几秒钟，但它会产生这个数据框：

            n    time pres mcav period block
1           1   7.000   76 43.6      1     1
2           2   7.001   75 43.6      1     1
3           3   7.002   73 43.6      1     1
4           4   7.003   74 43.6      1     1
5           5   7.004   74 43.5      1     1
6           6   7.005   74 43.5      1     1
...
898914 909995 916.994   84 60.8      3   100
898915 909996 916.995   85 60.7      3   100
898916 909997 916.996   85 60.7      3   100
898917 909998 916.997   84 60.6      3   100
898918 909999 916.998   83 60.4      3   100
898919 910000 916.999   84 60.3      3   100

【问题讨论】：

您可以使用dplyr 或特别是data.table 进行很多优化。您可以dput()您的示例数据而不是发布打印输出以使其可重现吗？
我真的更喜欢使用 base R，但如果优化仅适用于其他包，我接受。 1 mio 行，无法导出到dput()
但是您仍然可以通过选择几行然后dput 使其具有最低限度的可重复性，对吗？

标签： r for-loop optimization

【解决方案1】：

我想你可以试试ave，如下所示

df <- within(df,block <- ave(n,period,FUN = function(x) ceiling(x/3000)))

【讨论】：

我喜欢这个想法，我会尝试实现它。我以前没有用过ave，但是这个mig对我有帮助。

【解决方案2】：

是的，合并/绑定很慢。所以让我们完全避免它们。

# Generate some test data
library(tidyverse)
df <- tibble(time=seq(7, 918, 0.001), period=rep(1:3, each=303667), pres=rnorm(911001), mcav=rnorm(911001))

# Get the period start times
periodStart <- df %>% 
                  group_by(period) %>% 
                  slice_min(time) %>% 
                  select(period, time) %>% 
                  rename(baseTime=time)
# Merge the period start times with the original dataset and derive the block definitions
# (The +0.0000001 is to make sure the first obs in each block is handled correctly.  
# Any value will do so long as it's less than your sampling frequency)
answer <- df %>% 
             left_join(periodStart, by="period") %>% 
             mutate(block=ceiling((time-baseTime + 0.0000001)/3))
answer
# A tibble: 911,001 x 6
    time period    pres   mcav baseTime block
   <dbl>  <int>   <dbl>  <dbl>    <dbl> <dbl>
 1  7         1 -0.929   0.571        7     1
 2  7.00      1 -1.47   -0.304        7     1
 3  7.00      1  0.0150  0.193        7     1
 4  7.00      1 -1.12   -0.595        7     1
 5  7.00      1  0.677  -0.571        7     1
 6  7.00      1  1.07    2.31         7     1
 7  7.01      1 -0.606  -1.16         7     1
 8  7.01      1  1.51    0.170        7     1
 9  7.01      1 -0.718  -1.76         7     1
10  7.01      1 -0.785  -0.606        7     1
# … with 910,991 more rows

现在让我们看看它有多快

library(microbenchmark)
microbenchmark({
  periodStart <- df %>% group_by(period) %>% slice_min(time) %>% select(period, time) %>% rename(baseTime=time)
  answer <- df %>% left_join(periodStart, by="period") %>% mutate(block=ceiling((time-baseTime + 0.0000001)/3))
})
      min       lq     mean   median       uq      max neval
 334.9331 353.9955 424.4326 364.4362 548.3301 599.6329   100

所以平均只有 0.4 秒多一点。够快吗？

【讨论】：

我对此进行了测试，速度要快得多，但问题是我更喜欢 base R。我会回复你的。谢谢！