How to vectorize nested loops and update a data frame: answers

【Question title】: How to vectorize nested loops and update a dataframe
【Posted】: 2020-12-19 14:59:21
【Question】:

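I have a data frame with a column named Product (containing many products), a column named Timestamp (the date, encoded as a discrete ordinal variable), and a column named Rating.
I am trying to compute a moving average and a moving standard deviation of Rating for each product, taking the timestamp into account.

The data looks like this: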

DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
             Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
             Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))

Now I add columns for the moving average and the moving standard deviation:

DF$Moving.avg <- rep(0,nrow(DF))
DF$Moving.sd <- rep(0,nrow(DF))

Finally, I use this code with nested for loops to get the result I want:

for (product in unique(DF$Product)) {
  for (timestamp in DF[DF$Product==product,]$Timestamp){
    if (timestamp==1) {
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <- 
        DF[DF$Product==product &
             DF$Timestamp==timestamp,]$Rating
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <- 0
    }else{
      index_start <- which(DF$Product==product &
                             DF$Timestamp==1)
      index_end <- which(DF$Product==product &
                           DF$Timestamp==timestamp)
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.avg <- 
        mean(DF[index_start:index_end,]$Rating)
  
      DF[DF$Product==product &
           DF$Timestamp==timestamp,]$Moving.sd <- 
        sd(DF[index_start:index_end,]$Rating)
    }
  }
}

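The code works, but it is far too slow. How can I use vectorization to speed it up?

【Comments】:

Tags: r loops vectorization

【Solution 1】:

If you want to do this in base R without the explicit nested loops, you could try: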

DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
             Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
             Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))

cbind(DF, do.call(rbind, lapply(split(DF, DF$Product), function(x) {
  do.call(rbind, lapply(seq(nrow(x)), function(y) {
    c(Moving.avg = mean(x$Rating[1:y]), Moving.sd = sd(x$Rating[1:y]))}))})))

#>    Product Timestamp Rating Moving.avg Moving.sd
#> 1        a         1      4   4.000000        NA
#> 2        a         2      3   3.500000 0.7071068
#> 3        a         3      5   4.000000 1.0000000
#> 4        a         4      3   3.750000 0.9574271
#> 5        b         1      3   3.000000        NA
#> 6        b         2      4   3.500000 0.7071068
#> 7        b         3      5   4.000000 1.0000000
#> 8        c         1      3   3.000000        NA
#> 9        c         2      1   2.000000 1.4142136
#> 10       c         3      1   1.666667 1.1547005
#> 11       c         4      2   1.750000 0.9574271
#> 12       c         5      5   2.400000 1.6733201

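Note that the sd of a single number is NA rather than 0. If you need zeros, it is simple to replace them with DF$Moving.sd[is.na(DF$Moving.sd)] <- 0 (after assigning the result above back to DF, since the code only prints it).

Created on 2020-08-31 by the reprex package (v0.3.0)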
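The lapply() version above still recomputes mean() and sd() over a growing slice for every row. As a rough sketch of a fully vectorized base R alternative (not part of the original answer), the cumulative mean and standard deviation can be derived from running sums within each product. The helper names n, s1, s2 and run_var are made up for this illustration, and the first row of each product is set to 0 to match the behaviour asked for in the question.

```r
# Example data from the question
DF <- data.frame(Product = c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp = c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating = c(4,3,5,3,3,4,5,3,1,1,2,5))

# Sort so that row order within each product follows Timestamp
DF <- DF[order(DF$Product, DF$Timestamp), ]

# Running count, running sum and running sum of squares within each product
n  <- ave(DF$Rating, DF$Product, FUN = seq_along)
s1 <- ave(DF$Rating, DF$Product, FUN = cumsum)
s2 <- ave(DF$Rating^2, DF$Product, FUN = cumsum)

DF$Moving.avg <- s1 / n

# Sample variance from the running sums; pmax() guards against tiny
# negative values caused by floating-point rounding. For n == 1 this is 0/0 (NaN).
run_var <- pmax(s2 - s1^2 / n, 0) / (n - 1)

# Use 0 for the first observation of each product, as in the question
DF$Moving.sd <- ifelse(n == 1, 0, sqrt(run_var))
```

The sum-of-squares formula can lose precision when the ratings are large relative to their spread, so for such data a per-row sd() call may be the safer choice.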

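【Comments】:

Thanks for your answer, Allan. I have a question about your code. You do not use the Timestamp variable when computing the moving average for each row in your script. How does R know which order to use for the calculation?

【Solution 2】:

I think you are looking for a cumulative mean and a cumulative standard deviation.

For the cumulative mean you can use dplyr's cummean function, and TTR::runSD for the cumulative standard deviation.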

library(dplyr)

DF %>%
  group_by(Product) %>%
  mutate(cum_avg = cummean(Rating), 
         cum_std = TTR::runSD(Rating, n = 1, cumulative = TRUE))

#  Product Timestamp Rating cum_avg cum_std
#   <chr>       <dbl>  <dbl>   <dbl>   <dbl>
# 1 a               1      4    4    NaN    
# 2 a               2      3    3.5    0.707
# 3 a               3      5    4      1    
# 4 a               4      3    3.75   0.957
# 5 b               1      3    3    NaN    
# 6 b               2      4    3.5    0.707
# 7 b               3      5    4      1    
# 8 c               1      3    3    NaN    
# 9 c               2      1    2      1.41 
#10 c               3      1    1.67   1.15 
#11 c               4      2    1.75   0.957
#12 c               5      5    2.4    1.67

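【Comments】:

Thanks for your answer, Ronak. I have the same question I asked Allan, this time about your simpler code. You do not use the Timestamp variable when computing the moving average for each row in your script. How does R know which order to use for the calculation?
The calculation follows the row order of the data. If Timestamp is not always in order, you can sort the data first. In dplyr you can do that with arrange, e.g. DF %>% arrange(Timestamp) %>% group_by(Product) ... with the rest of the code as it is.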
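To make the ordering point concrete, here is a small base R sketch (base R rather than dplyr, so it runs without extra packages). The shuffled data frame and the cum_mean helper are invented for illustration; the ratings are those of product "a" from the question, stored out of time order.

```r
# Hypothetical input: product "a" from the question, rows out of time order
shuffled <- data.frame(Product = c("a", "a", "a", "a"),
                       Timestamp = c(3, 1, 4, 2),
                       Rating = c(5, 4, 3, 3))

# Cumulative mean of a vector, in the order the values are given
cum_mean <- function(x) cumsum(x) / seq_along(x)

# Accumulating in stored row order ignores Timestamp entirely
unsorted <- ave(shuffled$Rating, shuffled$Product, FUN = cum_mean)

# Sorting by product and timestamp first gives the time-ordered result
sorted_df <- shuffled[order(shuffled$Product, shuffled$Timestamp), ]
sorted_df$cum_avg <- ave(sorted_df$Rating, sorted_df$Product, FUN = cum_mean)
```

Here unsorted starts at 5 (the rating stored in the first row), while sorted_df$cum_avg starts at 4 (the rating at Timestamp 1), which is the value the original nested loop would produce.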

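【Solution 3】:

Does this example work for you? Here I use the function runner() from the runner package. runner() applies a function over a window that you define, and it works with dplyr's group_by(). You set the window size with the k argument.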

library(runner)
library(dplyr)
library(magrittr)

DF <- data.frame(Product=c("a","a","a","a","b","b","b","c","c","c","c","c"),
                 Timestamp=c(1,2,3,4,1,2,3,1,2,3,4,5),
                 Rating=c(4,3,5,3,3,4,5,3,1,1,2,5))


DF <- DF %>% 
  group_by(Product) %>% 
  arrange(Timestamp, .by_group = TRUE)


DF <- DF %>% 
  mutate(
    average = runner(Rating, f = function(x) mean(x), k = 3),
    deviation = runner(Rating, f = function(x) sd(x), k = 3)
  )

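It is worth mentioning that for the first rows of each group (each product), the function grows the window until it reaches the size given in the k argument. So in the first two rows of a group, where there are not yet 3 values available, runner() applies the function to just the values seen so far.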
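The window behaviour described above can be imitated in base R. The sketch below is only an illustration of the idea, not how the runner package is implemented; window_apply is a hypothetical helper name.

```r
# Hypothetical helper: apply f to the last k values of x,
# using fewer values while fewer than k are available
window_apply <- function(x, k, f) {
  vapply(seq_along(x),
         function(i) f(x[max(1, i - k + 1):i]),
         numeric(1))
}

rating_a <- c(4, 3, 5, 3)  # ratings of product "a" from the question

avg_a <- window_apply(rating_a, k = 3, f = mean)
sd_a  <- window_apply(rating_a, k = 3, f = sd)
```

With k = 3 the first three values of avg_a match the cumulative mean (4, 3.5, 4), but the fourth uses only the last three ratings (3, 5, 3). If the goal is the cumulative statistic from the question, the window has to cover every earlier row of the group.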

【Comments】:

【Solution 4】:

Building on this answer to a related question, you can also do it with dplyr:

library(dplyr)
library(purrr)  # provides map_dbl()

DF <- DF %>% 
  # Sort by product, then by timestamp within each product
  arrange(Product, Timestamp) %>% 
  # Group the data by product
  group_by(Product) %>% 
  # Use the cumulative mean function to calculate the means
  mutate(Moving.avg = cummean(Rating), 
    # Use map_dbl to calculate the standard deviation up to each row index
    Moving.sd = map_dbl(seq_along(Timestamp), ~sd(Rating[1:.x])), 
    # Set Moving.sd to 0 where Timestamp takes its smallest value in the group
    Moving.sd = case_when(Timestamp == min(Timestamp) ~ 0, 
                        TRUE ~ Moving.sd)) %>%
  # Ungroup the data
  ungroup()

【Comments】: