R: pass multiple arguments to accumulate/reduce

【Question title】: R: pass multiple arguments to accumulate/reduce
【Posted】: 2021-07-13 05:39:45
【Question description】:

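This is related to R: use the newly generated data in the previous row

I realized that the actual problem I am facing is a bit more complicated than the example I gave in the thread above: it seems I have to pass 3 arguments to the recursive calculation to achieve what I want. Therefore, accumulate2 or reduce may not work. So I am opening a new question here to avoid possible confusion.

I have the following dataset, grouped by ID: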

ID <- c(1, 2, 2, 3, 3, 3)
pw <- c(1:6)
add <- c(1, 2, 3, 5, 7, 8)
x <- c(1, 2, NA, 4, NA, NA)
df <- data.frame(ID, pw, add, x)

df
  ID pw add  x
1  1  1   1  1
2  2  2   2  2
3  2  3   3 NA
4  3  4   5  4
5  3  5   7 NA
6  3  6   8 NA

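Within each group in column x, I want to keep the value of the first row as it is, and fill each remaining row with the lagged value of x raised to the power stored in pw, plus the value in add. The lagged value should be updated as I go along. So I would like to have: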

  ID pw add  x
1  1  1   1  1
2  2  2   2  2
3  2  3   3 2^3 + 3
4  3  4   5  4
5  3  5   7 4^5 + 7
6  3  6   8 (4^5 + 7)^6 + 8

I have to apply this calculation to a large dataset, so a fast way to do this would be perfect!

【Question comments】:

do.call(rbind, Reduce(function(x, y)if(is.na(y$x))modifyList(x, list(x=x$x^y$pw+y$add)) else y,split(df, seq(nrow(df))), accumulate = TRUE))
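Though akrun has shown how to solve it correctly here, I have still, on one occasion, passed more than 2 arguments to accumulate. See this question if you want to know how. It can also be solved with a for loop, though.
@AnilGoyal Thanks a lot! This is very helpful.

Tags: r iteration rolling-computation accumulate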

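【Solution 1】:

In base R, we can use the following solution for more than two arguments.

In this solution, I first subset the original dataset on each ID value
Then I choose the row indices within that ID through seq_len(nrow(tmp))[-1], omitting the first row of each ID, since it is supplied by init
In the anonymous function used in Reduce, the b argument represents the accumulated/previous value, starting from init, and c represents the new/current value of the vector, i.e. the row number
So in every iteration, the previous value (starting from init) is raised to the power given by the new value of pw, and then the new value of add is added to it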

cbind(df[-length(df)], unlist(lapply(unique(df$ID), function(a) {
  tmp <- subset(df, df$ID == a)
  Reduce(function(b, c) {
    b ^ tmp$pw[c] + tmp$add[c]
  }, init = tmp$x[1],
  seq_len(nrow(tmp))[-1], accumulate = TRUE)
}))) |> setNames(c(names(df)))

  ID pw add            x
1  1  1   1 1.000000e+00
2  2  2   2 2.000000e+00
3  2  3   3 1.100000e+01
4  3  4   5 4.000000e+00
5  3  5   7 1.031000e+03
6  3  6   8 1.201025e+18

Data

structure(list(ID = c(1, 2, 2, 3, 3, 3), pw = 1:6, add = c(1, 
2, 3, 5, 7, 8), x = c(1, 2, NA, 4, NA, NA)), class = "data.frame", row.names = c(NA, 
-6L))
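
To make the roles of b and c more concrete, here is a minimal sketch that runs the same kind of Reduce call on the ID == 3 rows only and reports each step. The tmp values are copied from the data above; the message() call is added purely for illustration and is not part of the answer's code.

```r
# Rows of the ID == 3 group, copied from the example data
tmp <- data.frame(pw = 4:6, add = c(5, 7, 8), x = c(4, NA, NA))

steps <- Reduce(function(b, c) {
  # b: value accumulated so far; c: index of the row being processed
  message("row ", c, ": ", b, "^", tmp$pw[c], " + ", tmp$add[c])
  b^tmp$pw[c] + tmp$add[c]
}, seq_len(nrow(tmp))[-1], accumulate = TRUE, init = tmp$x[1])

# steps[1] is the init value 4; steps[2] is 4^5 + 7 = 1031
steps
```

Because init is supplied, the result has one more element than the index vector, which is why the first row of each group is dropped from the indices.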

【Comments】:

Great and elegant. +1 already

【Solution 2】:

Base R, without Reduce(), using a while() loop instead:

# Split-apply-combine while loop: res => data.frame
res <- do.call(rbind, lapply(with(df, split(df, ID)), function(y){
  # While there are any NAs in x:
  while(any(is.na(y$x))){
    # Store the index of the first NA value: idx => integer scalar
    idx <- with(y, head(which(is.na(x)), 1))
    # Calculate x at that index using the business rule provided:
    # x => numeric vector
    y$x[idx] <- with(y, x[(idx-1)] ** pw[idx] + add[idx])
  }
  # Explicitly return the updated group: y => data.frame
  y
}))

Or a recursive function:

# Recursive function: estimation_func => function() 
estimation_func <- function(value_vec, exponent_vec, add_vec){
  # Specify the termination condition; when all elements 
  # of value_vec are no longer NA:
  if(all(!(is.na(value_vec)))){
    # Return value_vec: numeric vector => GlobalEnv
    return(value_vec)
  # Otherwise recursively apply the below: 
  }else{
    # Store the index of the first na value: idx => integer vector
    idx <- Position(is.na, value_vec)
    # Calculate the value of the value_vec at that index; 
    # using the provided business logic: value_vec => numeric vector
    value_vec[idx] <- (value_vec[(idx-1)] ** exponent_vec[idx]) + add_vec[idx]
    # Recursively apply function: function => Local Env
    return(estimation_func(value_vec, exponent_vec, add_vec))
  }
}

# Split data.frame into a list on ID; 
# Overwrite x values, applying recursive function;
# Combine list into a data.frame
# res => data.frame
res <- data.frame( 
  do.call(
    rbind, 
    Map(function(y){y$x <- estimation_func(y$x, y$pw, y$add); y}, split(df, df$ID))
  ), row.names = NULL
)

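【Comments】:

【Solution 3】:

If we want to use accumulate2, the arguments have to be specified correctly: it takes two input arguments, 'pw' and 'add', plus an initialization argument, which is the first value of 'x'. Because the data is grouped by 'ID', do the grouping before calling accumulate2. Inside the lambda, the default arguments ..1, ..2 and ..3 are, in order, the accumulated value, the current element of 'pw', and the current element of 'add'; the recursive function is built from these.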

library(dplyr)
library(purrr)
out <- df %>%
   group_by(ID) %>% 
   mutate(x1 = accumulate2(pw[-1], add[-1], ~  ..1^..2 + ..3, 
             .init = first(x)) %>%
                flatten_dbl ) %>%
   ungroup

out$x1
#[1]    1                   2                  11   
#[4]    4                1031 1201024845477409792
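
The mapping between ..1, ..2, ..3 and the inputs can be easier to see with named arguments. The following is a minimal sketch on the ID == 3 group only, assuming purrr is installed; the argument names acc, p and a are illustrative and do not come from the answer.

```r
library(purrr)

# pw[-1] and add[-1] for the ID == 3 group; .init plays the role of first(x)
res <- accumulate2(
  c(5, 6), c(7, 8),
  function(acc, p, a) acc^p + a,  # acc = ..1, p = ..2, a = ..3
  .init = 4
)

# accumulate2() returns a list, so flatten it to a numeric vector
unlist(res)
```

With .init set, both input vectors have the same length, and the result has one extra leading element holding the initial value.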

If there are more than 3 arguments, a for loop would be better

# initialize an empty vector
out <- c()
# loop over the unique IDs
for(id in unique(df$ID)) {
    # create a temporary subset of data based on that id
    tmp_df <- subset(df, ID == id)
    # initialize a temporary storage output
    tmp_out <- numeric(nrow(tmp_df))
    # initialize first value with the first element of x
    tmp_out[1] <- tmp_df$x[1]
    # if the number of rows is greater than 1
    if(nrow(tmp_df) > 1) {
        # loop over the rows
        for(i in 2:nrow(tmp_df)) {
            # do the recursive calculation and update
            tmp_out[i] <- tmp_out[i - 1]^tmp_df$pw[i] + tmp_df$add[i]
        }
    }
    # append this group's results to the overall output
    out <- c(out, tmp_out)
}

out
#[1] 1                   2                  11     
#[4] 4                1031 1201024845477409792
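
Since the question mentions a large dataset, one possible variation is a single pass over all rows that skips the per-ID subsetting and avoids growing out with c(). This is only a sketch under two assumptions that hold in the example data but are not guaranteed in general: the rows of each ID are contiguous, and the first row of every ID has a non-NA x. The helper name fill_x is hypothetical.

```r
# Example data from the question
df <- data.frame(
  ID  = c(1, 2, 2, 3, 3, 3),
  pw  = 1:6,
  add = c(1, 2, 3, 5, 7, 8),
  x   = c(1, 2, NA, 4, NA, NA)
)

# fill_x() is a hypothetical helper name, not part of any package
fill_x <- function(df) {
  x <- df$x
  # Assumption: the very first row has a value to start from
  stopifnot(length(x) > 0, !is.na(x[1]))
  for (i in seq_along(x)) {
    # An NA marks a row to be computed from the previous (already filled) row
    if (is.na(x[i])) {
      x[i] <- x[i - 1]^df$pw[i] + df$add[i]
    }
  }
  x
}

df$x <- fill_x(df)
df$x
```

Because a non-NA x marks the start of each group, ID itself is never consulted; if a group could start with NA, the per-ID loop above would be the safer choice.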

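【Comments】:

@AnoushiravanR It is because of the way the calculation is carried out. If you check my for loop, tmp_out[i] <- tmp_out[i - 1]^ tmp_df$pw[i] + tmp_df$add[i], it raises the previous value of 'x' to a power, and by that point the first value of 'x' has already been initialized. The recursion therefore starts from the second value
@AnoushiravanR It is a bit tricky to understand the concept with accumulate. I would use a for loop for the recursion, as it is easier to understand and flexible enough to handle n inputs
I understand, as I have only just started learning how to use it for this kind of problem. Thanks for the explanation, I think I have to read the documentation first.
@AnoushiravanR I think in coding it is better to run into errors, because that pushes you to understand the cause, and once you have made a mistake and corrected it you may never forget it
I totally agree with you. Especially when you cannot answer a question, you can look at other contributors' code and learn from it. Sometimes, no matter how much time I spend thinking, the solution just does not come to mind.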