如何将函数应用于多个数据框答案

【问题标题】：How to apply function to multiple data frames如何将函数应用于多个数据框
【发布时间】：2020-09-04 14:48:26
【问题描述】：

我正在尝试使用包roll 中的roll_lm 函数编写一个在数据框中的多个列上运行滚动回归的函数。然后我想将此函数应用于具有相同格式但不同 x 值的多个数据帧。我正在使用 x = RunTime 进行回归，其余列是单独的 y 值。每个数据帧的 RunTime 值都不同。以下是到目前为止我在数据帧 Sum_9.18 上运行所需函数的内容，其中我有许多其他类似的数据帧。

fun1 <- function(y) {
roll_lm(Sum_9.18$RunTime, y, width = 100)
}
test1 <- data.frame(lapply(Sum_9.18[2:11], fun1))

【问题讨论】：

标签： r function apply

【解决方案1】：

我将给出一个使用lm() 模型的解决方案，因为我没有安装roll 包。

首先，我假设您能够将数据帧存储在列表中。在这种情况下，您可以使用 lapply() 对要应用 fun1() 函数的每个数据框进行迭代。

因此，从概念上讲，我们将有两个由lapply() 调用的不同函数，一个在外层上迭代数据帧，一个在内部级别迭代分析的数据帧的列。内部函数将是您的 fun1() 函数，但为了更清楚起见，我将其称为 fun_inner()。

因此，假设您有一个名为 list_of_dfs 的数据框列表，那么以下函数将允许您执行所需的操作（您只需调整 fun_inner() 函数以调用 roll_lm() 函数）：

#' Fits a predictive model using each column of a data frame as target variable
#'
#' @param df data frame with the data to fit.
#' @param target_columns name or index of the column(s) in data frame \code{df}
#' to be used as target variables, one at a time.
#' @param pred_columns name or index of the column(s) in data frame \code{df}
#' to be used as predictor variables.
#' @param fun_model function to be called that fits the model \code{y ~ X}
#' where \code{y} is each variable taken from the \code{target_columns} of the input data frame
#' and \code{X} is the set of predictor variables taken from the \code{pred_columns}
#' of the input data frame.
#' 
#' @return a list containing the result of the model fit to each target variable.
fun_outer_iterate_on_dfs <- function(df, target_columns, pred_columns, fun_model) {
  lapply(df[, target_columns, drop=FALSE], fun_model, as.matrix(df[, pred_columns, drop=FALSE]))
}

#' Fits an lm( y ~ X ) model. \code{y} is a numeric vector and \code{X} is a matrix.
fun_inner_fit_model <- function(y, X) {
  lm( y ~ X )
}

这是一个用例示例：

set.seed(1717)
nobs = 10
# List containing the data frames to be used in the model fits
list_of_dfs[[1]] = data.frame(RunTime=rnorm(nobs), y1=rnorm(nobs), y2=rnorm(nobs))
list_of_dfs[[2]] = data.frame(RunTime=rnorm(nobs), y1=rnorm(nobs), y2=rnorm(nobs))
list_of_dfs[[3]] = data.frame(RunTime=rnorm(nobs), y1=rnorm(nobs), y2=rnorm(nobs))

test_models_on_each_df <- lapply(list_of_dfs, fun_outer_iterate_on_dfs, c("y1", "y2"),
"RunTime", fun_inner_fit_model)

注意我们如何将更多参数（除了第一个参数）传递给lapply() 调用的函数，只需将它们列在函数名之后（在本例中是在fun_outer_iterate_on_dfs 之后）。

传递给fun_outer_iterate_on_dfs() 函数的列名也可以是列索引。

上面的代码给出了类似的内容：

[[1]]
[[1]]$y1

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)            X  
   -0.05994     -0.11727  

[[1]]$y2

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)            X  
    0.02854     -0.08574  

[[2]]
[[2]]$y1

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)            X  
   -0.23479     -0.01973  

[[2]]$y2

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)            X  
    0.07248     -0.33088  

[[3]]
[[3]]$y1

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)            X  
    -0.3087      -0.1191  

[[3]]$y2

Call:
lm(formula = y ~ X)

Coefficients:
(Intercept)            X  
     0.1765       0.5085

我们看到两个回归适合三个数据帧中的每一个，一个在目标变量 y1 上，一个在目标变量 y2 上。

最后，如果您已经将数据框存储为不同的对象，您可以使用以下代码 sn-p 将数据框存储在列表中，假设所有数据框名称都跟在后面模式Sum_* 并且工作区中定义的所有具有此模式的对象都是感兴趣的数据框：

#' Stores objects in the workspace whose name satisfies a given pattern in a list
#' 
#' @param pattern regular expression to be satisfied by the name of the object to store in the list.
#' @return a named list whose elements are the objects found in the parent environment satisfying the given pattern.
#' The name of each element is the name of the object.
store_objects_in_list <- function(pattern) {
  object_names = ls(pattern=pattern, envir=parent.frame())
  list_with_objects = lapply(object_names, get)
  names(list_with_objects) = object_names
  return(list_with_objects)
}
list_of_dfs <- store_objects_in_list("^Sum_")

【讨论】：