Forcing the use of a for loop with group_by and mutate(): answers

【Question title】: Forcing the use of a for loop with group_by and mutate()
【Posted】: 2018-08-31 19:17:47
【Question description】:

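I have a list of data frames (generated from the column permutations of an initial data frame) to which I would like to apply a complex calculation using group_by_at() and mutate(). It works on a single data frame but fails inside a for loop, because mutate and some of my calculations need the name of the data frame. So I thought: fine, let's create a list of the different data frames that all share the same name, and loop over the initial sequence of names. Unfortunately the trick does not work and I get the following message: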

Error: object of type 'closure' is not subsettable.

Here is a self-contained example showing all my steps. I think the problem comes from mutate. So, how can I force the use of a for loop with mutate?

data <- read.table(text = 'obs  gender   ageclass    weight   year   subdata   income    
                        1     F         1         10     yearA     sub1   1000   
                        2     M         2         25     yearA     sub1   1200   
                        3     M         2          5     yearB     sub2   1400   
                        4     M         1         11     yearB     sub1   1350',
 header = TRUE)  


library(dplyr)
library(GiniWegNeg)

dataA <- select(data, gender, ageclass)
dataB <- select(data, -gender, -ageclass)
rm(data)

# Generate permutation of indexes based on the number of column in dataA
library(combinat)
index <- permn(ncol(dataA))

# Attach dataA to the previous list of index           
res <- lapply(index, function(x) dataA[x])

# name my list keeping track of permutation order in dataframe name
names(res) <- unlist(lapply(res,function(x) sprintf('data%s',paste0(toupper(substr(colnames(x),1,1)),collapse = ''))))

# Create a list containing the name of each data.frame name
NameList <- unlist(lapply(res,function(x) sprintf('data%s',paste0(toupper(substr(colnames(x),1,1)),collapse = ''))))

# Define as N the number of columns/permutation/dataframes
N <- length(res)

# Merge res and dataB for all permutation of dataframes
res <- lapply(res,function(x) cbind(x,dataB))

# Change the name of res so that all data frames are named data
names(res) <- rep("data", N)


# APPLY FOR LOOP TO ALL DATAFRAMES

for (j in NameList){

runCalc <- function(data, y){ 

  data <- data %>% 
    group_by_at(1) %>% 
    mutate(Income_1 = weighted.mean(income, weight))
  data <- data %>% 
    group_by_at(2) %>% 
    mutate(Income_2 = weighted.mean(income, weight))      

  gini <- c(Gini_RSV(data$Income_1, data$weight), Gini_RSV(data$Income_2,data$weight))

  Gini <- data.frame(gini)
  colnames(Gini) <- c("Income_1","Income_2")
  rownames(Gini) <- c(paste0("Gini_", y))

  return(Gini)
}

runOtherCalc <- function(df, y){
  Contrib <- (1/5) * df$Income_1 + df$Income_2
  Contrib <- data.frame(Contrib)
  colnames(Contrib) <- c("myresult")
  rownames(Contrib) <- c(paste0("Contrib_", y))

  return(Contrib)
}

# Run runCalc over dataframe data by year

df1_List <- lapply(unique(data$year), function(i) {      
  byperiod <- subset(data, year == i)
  runCalc(byperiod, i)      
})

# runCalc returns df which then passes to runOtherCalc, again by year

df1_OtherList <- lapply(unique(data$year), function(i) {
  byperiod <- subset(data, year == i)
  df <- runCalc(byperiod, i) 
  runOtherCalc(df, i)      
})

# Run runCalc over dataframe data by subdata

df2_List <- lapply(unique(data$subdata), function(i) {      
  bysubdata <- subset(data, subdata == i)
  runCalc(bysubdata, i)      
})

# runCalc returns df which then passes to runOtherCalc, again by subdata

df2_OtherList <- lapply(unique(data$subdata), function(i) {
  bysubdata <- subset(data, subdata == i)
  df <- runCalc(bysubdata, i) 
  runOtherCalc(df, i)      
})


# Return all results in separate frames, then append by row in 2 frames

Gini_df1 <- do.call(rbind, df1_List)
Contrib_df1 <- do.call(rbind,df1_OtherList)
Gini_df2 <- do.call(rbind, df2_List)
Contrib_df2 <- do.call(rbind, df2_OtherList)

Gini <- rbind(Gini_df1, Gini_df2)
Contrib <- rbind(Contrib_df1, Contrib_df2)


}

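【Comments】:

What you are missing in your dplyr pipeline is purrr::map. While waiting for potential answers to be posted below, may I suggest you watch this video by Hadley, which explains exactly how to solve this problem: youtube.com/watch?v=rz3_FDVt9eg Slides here: speakerdeck.com/hadley/managing-many-models
I did (and got stuck on the cupcake example). purrr::map() applies a function to each element of a list. I did not know about this function. For my particular example, unfortunately, I do not quite see how to write it. As a newbie, I thought of data <- map(datalist), where datalist is my list of data frames, but I do not understand how to get the results back.
You never use the j variable inside your for loop.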
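
The two comments above can be illustrated with a small sketch. It assumes a toy named list standing in for the question's res (the names dataGA and dataAG and all values are made up for illustration), and it replaces the real calculation with a plain weighted.mean(). The for loop uses j to pull one data frame out of the list by name, and the map call does the same work without an explicit loop. If purrr is not installed, the sketch falls back to base lapply(), which behaves the same way here.

```r
# Toy stand-in for the question's `res`: a named list of data frames.
# Names and values are illustrative assumptions, not the real data.
res <- list(
  dataGA = data.frame(gender = c("F", "M"),
                      income = c(1000, 1200), weight = c(10, 25)),
  dataAG = data.frame(ageclass = c(1, 2),
                      income = c(1400, 1350), weight = c(5, 11))
)

# for loop that actually uses the loop variable j:
# j is a name, res[[j]] is the matching data frame.
out_loop <- vector("list", length(res))
names(out_loop) <- names(res)
for (j in names(res)) {
  df <- res[[j]]
  out_loop[[j]] <- weighted.mean(df$income, df$weight)
}

# Same computation with map: the function receives each data frame in turn
# and the results come back as a list with the same names.
# Fallback to lapply() is only there so the sketch runs without purrr.
map_fn <- if (requireNamespace("purrr", quietly = TRUE)) purrr::map else lapply
out_map <- map_fn(res, function(df) weighted.mean(df$income, df$weight))
```

Note that looking up res[[j]] by name only works while the list elements keep distinct names, so the step that renames every element to "data" would have to be dropped for this pattern.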

Tags: r list for-loop dataframe dplyr

【Solution 1】:

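Admittedly, the R error you receive below is a bit cryptic, but it usually means you are running an operation on an object that does not exist. When no object of that name exists, R finds the built-in function data() instead, and a function (a "closure") cannot be subset with $.

Error: object of type 'closure' is not subsettable.

Specifically, it is raised by your lapply call, because data is not defined anywhere globally (only inside the runCalc function), and, as shown above, you remove it with rm(data).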

dfList <- lapply(unique(data$year), function(i) {      
  byperiod <- subset(data, year == i)
  runCalc(byperiod, i)      
})
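
As a minimal sketch of where the message comes from: once no user object called data exists, the name resolves to the function utils::data(). The snippet below assumes nothing beyond base R. It takes that function explicitly, so the result does not depend on what is in the workspace, and captures the error message instead of stopping.

```r
# utils::data is a function (a "closure" in R terms)
f <- utils::data
is.function(f)

# Subsetting a function with $ fails; capture the message instead of stopping
msg <- tryCatch(f$year, error = function(e) conditionMessage(e))
print(msg)

# With a real data frame bound to the name, the same expression works
data <- data.frame(year = c("yearA", "yearB"))
years <- unique(data$year)
```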

By the way, the lapply...unique...subset pattern can be replaced with the underused base R grouping function by().
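
A small sketch of that equivalence, assuming a toy data frame and a hypothetical helper wmean() in place of the real calculation: both forms split the rows by year and apply the same function to each piece.

```r
# Toy data; wmean() is a hypothetical stand-in for the real per-group function
df <- data.frame(
  year   = c("yearA", "yearA", "yearB", "yearB"),
  income = c(1000, 1200, 1400, 1350),
  weight = c(10, 25, 5, 11)
)
wmean <- function(d) weighted.mean(d$income, d$weight)

# Pattern from the question: loop over unique keys and subset each time
via_lapply <- lapply(unique(df$year), function(i) wmean(subset(df, year == i)))

# Same grouping with by(): data, grouping vector, function
via_by <- by(df, df$year, wmean)
```

by() orders the groups by the sorted levels of the grouping vector, whereas unique() keeps order of first appearance, so the two results line up here only because the years already appear in sorted order.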

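Gathering from your text and code, I believe you intend to run a by-year grouping on every data frame in the list res. So consider two by calls wrapped in a larger function that receives a data frame, df, as its argument. Then run lapply over all list items to return a new list of nested pairs of data frames.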

# SECONDARY FUNCTIONS
runCalc <- function(data) {                                    
  data <- data %>% 
    group_by_at(1) %>% 
    mutate(Income_1 = weighted.mean(income, weight))
  data <- data %>% 
    group_by_at(2) %>% 
    mutate(Income_2 = weighted.mean(income, weight))      

  Gini <- data.frame(
              year = data$year[[1]],
              Income_1 = unname(Gini_RSV(data$Income_1, data$weight)), 
              Income_2 = unname(Gini_RSV(data$Income_2, data$weight)),
              row.names = paste0("Gini_", data$year[[1]])
          )

  return(Gini)
}

runOtherCalc <- function(df){
  Contrib <- data.frame(
                 myresult = (1/5) * df$Income_1 + df$Income_2,
                 row.names = paste0("Contrib_", df$year[[1]])
             )
  return(Contrib)
}

# PRIMARY FUNCTION
runDfOperations <- function(df) {   
  gList <- by(df, df$year, runCalc)     
  gTmp <- do.call(rbind, gList)

  cList <- by(gTmp, gTmp$year, runOtherCalc)
  cTmp <- do.call(rbind, cList)

  gTmp$year <- NULL
  return(list(gTmp, cTmp))
}

# RETURNS NESTED LIST OF TWO DFs FOR EACH ORIGINAL DF
new_res <- lapply(res, runDfOperations)

# SEPARATE LISTS IF NEEDED (EQUAL LENGTH)
Gini <- lapply(new_res, "[[", 1)
Contrib <- lapply(new_res, "[[", 2)

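【Discussion】:

Thanks for your help. I think a parenthesis is missing in c(paste0("Gini_", data$year[[1]]). The solution works, but it does not show Income_1 and Income_2 as column names. Instead, I get Gini_RSV and Gini_RSV.1 as colnames. I also ran into another problem. In trying to present a self-contained example, I skipped some important parts of my code, namely that dfList is not the only grouping. With this new line of code, I do not know which groupings to simplify and how they relate to df. I edited my question to add this piece of code; the new grouping is dfOtherList.
I updated the answer for the first issue. As for the second, where does subdata come from? Are the two groupings related? If not, we may need to generate a second list of data frames through another lapply call.
I updated the code. subdata is year (my apologies). dfList applies the calculation runCalc to data by year, and its result is then passed to runOtherCalc through dfOtherList. See the thread stackoverflow.com/questions/38750824/…, where you already helped me two years ago.
Wow! Facepalm! How did I miss by back then! We live and learn. See the update, which separates the primary function (which calls by twice) from the secondary functions runCalc and runOtherCalc. The year must become a new column in the df returned by runCalc. The return value is a list of nested df pairs that can be split at the end.
Jesus, the more concise the code gets, the less applicable it is to my real worksheet... Is there a solution that does not require creating year as a new column in the df returned by runCalc? Your earlier help (two years ago), which defined a function of i and set year equal to i, was actually more suitable, because in my actual program the grouping variable can be year but can also be some other variable.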
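
One possible way to meet the request in the last comment, offered only as a sketch under assumptions: pass the grouping column by name and hand the group key to the worker function as a second argument, so the result does not need a year column. The helper run_by_group() and the worker calc() are hypothetical names, and calc() uses a plain weighted mean as a stand-in for the real runCalc body.

```r
# Toy data shaped like the question's example
df <- data.frame(
  year    = c("yearA", "yearA", "yearB", "yearB"),
  subdata = c("sub1", "sub1", "sub2", "sub1"),
  income  = c(1000, 1200, 1400, 1350),
  weight  = c(10, 25, 5, 11)
)

# Hypothetical helper: split by any column and pass each piece together
# with its group key to `fun`, then stack the one-row results.
run_by_group <- function(df, group_col, fun) {
  pieces <- split(df, df[[group_col]])
  out <- Map(fun, pieces, names(pieces))
  # unname() so rbind keeps the row names set inside `fun`
  do.call(rbind, unname(out))
}

# Hypothetical worker: the key arrives as an argument (like `i` in the
# question), so no grouping column has to be stored in the result.
calc <- function(d, key) {
  data.frame(
    mean_income = weighted.mean(d$income, d$weight),
    row.names = paste0("Gini_", key)
  )
}

by_year    <- run_by_group(df, "year", calc)
by_subdata <- run_by_group(df, "subdata", calc)
```

The same helper then serves both groupings from the question (year and subdata) by changing only the column name.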