【Question title】: Using column name as function argument in R
【Posted】: 2021-05-16 09:27:58
【Question】:

I am trying to write an R function that imputes group means into a specific column of a data frame.

impute_means <- function(df, group_by, column){
  
  vals_to_impute <- df %>%
    group_by_at(group_by) %>%
    summarise(x = mean(get(column), na.rm = TRUE))
  
  df %>%
    filter(is.na(get(column))) %>%
    select(group_by, column) %>%
    left_join(vals_to_impute, by=group_by)
}

impute_means(df = weather_data, group_by = c("year","month","code","type"), column = "temperature")

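The function currently returns this:

However, I now want to check the "temperature" column for NA values and replace them with the values from column x.

I tried to do this by adding a mutate statement at the end, but it does not seem to work: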

impute_means <- function(df, group_by, column){
  
  vals_to_impute <- df %>%
    group_by_at(group_by) %>%
    summarise(x = mean(get(column), na.rm = TRUE))
  
  df %>%
    filter(is.na(get(column))) %>%
    select(group_by, column) %>%
    left_join(vals_to_impute, by=group_by) %>%
    mutate(column = case_when(is.na(get(column))~x,
                                   TRUE~get(column)))
}

Minimal data to reproduce:

weather_data <- structure(list(year = structure(c(8L, 8L, 1L, 1L, 2L, 2L, 3L, 
3L, 5L, 6L), .Label = c("2000", "2001", "2002", "2003", "2004", 
"2005", "2006", "2007", "2008", "2009", "2010", "2011", "2012", 
"2013", "2014", "2015", "2016", "2017", "2018", "2019"), class = "factor"), 
    month = structure(c(12L, 12L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
    2L), .Label = c("1", "2", "3", "4", "5", "6", "7", "8", "9", 
    "10", "11", "12"), class = "factor"), code = structure(c(1L, 
    2L, 6L, 1L, 6L, 2L, 2L, 2L, 6L, 2L), .Label = c("1", "2", 
    "3", "4", "5", "6"), class = "factor"), type = structure(c(2L, 
    2L, 6L, 2L, 6L, 2L, 2L, 3L, 6L, 3L), .Label = c("1", "2", 
    "3", "4", "5", "6"), class = "factor"), temperature = c(NA, 
    NA, 20.8, 19.5, 1.4, 3.1, 27.3, 25.4, 20.2, 26.6)), row.names = c(NA, 
-10L), class = c("tbl_df", "tbl", "data.frame"))

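【Comments】:

  • I'm not quite sure what you are trying to do, but I think it will be easier if you go step by step instead of trying to do everything in one line. Just start with a line like stuff_to_calculate_mean <- df[,columns] and continue from there.
  • @RonakShah Added.
  • Keep in mind that mean imputation is a suboptimal approach (see e.g. 1, 2, 3), and there are better alternatives.

Tags: r function dplyr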


【Solution 1】:

You can do:

library(dplyr)

impute_means <- function(df, group_by, column){
  
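  df %>%
    # keep a copy of the original values so the NA rows can be found later
    mutate(val = .data[[column]]) %>%
    group_by(across(all_of(group_by))) %>%
    # overwrite `column` with its group mean; `!!column :=` uses the string as the column name
    mutate(!!column := mean(.data[[column]], na.rm = TRUE)) %>%
    # keep only the rows that were originally NA
    filter(is.na(val)) %>%
    select(-val) %>%
    ungroup()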
}

impute_means(df = weather_data, 
             group_by = c("year","month","code","type"), 
             column = "temperature")

I used mutate, which keeps the number of rows in the data unchanged, instead of collapsing the data with summarise and then performing a join.
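
If the goal is to end up with the full data frame where only the missing values are filled in, the same grouped mutate idea can be adapted. The following is a minimal sketch, not part of the original answer: the function name impute_means_inplace, the argument name group_cols, and the toy data are all made up for illustration.

```r
library(dplyr)

# Hypothetical helper: keeps every row and fills only the NA entries
# of `column` with the mean of the group the row belongs to.
impute_means_inplace <- function(df, group_cols, column) {
  df %>%
    group_by(across(all_of(group_cols))) %>%
    mutate(!!column := if_else(is.na(.data[[column]]),
                               mean(.data[[column]], na.rm = TRUE),
                               .data[[column]])) %>%
    ungroup()
}

# Toy data, invented for this example
toy <- tibble(
  grp  = c("a", "a", "a", "b", "b"),
  temp = c(10, NA, 20, 5, NA)
)

result <- impute_means_inplace(toy, group_cols = "grp", column = "temp")
result$temp
# 10 15 20 5 5
```

Note that a group containing only NA values has no mean to fall back on, so its rows would end up as NaN with this approach.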

If you find it easier to read, you can replace .data[[column]] with get(column). Both should work the same way here.
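
To illustrate the equivalence, here is a small sketch on made-up data (the tibble toy and the names by_pronoun and by_get are assumptions for this example only). Both pipelines look the column up by the string stored in column.

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(grp = c("a", "a", "b"), temp = c(1, 3, 10))
column <- "temp"

# Look the column up through the .data pronoun
by_pronoun <- toy %>%
  group_by(grp) %>%
  summarise(x = mean(.data[[column]], na.rm = TRUE))

# Look the column up with base R's get()
by_get <- toy %>%
  group_by(grp) %>%
  summarise(x = mean(get(column), na.rm = TRUE))

all.equal(by_pronoun, by_get)
```

One difference that may matter in practice: get() also searches the enclosing environments, so a misspelled column name can silently pick up an unrelated variable of that name, whereas .data[[column]] only looks inside the data frame and raises an error instead.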

【Comments】:

  • I think what I was missing was the := inside the mutate call.
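
The difference that := makes can be seen in a small sketch on made-up data (the tibble toy and the names literal and injected are assumptions for this example). Writing column = ... inside mutate creates a column that is literally named "column", while !!column := ... uses the string stored in the variable as the column name.

```r
library(dplyr)

# Toy data, invented for this example
toy <- tibble(temp = c(1, NA, 3))
column <- "temp"

# `column = ...` adds a new column literally called "column"
literal <- toy %>% mutate(column = 0)
names(literal)
# "temp" "column"

# `!!column := ...` injects the string, so the existing "temp" column is overwritten
injected <- toy %>% mutate(!!column := 0)
names(injected)
# "temp"
```

This is why the mutate(column = case_when(...)) attempt in the question leaves temperature untouched and adds a separate column instead.
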
【Solution 2】:

Could you try this function?

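impute_means <- function(df, group_by, column){
  
  df %>% 
    # group by the columns named in the character vector `group_by`
    group_by(across(all_of(group_by))) %>% 
    # overwrite `column` with its group mean, ignoring NAs
    mutate(across(all_of(column), ~ mean(.x, na.rm = TRUE)))
}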

Or, if you need a new column:

impute_means <- function(df, group_by, column){
  
  df %>% 
    group_by(across(all_of(group_by))) %>% 
    # store the group mean in a new column `x`, leaving `column` untouched
    mutate(x = mean(.data[[column]], na.rm = TRUE))
}
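
If only the missing values should be replaced, possibly in several columns at once, across() can take a lambda that decides per value. A minimal sketch under stated assumptions: the function name impute_means_across, the argument names group_cols and columns, and the toy data are hypothetical and not part of the answer.

```r
library(dplyr)

# Hypothetical variant: for each column named in `columns`, replace only
# the NA entries with the group mean and leave observed values as they are.
impute_means_across <- function(df, group_cols, columns) {
  df %>%
    group_by(across(all_of(group_cols))) %>%
    mutate(across(all_of(columns),
                  ~ if_else(is.na(.x), mean(.x, na.rm = TRUE), .x))) %>%
    ungroup()
}

# Toy data, invented for this example
toy <- tibble(
  grp  = c("a", "a", "b", "b"),
  temp = c(1, NA, 4, 6),
  hum  = c(NA, 50, 70, NA)
)

filled <- impute_means_across(toy, "grp", c("temp", "hum"))
filled$temp
# 1 1 4 6
filled$hum
# 50 50 70 70
```

Wrapping the character vectors in all_of() makes it explicit that they hold column names, which avoids the ambiguity of passing an external vector directly to a selection.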
  

【Comments】: