【问题标题】:Using "summarise" (dplyr) with a function requiring a formula将“summarise”(dplyr)与需要公式的函数一起使用
【发布时间】:2020-12-07 18:50:54
【问题描述】:

我正在尝试生成由基于 mblm 包的自定义函数生成的回归斜率表(此处示例中的函数是简化版本)。该函数需要一个公式作为参数,我想使用 dplyr summarise 将其应用于来自具有许多变量的大型数据帧的分组样本。 输出应该是我可以传递给热图函数的样本组和响应变量的回归斜率。

library (dplyr)

# Example data

test_data <- 
    rbind (
        data.frame(ID=paste0("someName", c(1:9)), Sample_Type="type1", 
           A=seq(1,17, length.out=9),
           I=0.1^seq(1,1.8,length.out=9), 
           J=1-0.1^seq(1,1.8,length.out=9)),
        data.frame(ID=paste0("someName", c(10:15)), Sample_Type="type2", 
           A=seq(1,7, length.out=6), 
           I=0.1^(1-seq(1,1.5,length.out=6)),
           J=1-0.1^(1-seq(1,1.5,length.out=6))))

# Define an independent and the responding variables - I would like to be able to easily test different independent variables
 
idpVar <- "A"
respVar <- test_data %>% .[!names(.) %in% c("ID", "Sample_Type", idpVar)] %>% names()

# Custom function generating numeric value of median slopes (simplified from mblm)

medianSlope <-
function (formula, dataframe) 
{
    if (missing(dataframe)) 
        dataframe <- environment(formula)
    term <- as.character(attr(terms(formula), "variables")[-1])
    x = dataframe[[term[2]]]
    y = dataframe[[term[1]]]
    if (length(term) > 2) {
        stop("Only linear models are accepted")
    }
    xx = sort(x)
    yy = y[order(x)]
    n = length(xx)
    slopes = c()
    smedians = c()
        for (i in 1:n) {
            slopes = c()
            for (j in 1:n) {
                if (xx[j] != xx[i]) {
                  slopes = c(slopes, (yy[j] - yy[i])/(xx[j] - 
                    xx[i]))
                 }
            }
            smedians = c(smedians, median(slopes))
        }
        slope = median(smedians)
    
    slope
}

# Custom function works with test dataframe and a single named dependent variable but "group_by" seems to be ignored:

test_data %>% group_by (Sample_Type) %>% medianSlope( formula(paste("J", "~", idpVar))  ,.)

暂时将分组问题放在一边,我尝试通过生成多个公式的列表来进行“总结”:

粘贴(respVar, "~", idpVar) [1]“B~A”“C~A”“D~A”“E~A”“F~A”“G~A”“H~A”“I~A”“J~A”“K 〜A”“L〜A”

然而

test_data %>% summarise_at (respVar, medianSlope(paste(respVar, "~", idpVar), .))

错误:$ 运算符对原子向量无效

test_data %>% summarise_at (respVar, medianSlope(paste(get(respVar), "~", get(idpVar)), .))

get(idpVar) 中的错误:找不到对象“A”

我对 R 比较陌生,有点迷茫。你能帮忙吗?

【问题讨论】:

    标签: r dplyr summarize nse


    【解决方案1】:

    我不确定这是否能够使用summarise_at 函数来完成。但是,我们可以结合使用map_dblby 和其他一些数据清理函数来执行计算:

    library(tidyverse)
    
    # split the data using `by` (acts as a group_by)
    # use `map_dbl` to iterate over the variables in respVar
    # we use setNames so that the returned vector from map_dbl is named
    # then, bind the rows together, convert to data frame
    # finally convert row names (groups) to a column
    by(test_data, test_data$Sample_Type,
           FUN = function(d) map_dbl(setNames(respVar, respVar), 
                                 ~medianSlope(formula(paste(.x, "~", idpVar)), 
                                              data = d))) %>%
        do.call("rbind", .) %>%
        as.data.frame() %>%
        rownames_to_column(var = "Sample_Type")
    
      Sample_Type            I            J
    1       type1 -0.004623987  0.004623987
    2       type2  0.341974269 -0.341974269
    

    【讨论】:

    • Bouncyball,感谢您的快速回复。我复制了您的解决方案并对其进行了少量修改: map (respVar, ~medianSlopes(formula(paste(.x, "~", idpVar)), data = test_data), .)
      但是,变量的分组仍然被忽略。我已经编辑了 test_data 和示例代码,以更清楚地说明我想要实现的目标。
    • {r} test_data %&gt;% group_by (Sample_Type) %&gt;% nest() %&gt;% map (respVar, ~medianSlopes(formula(paste(.x, "~", idpVar)), data = test_data), .) %&gt;% setNames(respVar) %&gt;% bind_rows() #instead of bind_cols() generate: # A tibble: 2 x 2 1 2 1 -0.00572 0.000418 2 0.00572 -0.000418 但是,我想要的输出是# A tibble : 2 x 3 Sample_Type IJ type1 -0.000418 0.000418 type2 0.3419743 -0.3419743
    • 更正:由 test_data 生成的值的输出应该是 #A tibble: 2 x 3 Sample_Type IJ type1 -0.004623987 0.004623987 type2 0.3419743 -0.3419743 type1 -0.00462 0.00462类型2 0.292 -0.292
    • @ThomasW 感谢这些笔记。我已经完全更新了我的答案。老实说,我不知道这是否可以使用summarise_at 来完成,所以我提出了一个替代解决方案。我想知道如果不是在函数中使用formula 接口,而是使用xy 参数,它可能会更容易吗?
    【解决方案2】:

    Bouncyball,再次感谢您的帮助。似乎“总结”和“变异”确实不能调用使用公式作为输入的函数,尽管我没有看到其他地方对此进行了解释。解决方法很有启发性,但我遵循了您的替代建议并重写了被调用的函数。 仍然是一个学习者,我给自己设定了一个挑战,即替换 mblm 派生代码中的“for”循环,并消除看似冗余的计算(以对 RAM 的更高需求为代价——但它仍然运行得更快)我的 PC 上的数据,我计划在下一步开发代码中重新使用 dx 矩阵)。两种解决方案如下。 干杯,托马斯

    mblm_2_short <-        # code adapted from mblm(y ~ x, repeated = T), for calculation of repeat median slope only 
    function (x, y) 
    {
    xx = sort(x)
    yy = y[order(x)]
    n = length(xx)
    slopes = c()
    smedians = c()
    
        for (i in 1:n) {
            slopes = c()
            for (j in 1:n) {
                if (xx[j] != xx[i]) {
                  slopes = c(slopes, (yy[j] - yy[i])/(xx[j] - 
                    xx[i]))
                 }
            }
            smedians = c(smedians, median(slopes))
        }
        slope = median(smedians)
    }
    

    .

    med_slopesMed <-      # repeat median slope- like mblm(y ~ x, repeated = T), slope only                     
    function (xx, yy) 
    {
      x = sort(xx)
      y = yy[order(xx)]
      n = length(x)
      
      dx = matrix (rep (0,n^2), ncol=n)
      dy = c()
      z  = matrix (rep (0,n^2), ncol=n)
      
      for (i in 1:(n-1)) {                   ### x-axis distances (dx) and slopes (z) between points
        dxi             = x[-(1:i)]-x[i]
        dx [i, (i+1):n] = dxi                # for points 1:n, x-axis distances to all other points
        dyi             = y[-(1:i)]-y[i]
        zi              = dyi/dxi           
        z [i, (i+1):n]  = zi                 # for points 1:n, linear slopes connecting with all other points
      }
      
      z = replace(z, is.infinite(z), NA)           # removes +/-Inf and NaN generated by dx=0
      z  = t(z)[,-n] + z[,-1]
    
      median (apply(z, 1, median, na.rm=T)) 
    }
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 1970-01-01
      • 2015-10-27
      • 2016-08-07
      • 1970-01-01
      • 2022-11-23
      • 1970-01-01
      • 2021-07-20
      • 2018-04-30
      相关资源
      最近更新 更多