【Question title】: Writing an R function which only subsets when stated
【Posted】: 2022-01-17 07:33:33
【Question】:

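I am trying to write a function that extracts the mean, minimum and maximum of a specific column (depth) from a data frame. The data can be classified by two categorical variables. The first is type, which the function uses for grouping. The second is whether the data were collected in 2020 or 2021. By default I would like the function to use the data from all years, and to subset by year only when a year is stated in the arguments. It would also be nice if I could change the variable (for example, length instead of depth). Here is my code: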

analysis <- function(data=measurements, yearX=2020){
  data %>%
    subset(year == yearX) %>%  ## Subsets the dataset by specific year
    group_by(type) %>%  ## groups the data by type 
    summarise(mBD=mean(depth), sdBD=sd(depth), minBD=min(depth),
              maxBD=max(depth), median=median(depth), 
              range=(max(depth) - min(depth)))
}

【Comments】:

    Tags: r function arguments subset optional


    【Solution 1】:

    One option to achieve your desired result could look like this:

    set.seed(123)
    
    measurements <- data.frame(
      year = rep(2020:2021, each = 10),
      type = rep(c("A", "B")),
      length = runif(20),
      depth = runif(20)
    )
    
    library(dplyr)
    
    analysis <- function(data = measurements, x, yearX = NULL) {
      # Subset by year if given
      if (!is.null(yearX)) data <- filter(data, year %in% yearX) 
      data %>%
        group_by(type) %>%
        summarise(across({{x}}, .fns = list(
          mBD = mean, 
          sdBD = sd, 
          minBD = min, 
          maxBD = max, 
          median = median, 
          range = ~ diff(range(.x))), .names = "{.fn}"
          ))
    }
    
    analysis(x = depth)
    #> # A tibble: 2 × 7
    #>   type    mBD  sdBD  minBD maxBD median range
    #>   <chr> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
    #> 1 A     0.577 0.290 0.0246 0.963  0.648 0.938
    #> 2 B     0.576 0.299 0.147  0.994  0.643 0.847
    
    analysis(measurements, depth, 2020)
    #> # A tibble: 2 × 7
    #>   type    mBD  sdBD minBD maxBD median range
    #>   <chr> <dbl> <dbl> <dbl> <dbl>  <dbl> <dbl>
    #> 1 A     0.604 0.217 0.289 0.890  0.641 0.600
    #> 2 B     0.627 0.307 0.147 0.994  0.693 0.847
    
    analysis(measurements, length, 2021)
    #> # A tibble: 2 × 7
    #>   type    mBD  sdBD  minBD maxBD median range
    #>   <chr> <dbl> <dbl>  <dbl> <dbl>  <dbl> <dbl>
    #> 1 A     0.462 0.348 0.103  0.957  0.328 0.854
    #> 2 B     0.584 0.370 0.0421 0.955  0.573 0.912
    

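    【Comments】:

    • Thank you very much, this function does exactly what I wanted. Just a quick question: what does the .fn bit mean (as in .fns = list and .names = "{.fn}")?
    • Hi John. You're welcome. With dplyr::across you can pass a (named) list of functions via .fns, and these are then applied to the column x passed to the function. It is a bit more concise because we don't have to repeat the argument for each function. With the .names argument you specify how the columns in the aggregated dataset should be named. "{.fn}" is glue notation and means that each column is labelled with the name given to the function in .fns.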
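
To make the naming rule from the comment above concrete, here is a small sketch that is not part of the answer. It assumes the built-in mtcars data as a stand-in for measurements and uses the "{.col}_{.fn}" pattern, so each output column combines the source column name with the name given in the .fns list. The names avg and top are purely illustrative.

```r
library(dplyr)

# Illustrative only: mtcars stands in for the measurements data
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(
    c(mpg, hp),                          # two columns instead of one
    .fns = list(avg = mean, top = max),  # named list of functions
    .names = "{.col}_{.fn}"              # glue pattern: column name, then function name
  ))

names(res)
#> [1] "cyl"     "mpg_avg" "mpg_top" "hp_avg"  "hp_top"
```

With .names = "{.fn}", as in the answer, the column part is dropped. That is convenient for a single column, but the names would no longer be unique if several columns were summarised at once.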
    【Solution 2】:

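    To benefit from the convenience of subset() inside a function, we can use match.call() together with match() and the formalArgs() of subset.default to build a subset call, which we can then evaluate with eval(). If no subset is specified, these lines have no effect, as if they had been left out.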
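
The call-rewriting step described above can be inspected in isolation. The following is a minimal sketch, not the answer's code: build_subset_call is a hypothetical helper that returns the rebuilt call instead of evaluating it, and the built-in mtcars data is assumed as example input.

```r
# Hypothetical helper (not part of the answer): returns the subset() call
# instead of evaluating it, so the mechanism can be inspected
build_subset_call <- function(..., col) {
  cll <- match.call()
  # Positions of the arguments subset() knows (x, subset); 0 means absent
  m <- match(formalArgs(subset.default), names(cll), 0L)
  cll <- cll[c(1L, m)]        # keep the function slot plus the matched arguments
  cll[[1L]] <- quote(subset)  # swap the function name for subset
  cll
}

with_filter <- build_subset_call(x = mtcars, col = "mpg", subset = cyl == 4)
no_filter   <- build_subset_call(x = mtcars, col = "mpg")

print(with_filter)       # the rebuilt call, including the subset condition
print(no_filter)         # the rebuilt call without a condition
nrow(eval(with_filter))  # only the rows with cyl == 4
nrow(eval(no_filter))    # all rows
```

Because col is not an argument of subset(), it is dropped from the rebuilt call. When subset is not supplied, the call contains only x, so evaluating it returns every row.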

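    For the rest, we define a summary function, in which we should decide what happens when there are NAs, and use it in aggregate() with a formula that is easily created using reformulate().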
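
As a small illustration of the reformulate() and aggregate() combination just described, the sketch below again assumes the built-in mtcars data. The summary function stats is a shortened, hypothetical stand-in for the one used in the answer.

```r
# Build the formula mpg ~ cyl from character strings
f <- reformulate("cyl", response = "mpg")

# Illustrative summary function returning several named statistics
stats <- function(x) c(mean = mean(x), max = max(x))

res <- aggregate(f, data = mtcars, FUN = stats)

# The statistics are stored as one matrix column named after the response
is.matrix(res$mpg)
colnames(res$mpg)

# Optional: flatten the matrix column into ordinary columns
flat <- do.call(data.frame, res)
names(flat)
```

The matrix column is why the printed result shows names such as depth.mBD. Flattening with do.call(data.frame, ...) is one way to get regular columns if they are needed later.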

    With a case distinction, we can also omit the grouping.

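    FUN <- function(..., col, group=NA, na.rm=FALSE) {
      # Rebuild the call as subset(x=..., subset=...), keeping only the
      # arguments that subset() understands
      cll <- match.call()
      m <- match(formalArgs(subset.default), names(cll), 0L)
      m <- cll[c(1L, m)]
      m[[1L]] <- quote(subset)
      dat <- eval(m)  # all rows are kept when no subset= was supplied
      # Summary statistics; na.rm controls how NAs are handled
      mysum <- function(x) c(mBD=mean(x, na.rm=na.rm), sdBD=sd(x, na.rm=na.rm), 
                             minBD=min(x, na.rm=na.rm), maxBD=max(x, na.rm=na.rm), 
                             median=median(x, na.rm=na.rm), 
                             range=max(x, na.rm=na.rm) - min(x, na.rm=na.rm))
      if (!is.na(group)) {
        # Grouped summary via a formula such as depth ~ type
        res <- aggregate(reformulate(group, col), dat, mysum)
      } else {
        # No grouping: summarise the whole column
        res <- mysum(dat[, col])
      }
      return(res)
    }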
    

    Usage

    FUN(x=measurements, col='depth', group='type')
    #   type  depth.mBD depth.sdBD depth.minBD depth.maxBD depth.median depth.range
    # 1    A 0.57739614 0.29037002  0.02461368  0.96302423   0.64810631  0.93841055
    # 2    B 0.57604555 0.29862847  0.14711365  0.99426978   0.64347271  0.84715613
    
    FUN(x=measurements, col='depth', group='type', subset=year == 2020)
    #   type depth.mBD depth.sdBD depth.minBD depth.maxBD depth.median depth.range
    # 1    A 0.6037955  0.2169419   0.2891597   0.8895393    0.6405068   0.6003796
    # 2    B 0.6273719  0.3070970   0.1471136   0.9942698    0.6928034   0.8471561
    
    FUN(x=measurements, col='length', group='type', subset=year == 2020)
    #   type length.mBD length.sdBD length.minBD length.maxBD length.median length.range
    # 1    A  0.5433124   0.2457008    0.2875775    0.9404673     0.5281055    0.6528898
    # 2    B  0.6131826   0.3633747    0.0455565    0.8924190     0.7883051    0.8468625
    
    FUN(x=measurements, col='depth', group=NA)
    #        mBD       sdBD      minBD      maxBD     median      range 
    # 0.57672085 0.28667353 0.02461368 0.99426978 0.64810631 0.96965609  
    

    Data (borrowed from stefan):

    measurements <- structure(list(year = c(2020L, 2020L, 2020L, 2020L, 2020L, 2020L, 
    2020L, 2020L, 2020L, 2020L, 2021L, 2021L, 2021L, 2021L, 2021L, 
    2021L, 2021L, 2021L, 2021L, 2021L), type = c("A", "B", "A", "B", 
    "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", "B", "A", 
    "B", "A", "B"), length = c(0.287577520124614, 0.788305135443807, 
    0.4089769218117, 0.883017404004931, 0.940467284293845, 0.0455564993899316, 
    0.528105488047004, 0.892419044394046, 0.551435014465824, 0.456614735303447, 
    0.956833345349878, 0.453334156190977, 0.677570635452867, 0.572633401956409, 
    0.102924682665616, 0.899824970401824, 0.24608773435466, 0.0420595335308462, 
    0.327920719282702, 0.954503649147227), depth = c(0.889539316063747, 
    0.6928034061566, 0.640506813768297, 0.994269776623696, 0.655705799115822, 
    0.708530468167737, 0.544066024711356, 0.59414202044718, 0.28915973729454, 
    0.147113647311926, 0.963024232536554, 0.902299045119435, 0.690705278422683, 
    0.795467417687178, 0.0246136845089495, 0.477795971091837, 0.758459537522867, 
    0.216407935833558, 0.318181007634848, 0.231625785352662)), class = "data.frame", row.names = c(NA, 
    -20L))
    
