【Question title】: Omitting NAs and empty data frames with map_if
【Posted】: 2019-04-16 05:46:57
【Question】:

I am working with a tibble like the following:

ex <- structure(list(rowid = c(4L, 5L, 6L, 9L, 10L), timestamp = structure(c(1502480694.03336, 
1502480695.44736, 1502480696.03336, 1502480703.99836, 1502480706.19936
), class = c("POSIXct", "POSIXt"), tzone = "UTC"), cat = c(32L, 
1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, NA_integer_, 
NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
    var2 = c(0, 50, 29.7, 51, 70.8), var3 = c(NA, 26.3, 24, 20.5, 
    12), order = c(NA, 1L, 1L, 1L, 1L), bfr = list(NA, structure(list(
        rowid = integer(0), timestamp = structure(numeric(0), class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = integer(0), var1 = structure(integer(0), .Label = "1", class = "factor"), 
        var2 = numeric(0), var3 = numeric(0), order = integer(0)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = integer(0)), structure(list(
        rowid = 5L, timestamp = structure(1502480695.44736, class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = 1L, var1 = structure(NA_integer_, .Label = "1", class = "factor"), 
        var2 = 50, var3 = 26.3, order = 1L), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -1L)), structure(list(
        rowid = 5:8, timestamp = structure(c(1502480695.44736, 
        1502480696.03336, 1502480699.03336, 1502480701.03336), class = c("POSIXct", 
        "POSIXt"), tzone = "UTC"), cat = c(1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
        var2 = c(50, 29.7, 52.8, 44), var3 = c(26.3, 24, 8.9, 
        12.4), order = c(1L, 1L, 1L, 1L)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -4L)), structure(list(
        rowid = 5:9, timestamp = structure(c(1502480695.44736, 
        1502480696.03336, 1502480699.03336, 1502480701.03336, 
        1502480703.99836), class = c("POSIXct", "POSIXt"), tzone = "UTC"), 
        cat = c(1L, 1L, 1L, 1L, 1L), var1 = structure(c(NA_integer_, 
        NA_integer_, NA_integer_, NA_integer_, NA_integer_), .Label = "1", class = "factor"), 
        var2 = c(50, 29.7, 52.8, 44, 51), var3 = c(26.3, 24, 
        8.9, 12.4, 20.5), order = c(1L, 1L, 1L, 1L, 1L)), class = c("tbl_df", 
    "tbl", "data.frame"), row.names = c(NA, -5L)))), row.names = c(4L, 
5L, 6L, 9L, 10L), class = "data.frame")

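I want to summarise the nested tibbles in the bfr column with map. To avoid unnecessary computation, I want to use map_if, which would skip a row whenever bfr contains fewer than 2 rows with cat == 1. However, because of the NAs and empty tibbles in the bfr column, I am struggling to write a proper predicate function. Here is my attempt: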

more_than <- function(df){
  if (nrow(df) == 0 | is.na(df)) return(FALSE)

  n <- df %>% 
    summarise(sum(cat == 1)) %>% 
    as.numeric()

  return(n > 2)
}

ex %>% 
  mutate(mean_var2 = map_if(bfr, more_than, 
                            ~.x %>% summarise(mean_var2 = mean(var2))))

which results in:

Error in if (nrow(df) == 0 | is.na(df)) return(FALSE) : argument is of length zero

How can I handle the presence of NAs and empty tibbles to write a proper predicate function?

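【Comments】:

  • The problem is is.na(df): it checks every value in the data for NA, whereas nrow returns a single value
  • Also, in more_than you are doing some other calculations whose results do not end up in the mean_var2 output
  • Sorry, I don't follow your first comment. Could you elaborate in your answer? more_than is only a predicate, meant to avoid unnecessary computation on some elements of the bfr column.

Tags: r predicate purrr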


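【Solution 1】:

If the intention is to get the mean of the 'var2' column, check whether the list element is a data.frame or a tibble (in this case it is a tibble), and then do the summarise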

# is_tibble() replaces the deprecated is.tibble()
out <- ex %>% 
  mutate(mean_var2 = map_if(bfr, is_tibble, ~ 
    .x %>% 
      summarise(mean_var2 = mean(var2, na.rm = TRUE))))

If we also need to check sum(cat == 1) > 2

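more_than <- function(df) {
  # only tibbles can be summarised; NA elements fail this check
  i1 <- is_tibble(df)
  if (i1) {
    n <- df %>% 
      summarise(v1 = sum(cat == 1)) %>%
      pull(v1)
  }

  # && short-circuits, so n is only evaluated when i1 is TRUE
  i1 && (n > 2)
}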
ex %>%
  mutate(mean_var2 = map_if(bfr, more_than, ~
      .x %>%
         summarise(mean_var2 = mean(var2, na.rm = TRUE))))

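The reason is.na does not work is that it is applied to each element of bfr. Some of those elements are tibbles, for which is.na returns a logical matrix, whereas if/else expects a single TRUE/FALSE. For the NA element, nrow(df) is NULL, so the whole condition has length zero, which is what the error message reports. For example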

(3 == 4) & (cbind(3:5, 1:3) == 3)

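returns a logical matrix rather than a single value

One option is to use &&, which evaluates the right-hand side only when the first condition is TRUE, thereby avoiding unnecessary evaluation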

(3 == 4) && (cbind(3:5, 1:3) == 3)
#[1] FALSE
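
To make the failure in the question concrete, here is a small base R sketch. The list bfr below is a made-up stand-in for the bfr column (it is not the data from the question) with one NA, one empty data frame and one filled data frame.

```r
# Hypothetical stand-in for the bfr column: an NA, an empty data frame,
# and a data frame with three rows.
bfr <- list(
  NA,
  data.frame(cat = integer(0), var2 = numeric(0)),
  data.frame(cat = c(1L, 1L, 1L), var2 = c(50, 29.7, 52.8))
)

# nrow() of a bare NA is NULL, so the comparison has length zero.
na_check <- nrow(bfr[[1]]) == 0
length(na_check)

# The vectorised `|` keeps that zero length, and if() cannot use a
# zero-length condition, so it raises the error seen in the question.
cond <- na_check | is.na(bfr[[1]])
length(cond)
err <- tryCatch(
  if (cond) "skip" else "keep",
  error = function(e) conditionMessage(e)
)
err

# On a data frame, is.na() returns a logical matrix with one cell per value.
na_matrix <- is.na(bfr[[3]])
dim(na_matrix)
```

The first two length() calls both give 0, and dim(na_matrix) gives 3 and 2, so neither side of the original condition is the single TRUE/FALSE that if() needs.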

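In the OP's original function, replacing | with || should make it work. Note that this relies on || accepting an operand longer than one (the matrix returned by is.na(df)), which R 4.3.0 and later reject with an error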

more_than <- function(df){
  if (nrow(df) == 0 || is.na(df)) return(FALSE)

  n <- df %>% 
    summarise(sum(cat == 1)) %>% 
    as.numeric()

  return(n > 2)
}
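
As a variation that does not depend on how || treats operands of unusual length, the predicate can test the type first and let && short-circuit. This is only a sketch: more_than_safe and its min_n argument are hypothetical names, not part of the original answer, and it uses base R so it also accepts plain data frames.

```r
# Hypothetical helper, not from the original answer.
more_than_safe <- function(df, min_n = 2) {
  # Each && operand is a single TRUE/FALSE; later checks run only when
  # the earlier ones pass, so NA and empty elements never reach sum().
  is.data.frame(df) &&
    nrow(df) > 0 &&
    sum(df$cat == 1, na.rm = TRUE) > min_n
}

r_na    <- more_than_safe(NA)                                # FALSE
r_empty <- more_than_safe(data.frame(cat = integer(0)))      # FALSE
r_three <- more_than_safe(data.frame(cat = c(1L, 1L, 1L)))   # TRUE
r_two   <- more_than_safe(data.frame(cat = c(1L, 1L, 32L)))  # FALSE
```

Because a tibble is also a data.frame, is.data.frame() is TRUE for the nested tibbles, and the function could be passed to map_if in place of more_than.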

Update

If we want to return NA for the cases that do not satisfy the condition

ex %>%
  mutate(mean_var2 = map_dbl(bfr, ~ 
    # NA_real_ keeps every result a double, as map_dbl expects
    if (is_tibble(.x) && sum(.x$cat == 1) > 2) mean(.x$var2, na.rm = TRUE) else NA_real_))

Another option is to use possibly (similar to tryCatch)

posmean <- possibly(function(dat) if(sum(dat$cat == 1) > 2) 
     mean(dat$var2, na.rm  = TRUE) else NA_real_, otherwise = NA_real_)
ex %>% 
     mutate(mean_var2 = map_dbl(bfr, posmean))
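
To illustrate the comparison with tryCatch, the sketch below builds a rough base R equivalent of that wrapper. possibly_sketch and mean_if_enough are hypothetical names, and this is not purrr's actual implementation, only the general idea: call the function, and return a fallback value if it throws an error.

```r
# Rough base R stand-in for the idea behind possibly(); an assumption
# for illustration, not purrr's code.
possibly_sketch <- function(f, otherwise) {
  force(f)
  force(otherwise)
  function(...) tryCatch(f(...), error = function(e) otherwise)
}

# Same logic as the function wrapped in the answer above.
mean_if_enough <- function(dat) {
  if (sum(dat$cat == 1) > 2) mean(dat$var2, na.rm = TRUE) else NA_real_
}

safe_mean <- possibly_sketch(mean_if_enough, otherwise = NA_real_)

# NA$cat raises an error, so the fallback is returned.
from_na <- safe_mean(NA)
# A data frame with three rows where cat == 1 yields the mean of var2.
from_df <- safe_mean(data.frame(cat = c(1L, 1L, 1L), var2 = c(1, 2, 3)))
```

Here from_na is NA_real_ and from_df is 2. One consequence of this design is that any error, not just the one caused by NA elements, is turned into the fallback value.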

【Comments】:

  • Great! It works now. I still wonder why is.na does not work. Could you elaborate on your first comment under my post when you have time? Thanks!
  • ex %>% mutate(mean_var2 = map_if(bfr, more_than, ~.x %>% summarise(mean_var2 = mean(var2, na.rm = TRUE)), .else = NA_integer_)) Defining .else solves the problem. .else: A function applied to elements of .x for which .p returns FALSE.
  • @jakes You can use ex %>% mutate(mean_var2 = map_dbl(bfr, ~ if(is_tibble(.x) && sum(.x$cat == 1) > 2) mean(.x$var2, na.rm = TRUE) else NA))
  • @A.Suliman That is a nice option. I wasn't aware of it. You should post it as an answer
  • @akrun .else needs a function. I tested it with dplyr::first and it works as expected, so I defined a function like foo <- function(x){return(NA)} and ended up with ex %>% mutate(mean_var2 = map_if(bfr, more_than, ~.x %>% summarise(mean_var2 = mean(var2, na.rm = TRUE)), .else = foo)) %>% select(mean_var2). You will surely find something better. Another option is .else = ~return(NA)
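【Solution 2】:

First, we need to check for NA before checking nrow, using || (see the difference between | and ||). Then we need .else, which the documentation describes as:

.else: A function applied to elements of .x for which .p returns FALSE.

That is, the function applied when more_than returns FALSE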

more_than <- function(df){
  if (all(is.na(df)) || nrow(df) == 0) return(FALSE)

  n <- df %>%
    summarise(sum(cat == 1)) %>%
    as.numeric()

  return(n > 2)
}

ex %>% 
mutate(mean_var2 = map_if(bfr, more_than, 
                          ~.x %>% summarise(mean_var2 = mean(var2,na.rm = TRUE)),
                         .else = ~return(NA))) %>% 
select(mean_var2)

   mean_var2
1        NA
2        NA
3        NA
4    44.125
5      45.5
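
The role of .else can be sketched in base R. map_if_sketch below is a hypothetical stand-in, not purrr's implementation: it applies .f where the predicate is TRUE and .else everywhere else, leaving elements unchanged by default. The list bfr and the predicate enough are made up for the illustration.

```r
# Hypothetical base R stand-in for the map_if(.x, .p, .f, .else) pattern.
map_if_sketch <- function(.x, .p, .f, .else = identity) {
  lapply(.x, function(el) if (isTRUE(.p(el))) .f(el) else .else(el))
}

# Made-up list shaped like bfr: an NA, an empty data frame, a filled one.
bfr <- list(
  NA,
  data.frame(cat = integer(0), var2 = numeric(0)),
  data.frame(cat = c(1L, 1L, 1L, 1L), var2 = c(50, 29.7, 52.8, 44))
)

# Predicate: a data frame with more than two rows where cat == 1.
enough <- function(df) is.data.frame(df) && sum(df$cat == 1) > 2

res <- map_if_sketch(
  bfr, enough,
  function(df) mean(df$var2, na.rm = TRUE),
  .else = function(el) NA_real_
)
means <- unlist(res)
```

With .else returning NA_real_, every element of res is a single double, so unlist() gives a numeric vector (NA, NA, 44.125) instead of a list that mixes NA, empty data frames and summaries.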
