Difference between .data and cur_data(): answers

【Question title】: Difference between .data and cur_data()
【Posted】: 2021-12-23 17:28:43
【Question】:

m <- 10
mtcars %>% dplyr::mutate(disp = .data$disp * .env$m)

is equivalent to

m <- 10
mtcars %>% dplyr::mutate(disp = cur_data()$disp * .env$m)

Can you give an example where cur_data() and .data produce different results?

I have been told that cur_data() and .data are not interchangeable in every situation.

【Comments】:

Tags: r dplyr

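【Answer 1】:

Here is an example, taken from another post, in which the two behave differently: the cur_data() version returns a result, while the .data version raises an error.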

library(dplyr)
library(rstatix)
data %>%
     summarise(across(where(is.numeric),
      ~  cur_data() %>%
       levene_test(reformulate("Treatment", response = cur_column())))) %>%
    unclass %>% 
     bind_rows(.id = 'flux')
# A tibble: 3 × 5
  flux    df1   df2 statistic     p
  <chr> <int> <int>     <dbl> <dbl>
1 flux1     1     8     0.410 0.540
2 flux2     1     8     2.85  0.130
3 flux3     1     8     1.11  0.323
data %>%
     summarise(across(where(is.numeric),
      ~  .data %>%
       levene_test(reformulate("Treatment", response = cur_column())))) %>%
     unclass %>% 
     bind_rows(.id = 'flux')

Error: Problem with `summarise()` input `..1`.
ℹ `..1 = across(...)`.
✖ cannot coerce class ‘"rlang_data_pronoun"’ to a data.frame
Run `rlang::last_error()` to see where the error occurred.

data

data <- data.frame(site = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L), 
                                    .Label = c("S1 ", "S2 ", "S3 "), class = "factor"), 
                   plot = structure(c(1L, 2L, 3L, 1L, 2L, 3L, 1L, 2L, 3L, 1L), 
                                    .Label = c(" Tree 1 ", " Tree 2 ", " Tree 3 "), class = "factor"), 
                   Treatment = structure(c(2L, 2L, 2L, 1L, 1L, 1L, 2L, 2L, 2L, 1L), .Label = c("T1", "T2"), class = "factor"), 
                   flux1 = c(11.52188065, 8.43156699, 4.495312274, -1.866676811, 3.861102035, -0.814742373, 6.51039536, 4.767950345, 10.36544542, 1.065963875), 
                   flux2 = c(0.142259208, 0.04060245, 0.807631744, 0.060127596, -0.157762562, 0.062464942, 0.043147603, 0.495001652, 0.34363348, 0.134183704), 
                   flux3 = c(0.147506197, 1.131009714, 0.038860728, 0.0176834, 0.053191593, 0.047591306, 0.00573377, -0.034926075, 0.123379247, 0.018882469))
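
The same across() + cur_data() + cur_column() pattern can be tried without rstatix. The following is only a minimal sketch under some assumptions: it uses mtcars and a base R lm() fit instead of levene_test(), and the name slopes is made up for illustration. It also assumes a dplyr version in which cur_data() is available (it is deprecated in favour of pick() from dplyr 1.1.0 and may emit a warning there).

```r
library(dplyr)

# For each selected column, regress that column on wt.
# cur_data() supplies a real tibble to lm(); cur_column() supplies the
# name of the column currently being processed by across().
slopes <- mtcars %>%
  summarise(across(
    c(mpg, disp),
    ~ coef(lm(reformulate("wt", response = cur_column()),
              data = cur_data()))[["wt"]]
  ))

print(slopes)
```

Replacing cur_data() with .data in this sketch should fail for the same reason as above, because lm() expects a data frame (or something it can coerce to one) for its data argument.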

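【Comments】:

【Answer 2】:

Within group_by(), .data still includes all columns, but cur_data() excludes the grouping columns. For example, cur_data()[["cyl"]] below is NULL because cyl is the grouping column, so x does not appear in the result whereas y does.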

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  mutate(x = cur_data()[["cyl"]], y = .data[["cyl"]]) %>%
  ungroup %>%
  names
##  [1] "mpg"  "cyl"  "disp" "hp"   "drat" "wt"   "qsec" "vs"   "am"   "gear"
## [11] "carb" "y"

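【Comments】:

【Answer 3】:

Adding to the existing answers:

In the vignette linked in the comments we find the following quote:

Note that .data is not a data frame; it's a special construct, a pronoun, that allows you to access the current variables either directly, with .data$x or indirectly with .data[[var]]. Don't expect other functions to work with it.

It is important to understand that .data is a special construct whose only purpose is to help us access variables. It is neither a data.frame nor a function. Apart from [[ and $, most other functions do not work with .data. Even [ does not work. Suppose we want to access several variables using .data. If .data were a data.frame, the following would work, but it does not: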

library(dplyr)

mtcars %>% 
  transmute(new = list(.data[c("disp", "hp")]))
#> Error: Problem with `mutate()` column `new`.
#> i `new = list(.data[c("disp", "hp")])`.
#> x `[` is not supported by .data pronoun, use `[[` or $ instead.

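cur_data(), on the other hand, is a function that returns the current data, without the grouping variables, as a tibble (even if the underlying data is just a data.frame).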
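
To make the contrast concrete, here is a small sketch that checks from inside a dplyr verb what each object is, and then uses `[` on cur_data() to pick several columns at once. The names kinds and totals are made up for illustration, and the code assumes a dplyr version in which cur_data() is available (it is deprecated in favour of pick() from dplyr 1.1.0 and may emit a warning there).

```r
library(dplyr)

# Ask each object whether it is a data frame.
kinds <- mtcars %>%
  summarise(
    pronoun_is_df  = is.data.frame(.data),      # the pronoun itself
    cur_data_is_df = is.data.frame(cur_data())  # the tibble returned by cur_data()
  )
print(kinds)

# Because cur_data() returns a tibble, `[` with several column names works,
# so the selected columns can be passed to any data-frame function.
totals <- mtcars %>%
  transmute(total = rowSums(cur_data()[c("disp", "hp")]))
head(totals)
```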

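In terms of speed, cur_data() adds very little overhead compared with .data or with accessing the variable without any prefix. Take a medium-sized dataset as an example: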

library(dplyr)
library(nycflights13)

bench::mark(iterations = 5000L,
            "none" = mutate(flights, new = arr_time),
            ".data" = mutate(flights, new = .data$arr_time),
            "cur_data()" = mutate(flights, new = cur_data()$arr_time))

#> # A tibble: 3 x 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 none         1.56ms   1.69ms      578.   132.9MB     15.1
#> 2 .data        1.53ms   1.72ms      551.    24.7KB     14.4
#> 3 cur_data()    1.6ms   1.77ms      535.    33.5KB     14.7

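Created on 2021-12-23 by the reprex package (v0.3.0)

Regarding interchangeability, I see the following differences:

- .data cannot be used as a data.frame, meaning that, unlike cur_data(), it does not return the underlying data.
- .data can only be used to access one variable at a time, via [[ or $, whereas cur_data() returns a tibble and works with every function that accepts tibbles or data.frames.
- In terms of speed, cur_data() does not add much overhead, at least not for a medium-sized dataset. This should be verified on larger data with more columns.
- .data can be used to access grouping variables, which cur_data() cannot. However, cur_data_all() is a similar function that also returns the current data but includes the grouping variables. This latter function should be fully interchangeable with .data; at least I cannot think of a case where we could not use both.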

library(dplyr)

mtcars %>%
  group_by(cyl) %>%
  transmute(x = cur_data()[["cyl"]],
            y = .data[["cyl"]],
            z = cur_data_all()[["cyl"]]) 

#> # A tibble: 32 x 3
#> # Groups:   cyl [3]
#>      cyl     y     z
#>    <dbl> <dbl> <dbl>
#>  1     6     6     6
#>  2     6     6     6
#>  3     4     4     4
#>  4     6     6     6
#>  5     8     8     8
#>  6     6     6     6
#>  7     8     8     8
#>  8     4     4     4
#>  9     4     4     4
#> 10     6     6     6
#> # … with 22 more rows
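
Another way to see the same difference is to count the columns each function returns per group. This is only an illustrative sketch; the name widths is made up, and it assumes a dplyr version in which cur_data() and cur_data_all() are available (both are deprecated in favour of pick() from dplyr 1.1.0).

```r
library(dplyr)

# mtcars has 11 columns, one of which (cyl) is used for grouping.
widths <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    n_cur = ncol(cur_data()),     # grouping column excluded: 10
    n_all = ncol(cur_data_all())  # grouping column included: 11
  )

print(widths)
```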

【Comments】: