How to translate `recipes::step_dummy()` to `dplyr`/`tidyr` code?

【Question title】: How to translate `recipes::step_dummy()` to `dplyr`/`tidyr` code?
【Posted】: 2022-01-19 13:21:36
【Question】:

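I'm trying to figure out how step_dummy() from the recipes package processes data. Although the function has a reference page, I still can't work out how to do the same thing with the "regular" tidyverse tools I know. Here is some code based on the recipes and rsample packages. I'd like to produce the same output using only dplyr/tidyr tools.

I picked the diamonds dataset from ggplot2 for the demonstration.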

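library(rsample)
library(recipes)

# `diamonds` ships with ggplot2; load it explicitly so the example runs on its own
data(diamonds, package = "ggplot2")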

my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split  <- initial_split(my_diamonds, prop = .1)
d_training  <- training(init_split)

d_training_dummied_using_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_dummy(all_nominal()) %>% 
  prep() %>%
  bake(new_data = NULL) # equivalent to `juice()`. It means to get the training data (`d_training`) after the steps in the recipe were applied to it.

d_training_dummied_using_recipe
#> # A tibble: 5,394 x 6
#>    carat price  cut_1  cut_2     cut_3  cut_4
#>    <dbl> <int>  <dbl>  <dbl>     <dbl>  <dbl>
#>  1  0.5   1678 -0.316 -0.267  6.32e- 1 -0.478
#>  2  0.7   2608 -0.316 -0.267  6.32e- 1 -0.478
#>  3  1.7   9996  0.316 -0.267 -6.32e- 1 -0.478
#>  4  0.73  1824  0.316 -0.267 -6.32e- 1 -0.478
#>  5  0.4    988  0.632  0.535  3.16e- 1  0.120
#>  6  1.04  4240  0.316 -0.267 -6.32e- 1 -0.478
#>  7  0.9   3950  0     -0.535 -4.10e-16  0.717
#>  8  0.4   1116  0     -0.535 -4.10e-16  0.717
#>  9  1.34 10070  0.632  0.535  3.16e- 1  0.120
#> 10  0.6    806  0.316 -0.267 -6.32e- 1 -0.478
#> # ... with 5,384 more rows

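My question is: given d_training, how can we get the same output as d_training_dummied_using_recipe using dplyr or tidyr (and possibly forcats) functions? I have seen posts such as this one, but they do not seem to fit the current situation.

EDIT

Clearly, step_dummy() operates only on the cut column, because we specified all_nominal() and cut is the only nominal variable in d_training. I assumed the cut_* columns would correspond to the levels of cut, but then I ran: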

levels(d_training$cut)
#> [1] "Fair"      "Good"      "Very Good" "Premium"   "Ideal"

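This shows 5 levels, while there are only 4 cut_* columns. So that is one obstacle to understanding what is going on. Also, how are the values in the cut_* columns generated?

EDIT 2

I've come across a very relevant vignette, How are categorical predictors handled in recipes?, which discusses exactly this topic.

A contrast function in R is a method for translating a column with categorical values into one or more numeric columns that take the place of the original. This can also be known as an encoding method or a parameterization function.

The default approach is to create dummy variables using the "reference cell" parameterization. This means that, if there are C levels of the factor, there will be C - 1 dummy variables created and all but the first factor level are made into new columns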
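The quoted rule can be checked directly in base R. The following is a minimal sketch, assuming a made-up three-level factor `f` (hypothetical data, not taken from the question) and R's default contrast options:

```r
# Hypothetical unordered factor with C = 3 levels
f <- factor(c("a", "b", "c", "a"), levels = c("a", "b", "c"))

# Default contrasts for an unordered factor are treatment ("reference cell")
# contrasts: C - 1 = 2 columns, and the first level "a" gets no column
contrasts(f)
#>   b c
#> a 0 0
#> b 1 0
#> c 0 1

# model.matrix() applies those contrasts row by row
m <- model.matrix(~ f)
colnames(m)
#> [1] "(Intercept)" "fb"          "fc"

# The same values as an ORDERED factor get polynomial contrasts instead,
# whose columns are not tied to individual levels
o <- factor(f, ordered = TRUE)
colnames(contrasts(o))
#> [1] ".L" ".Q"
```

The last call hints at why an ordered factor such as cut does not produce one indicator column per level.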

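Regarding the number of levels versus the number of cut_* columns, the vignette says explicitly:

Note that the column names do not reference a specific level of the [...] variable. This contrast function has columns that can involve multiple levels; level-specific columns wouldn't make sense.

But in the end there is no example of how to do the same thing with regular tools (outside the recipes context). So my original question remains unresolved.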

【Comments】:

Tags: r dplyr tidyr dummy-variable r-recipes

【Answer 1】:

You can look at the source code for step_dummy(); I'm not sure I would call it a black box per se. Notice that during bake(), it uses model.matrix() from base R.

library(rsample)
library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step

data(diamonds, package = "ggplot2")

my_diamonds <- diamonds[, c("carat", "cut", "price")]
init_split  <- initial_split(my_diamonds, prop = .1)
d_training  <- training(init_split)

d_training_dummied_using_recipe <-
  recipe(formula = price ~ ., data = d_training) %>%
  step_dummy(all_nominal()) %>% 
  prep() %>%
  bake(new_data = NULL) 

d_training_dummied_using_recipe
#> # A tibble: 5,394 × 6
#>    carat price     cut_1  cut_2     cut_3  cut_4
#>    <dbl> <int>     <dbl>  <dbl>     <dbl>  <dbl>
#>  1  0.31   544 -3.16e- 1 -0.267  6.32e- 1 -0.478
#>  2  0.72  3294  6.32e- 1  0.535  3.16e- 1  0.120
#>  3  0.7   2257 -1.48e-18 -0.535 -3.89e-16  0.717
#>  4  0.5   1446  6.32e- 1  0.535  3.16e- 1  0.120
#>  5  0.31   772  6.32e- 1  0.535  3.16e- 1  0.120
#>  6  1.01  3733  3.16e- 1 -0.267 -6.32e- 1 -0.478
#>  7  0.31   942  6.32e- 1  0.535  3.16e- 1  0.120
#>  8  0.43   903 -3.16e- 1 -0.267  6.32e- 1 -0.478
#>  9  1.21  4391  3.16e- 1 -0.267 -6.32e- 1 -0.478
#> 10  1.37  5370  3.16e- 1 -0.267 -6.32e- 1 -0.478
#> # … with 5,384 more rows


model.matrix(price ~ .,
             data = d_training) %>%
  as_tibble()
#> # A tibble: 5,394 × 6
#>    `(Intercept)` carat     cut.L  cut.Q     cut.C `cut^4`
#>            <dbl> <dbl>     <dbl>  <dbl>     <dbl>   <dbl>
#>  1             1  0.31 -3.16e- 1 -0.267  6.32e- 1  -0.478
#>  2             1  0.72  6.32e- 1  0.535  3.16e- 1   0.120
#>  3             1  0.7  -1.48e-18 -0.535 -3.89e-16   0.717
#>  4             1  0.5   6.32e- 1  0.535  3.16e- 1   0.120
#>  5             1  0.31  6.32e- 1  0.535  3.16e- 1   0.120
#>  6             1  1.01  3.16e- 1 -0.267 -6.32e- 1  -0.478
#>  7             1  0.31  6.32e- 1  0.535  3.16e- 1   0.120
#>  8             1  0.43 -3.16e- 1 -0.267  6.32e- 1  -0.478
#>  9             1  1.21  3.16e- 1 -0.267 -6.32e- 1  -0.478
#> 10             1  1.37  3.16e- 1 -0.267 -6.32e- 1  -0.478
#> # … with 5,384 more rows

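Created on 2021-12-30 by the reprex package (v2.0.1)

The recipes implementation of creating these indicator variables sets up some guardrails and conveniences for learning from training data and applying to new or testing data, as well as more standard naming, etc. This is probably an especially confusing example because cut is an ordered factor.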
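For contrast, when the factor is not ordered the result is plain 0/1 indicator columns, which are straightforward to build with dplyr/tidyr. The following is only a sketch under stated assumptions: `toy` is a made-up tibble with an unordered, three-level `cut`, and the `cut_<level>` column names are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with an UNORDERED factor (not the real diamonds data)
toy <- tibble(
  carat = c(0.3, 0.7, 1.0, 0.5),
  cut   = factor(c("Good", "Ideal", "Fair", "Good"),
                 levels = c("Fair", "Good", "Ideal"))
)

dummied <- toy %>%
  mutate(row = row_number(), value = 1) %>%      # row id keeps duplicate rows apart
  pivot_wider(names_from = cut, values_from = value,
              names_prefix = "cut_", values_fill = 0) %>%
  select(-row, -cut_Fair)                        # drop the reference (first) level

dummied

# Cross-check against base R's treatment contrasts
base_dummies <- model.matrix(~ cut, data = toy)[, -1]
all(as.matrix(dummied[, c("cut_Good", "cut_Ideal")]) == base_dummies)
```

This shortcut relies on every non-reference level appearing in the data; a level that never occurs would not get a column, which is one of the cases the recipe's prep()/bake() split guards against.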

【Comments】:

【Answer 2】:

This is only half an answer, but it should help you see how the cut_* columns are mapped. Try this link for a more detailed look: https://recipes.tidymodels.org/articles/Dummies.html

library(tidyverse)
library(recipes)


diamonds |> 
  select(carat, cut, price) |>
  mutate(original = cut) |>
  (\(d) recipe(formula = price ~ ., data = d))() |>
  step_dummy(cut) |>
  prep()|>
  bake(new_data = NULL, original, starts_with("cut")) |>
  distinct() 
#> # A tibble: 5 x 5
#>   original   cut_1  cut_2     cut_3  cut_4
#>   <ord>      <dbl>  <dbl>     <dbl>  <dbl>
#> 1 Ideal      0.632  0.535  3.16e- 1  0.120
#> 2 Premium    0.316 -0.267 -6.32e- 1 -0.478
#> 3 Good      -0.316 -0.267  6.32e- 1 -0.478
#> 4 Very Good  0     -0.535 -4.10e-16  0.717
#> 5 Fair      -0.632  0.535 -3.16e- 1  0.120

Edit:

Here is some more detail:

contr.poly(levels(diamonds$cut))
#>              .L         .Q            .C         ^4
#> [1,] -0.6324555  0.5345225 -3.162278e-01  0.1195229
#> [2,] -0.3162278 -0.2672612  6.324555e-01 -0.4780914
#> [3,]  0.0000000 -0.5345225 -4.095972e-16  0.7171372
#> [4,]  0.3162278 -0.2672612 -6.324555e-01 -0.4780914
#> [5,]  0.6324555  0.5345225  3.162278e-01  0.1195229

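The cut_* columns are the contr.poly mapping applied to the levels of cut: row i of the contr.poly matrix holds the cut_1 to cut_4 values for the i-th level. Notice how the cut_* columns in the output above match the rows of the contr.poly matrix.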
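Given that mapping, one way to get the same columns with dplyr is to turn the contr.poly matrix into a lookup table and join it onto the data. This is a minimal sketch, not how recipes does it internally; `toy`, `cut_levels` and `poly_lookup` are illustrative names, and the toy rows stand in for d_training.

```r
library(dplyr)

cut_levels <- c("Fair", "Good", "Very Good", "Premium", "Ideal")

# Hypothetical stand-in for d_training, with `cut` as an ordered factor
toy <- tibble(
  carat = c(0.31, 0.72, 0.70, 1.01),
  cut   = factor(c("Good", "Ideal", "Very Good", "Premium"),
                 levels = cut_levels, ordered = TRUE),
  price = c(544L, 3294L, 2257L, 3733L)
)

# Lookup table: one row per level, one column per polynomial contrast
poly <- contr.poly(length(cut_levels))
colnames(poly) <- paste0("cut_", seq_len(ncol(poly)))
poly_lookup <- as_tibble(poly) %>%
  mutate(cut = factor(cut_levels, levels = cut_levels, ordered = TRUE),
         .before = 1)

# Attach the contrast values to each row, then drop the original factor
dummied <- toy %>%
  left_join(poly_lookup, by = "cut") %>%
  select(carat, price, starts_with("cut_"))

dummied

# Cross-check against what model.matrix() produces for the same factor
all.equal(unname(as.matrix(dummied[, 3:6])),
          unname(model.matrix(~ cut, data = toy)[, -1]))
```

Because the lookup is built from the full set of levels rather than from the rows present, it gives the same values on new data, which is roughly the job prep() and bake() divide between them.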

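【Comments】:

Thanks for your reply! Unfortunately, this still doesn't address the black-box behavior of step_dummy(). The link you cited is the same one I discussed under EDIT 2.
Did my edit help? I think you can see from contr.poly how cut maps to the cut_* columns.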