Create dummy variables from all categorical variables in a data frame: answers

【Question title】: Create dummy variables from all categorical variables in a dataframe
【Posted】: 2019-05-05 04:59:44
【Question】:

I need to one-hot encode all categorical columns in a data frame. I found something like this:

one_hot <- function(df, key) {
  key_col <- dplyr::select_var(names(df), !! rlang::enquo(key))
  df <- df %>% mutate(.value = 1, .id = seq(n()))
  df <- df %>% tidyr::spread_(key_col, ".value", fill = 0, sep = "_") %>% 
  select(-.id)
}

But I can't figure out how to apply it to all categorical columns.

keys <- select_if(data, is.character)[-c(1:2)]
tmp <- map(keys, function(names) reduce(data, ~one_hot(.x, keys)))

This throws the following error:

Error: `var` must evaluate to a single number or a column name, not a list

Update:

customers <- data.frame(
  id=c(10, 20, 30, 40, 50),
  gender=c('male', 'female', 'female', 'male', 'female'),
  mood=c('happy', 'sad', 'happy', 'sad','happy'),
  outcome=c(1, 1, 0, 0, 0))
customers

After encoding:

  id gender.female gender.male mood.happy mood.sad outcome
1 10             0           1          1        0       1
2 20             1           0          0        1       1
3 30             1           0          1        0       0
4 40             0           1          0        1       0
5 50             1           0          1        0       0

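【Question comments】:

Could you provide a small example data frame and show what you want the result to look like? That would help people answer your question.
Done. But imagine that I have many more categorical features.

Tags: r tidyverse one-hot-encoding

【Solution 1】:

There is also a one-liner with the fastDummies package.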

fastDummies::dummy_cols(customers)

  id gender  mood outcome gender_male gender_female mood_happy mood_sad
1 10   male happy       1           1             0          1        0
2 20 female   sad       1           0             1          0        1
3 30 female happy       0           0             1          1        0
4 40   male   sad       0           1             0          0        1
5 50 female happy       0           0             1          1        0
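The output above keeps the original gender and mood columns, while the target output in the question drops them. A minimal sketch of one way to get closer to that target, assuming an installed fastDummies version that supports the remove_selected_columns argument (the customers data frame is repeated here only to make the block self-contained):

```r
library(fastDummies)

# Sample data from the question
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c("male", "female", "female", "male", "female"),
  mood = c("happy", "sad", "happy", "sad", "happy"),
  outcome = c(1, 1, 0, 0, 0)
)

# select_columns is spelled out so it is explicit which columns get encoded;
# remove_selected_columns = TRUE drops the original gender and mood columns
encoded <- dummy_cols(
  customers,
  select_columns = c("gender", "mood"),
  remove_selected_columns = TRUE
)

print(encoded)
```

The dummy columns are appended after the remaining original columns, so the column order may still differ from the layout shown in the question.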

【Comments】:

【Solution 2】:

Using the dummies package:

library(dummies)
dummy.data.frame(customers)

  id genderfemale gendermale moodhappy moodsad outcome
1 10            0          1         1       0       1
2 20            1          0         0       1       1
3 30            1          0         1       0       0
4 40            0          1         0       1       0
5 50            1          0         1       0       0

【Comments】:

【Solution 3】:

Here is an approach using the recipes package.

library(dplyr)
library(recipes)

# Declare which variables are the predictors
recipe(formula = outcome ~ .,
       data = customers) %>%
  # Declare that one-hot encoding will be applied to all nominal variables
  step_dummy(all_nominal(),
             one_hot = TRUE) %>%
  # Based on the previous declarations, apply the transformations to the data
  # and return the resulting data frame
  prep() %>%
  juice()

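【Comments】:

Thanks for the answer. recipes::step_dummy offers a lot of flexibility and makes it easy to create dummy variables for a subset of the variables, more explicitly than the other approaches do.

【Solution 4】:

A one-liner with mltools and data.table: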
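As a rough illustration of the subset selection this comment refers to, the sketch below encodes only gender and leaves mood untouched. It assumes the same customers data as the question; the columns are created as factors explicitly here so the example does not depend on how strings are converted during prep().

```r
library(dplyr)
library(recipes)

# Sample data from the question, with the string columns stored as factors
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c("male", "female", "female", "male", "female"),
  mood = c("happy", "sad", "happy", "sad", "happy"),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = TRUE
)

encoded <- recipe(outcome ~ ., data = customers) %>%
  # Name the column(s) to encode instead of using all_nominal()
  step_dummy(gender, one_hot = TRUE) %>%
  prep() %>%
  juice()

print(encoded)
```

Here gender is replaced by one indicator column per level, while mood stays a single factor column.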

mltools::one_hot(data.table::as.data.table(customers))

   id gender_female gender_male mood_happy mood_sad outcome
1: 10             0           1          1        0       1
2: 20             1           0          0        1       1
3: 30             1           0          1        0       0
4: 40             0           1          0        1       0
5: 50             1           0          1        0       0

It handles all factor variables in one go and has some nice built-in options for handling NAs and unused factor levels.

【Comments】:
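Because this approach works on factor columns, character columns may need converting first. Since R 4.0, data.frame() leaves strings as character by default. A minimal sketch under that assumption, using the customers data from the question (char_cols is just an illustrative helper name):

```r
library(data.table)
library(mltools)

# Sample data from the question; strings are kept as character on purpose
customers <- data.frame(
  id = c(10, 20, 30, 40, 50),
  gender = c("male", "female", "female", "male", "female"),
  mood = c("happy", "sad", "happy", "sad", "happy"),
  outcome = c(1, 1, 0, 0, 0),
  stringsAsFactors = FALSE
)

dt <- as.data.table(customers)

# Find the character columns and convert them to factors in place
char_cols <- names(dt)[vapply(dt, is.character, logical(1))]
dt[, (char_cols) := lapply(.SD, as.factor), .SDcols = char_cols]

# one_hot() now sees gender and mood as factor columns
encoded <- one_hot(dt)

print(encoded)
```

Converting every character column is a deliberate simplification; columns such as free-text identifiers would normally be excluded from char_cols before encoding.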