How to create dummies based on two columns in R: answers

【Question title】: How to create dummies based on two columns in R
【Posted】: 2021-07-17 13:35:34
【Question description】:

Suppose I have a data frame in which Gender can be F (female) or M (male), and Race can be A (Asian), W (White), B (Black) or H (Hispanic).

| id | Gender | Race |
| --- | ----- | ---- |
| 1   | F    | W |
| 2   | F    | B |
| 3   | M    | A |
| 4   | F    | B |
| 5   | M    | W |
| 6   | M    | B |
| 7   | F    | H |

I would like a set of dummy columns based on Gender and Race, so that the data frame looks like this:

| id | Gender | Race | F_W | F_B | F_A | F_H | M_W | M_B | M_A | M_H |
| --- | ----- | ---- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1   | F    | W   |  1  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
| 2   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 3   | M    | A   |  0  |  0  |  0  |  0  |  0  |  0  |  1  |  0  |
| 4   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 5   | M    | W   |  0  |  0  |  0  |  0  |  1  |  0  |  0  |  0  |
| 6   | M    | B   |  0  |  0  |  0  |  0  |  0  |  1  |  0  |  0  |
| 7   | F    | H   |  0  |  0  |  0  |  1  |  0  |  0  |  0  |  0  |

My actual data has many more categories than this example, so I would appreciate a more concise way to do this. The language is R. Thanks for your help.

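【Comments】:

Tags: r dummy-variable

【Solution 1】:

Apart from the column names, you can get this with the model.matrix function and a formula that contains only the interaction term, minus the intercept: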

> dm = cbind(d,model.matrix(~Gender:Race-1, data=d))
> dm
   id Gender Race GenderF:RaceA GenderM:RaceA GenderF:RaceB GenderM:RaceB
1   1      F    H             0             0             0             0
2   2      M    H             0             0             0             0
3   3      M    W             0             0             0             0
4   4      F    H             0             0             0             0
5   5      M    H             0             0             0             0
[etc]

If you care about the exact names, they are easy to sort out with a bit of string processing.

> names(dm)[-(1:3)] = sub("Gender","",sub("Race","",sub(":","_",names(dm)[-(1:3)])))
> dm
   id Gender Race F_A M_A F_B M_B F_H M_H F_W M_W
1   1      F    H   0   0   0   0   1   0   0   0
2   2      M    H   0   0   0   0   0   1   0   0
3   3      M    W   0   0   0   0   0   0   0   1
4   4      F    H   0   0   0   0   1   0   0   0
5   5      M    H   0   0   0   0   0   1   0   0
6   6      F    H   0   0   0   0   1   0   0   0
7   7      F    H   0   0   0   0   1   0   0   0
8   8      M    A   0   1   0   0   0   0   0   0
9   9      M    W   0   0   0   0   0   0   0   1
10 10      F    B   0   0   1   0   0   0   0   0

If you care about the column order....
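One possible way to settle the column order, sketched here on the question's sample data rather than on the data frame `d` printed above: build the names in the order you want and index the data frame by them. The vectors `genders`, `races` and `wanted` are illustrative names, not part of the original answer.

```r
# Sample data from the question (the answer above used a different data frame `d`)
d <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

# Interaction-only dummies, followed by the same renaming step as above
dm <- cbind(d, model.matrix(~ Gender:Race - 1, data = d))
names(dm)[-(1:3)] <- sub("Gender", "", sub("Race", "", sub(":", "_", names(dm)[-(1:3)])))

# Desired order: every Race within F, then every Race within M
genders <- c("F", "M")
races   <- c("W", "B", "A", "H")
wanted  <- paste(rep(genders, each = length(races)), races, sep = "_")

# Indexing by name reorders the dummy columns
dm <- dm[c("id", "Gender", "Race", wanted)]
```

Because model.matrix creates a column for every combination of the factor levels present in each variable, combinations that never occur together (here F_A and M_H) still appear as all-zero columns.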

【Discussion】:

【Solution 2】:

Another base R option, using xtabs

cbind(
    df,
    as.data.frame.matrix(
        xtabs(
            ~ id + q,
            transform(
                df,
                q = paste0(Gender, "_", Race)
            )
        )
    )
)

which gives

  id Gender Race F_B F_H F_W M_A M_B M_W
1  1      F    W   0   0   1   0   0   0
2  2      F    B   1   0   0   0   0   0
3  3      M    A   0   0   0   1   0   0
4  4      F    B   1   0   0   0   0   0
5  5      M    W   0   0   0   0   0   1
6  6      M    B   0   0   0   0   1   0
7  7      F    H   0   1   0   0   0   0
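The output above contains only the six combinations that occur in the data. If the all-zero columns from the question (F_A and M_H) are wanted as well, one option is to make the combined variable a factor with every level listed explicitly, since xtabs keeps unused factor levels by default. This is a sketch under that assumption; `genders`, `races` and `all_levels` are hypothetical names introduced here, and, like the code above, it relies on `df` being sorted by a unique `id`.

```r
# Sample data from the question
df <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

# Every Gender/Race combination, in the column order the question asks for
genders    <- c("F", "M")
races      <- c("W", "B", "A", "H")
all_levels <- paste(rep(genders, each = length(races)), races, sep = "_")

# A factor with explicit levels keeps combinations that never occur
q <- factor(paste(df$Gender, df$Race, sep = "_"), levels = all_levels)

# Cross-tabulate id against the combined factor: one row per id, one column per level
dummies <- as.data.frame.matrix(xtabs(~ id + q, data = data.frame(id = df$id, q = q)))

result <- cbind(df, dummies)
```

The level order of the factor also fixes the column order, so no separate reordering step is needed.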

【Discussion】:

【Solution 3】:

A base R option with table

 cbind(df1, as.data.frame.matrix(table(transform(df1, 
    GenderRace = paste(Gender, Race, sep = "_"))[c("id", "GenderRace")])))
  id Gender Race F_B F_H F_W M_A M_B M_W
1  1      F    W   0   0   1   0   0   0
2  2      F    B   1   0   0   0   0   0
3  3      M    A   0   0   0   1   0   0
4  4      F    B   1   0   0   0   0   0
5  5      M    W   0   0   0   0   0   1
6  6      M    B   0   0   0   0   1   0
7  7      F    H   0   1   0   0   0   0

data

df1 <- structure(list(id = 1:7, Gender = c("F", "F", "M", "F", "M", 
"M", "F"), Race = c("W", "B", "A", "B", "W", "B", "H")), 
class = "data.frame", row.names = c(NA, 
-7L))

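【Discussion】:

【Solution 4】:

I think you could use the following solution. It produces 2 fewer variables than your desired output, but those columns would be all zeros anyway, because pivot_wider only spreads the combinations that actually occur in the data set.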

library(dplyr)
library(tidyr)

df %>%
  mutate(grp = 1) %>%
  pivot_wider(names_from = c(Gender, Race), values_from = grp, 
              values_fill = 0, names_glue = "{Gender}_{Race}") %>%
  right_join(df, by = "id") %>%
  relocate(id, Gender, Race)

# A tibble: 7 x 9
     id Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
  <int> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1 F      W         1     0     0     0     0     0
2     2 F      B         0     1     0     0     0     0
3     3 M      A         0     0     1     0     0     0
4     4 F      B         0     1     0     0     0     0
5     5 M      W         0     0     0     1     0     0
6     6 M      B         0     0     0     0     1     0
7     7 F      H         0     0     0     0     0     1

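【Discussion】:

Glad you didn't delete your solution!
Yes, I had a bad feeling about it. It's not a wrong answer.
Thank you!

【Solution 5】:

In addition to Anoushiravan R's tidyverse solution, here is another option with unite, pivot_wider, across and case_when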

library(tidyverse)
  df %>% 
    unite(comb, Gender:Race, remove = FALSE) %>% 
    pivot_wider(
      names_from = comb,
      values_from = comb
    ) %>% 
    mutate(across(c(F_W, F_B, M_A, M_W, M_B, F_H), 
                  ~ case_when(is.na(.) ~ 0, 
                              TRUE ~ 1)))

Output:

  id    Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
  <chr> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1     F      W         1     0     0     0     0     0
2 2     F      B         0     1     0     0     0     0
3 3     M      A         0     0     1     0     0     0
4 4     F      B         0     1     0     0     0     0
5 5     M      W         0     0     0     1     0     0
6 6     M      B         0     0     0     0     1     0
7 7     F      H         0     0     0     0     0     1
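Since the question mentions many more categories than the example, listing every dummy column by hand inside across() may not scale. A sketch of one alternative, assuming the only non-dummy columns are id, Gender and Race: select everything else with a negative selection and turn non-missing values into 1. The name `result` is introduced here for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- data.frame(
  id     = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race   = c("W", "B", "A", "B", "W", "B", "H")
)

result <- df %>%
  # Combine Gender and Race into one key such as "F_W", keeping the originals
  unite(comb, Gender, Race, remove = FALSE) %>%
  # One column per observed key; cells hold the key itself or NA
  pivot_wider(names_from = comb, values_from = comb) %>%
  # Every column except the identifiers becomes a 0/1 indicator
  mutate(across(-c(id, Gender, Race), ~ as.integer(!is.na(.x))))
```

As with the code above, only combinations that occur in the data get a column.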

【Discussion】: