【Question title】: How to create dummies based on two columns in R
【Posted】: 2021-07-17 13:35:34
【Question】:

Suppose I have a data frame in which Gender takes F for female or M for male, and Race takes A for Asian, W for White, B for Black, and H for Hispanic.

| id | Gender | Race |
| --- | ----- | ---- |
| 1   | F    | W |
| 2   | F    | B |
| 3   | M    | A |
| 4   | F    | B |
| 5   | M    | W |
| 6   | M    | B |
| 7   | F    | H |

I want a set of dummy columns based on the combination of Gender and Race. The data frame should look like this:

| id | Gender | Race | F_W | F_B | F_A | F_H | M_W | M_B | M_A | M_H |
| --- | ----- | ---- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1   | F    | W   |  1  |  0  |  0  |  0  |  0  |  0  |  0  |  0  |
| 2   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 3   | M    | A   |  0  |  0  |  0  |  0  |  0  |  0  |  1  |  0  |
| 4   | F    | B   |  0  |  1  |  0  |  0  |  0  |  0  |  0  |  0  |
| 5   | M    | W   |  0  |  0  |  0  |  0  |  1  |  0  |  0  |  0  |
| 6   | M    | B   |  0  |  0  |  0  |  0  |  0  |  1  |  0  |  0  |
| 7   | F    | H   |  0  |  0  |  0  |  1  |  0  |  0  |  0  |  0  |

My actual data has many more categories than this example, so I would appreciate a concise way of doing this. The language is R. Thanks for your help.

【Question comments】:

    Tags: r dummy-variable


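    【Solution 1】:

    Apart from the column names, you can get this with the model.matrix function and a formula that contains only the interaction term and removes the intercept: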

    > dm = cbind(d,model.matrix(~Gender:Race-1, data=d))
    > dm
       id Gender Race GenderF:RaceA GenderM:RaceA GenderF:RaceB GenderM:RaceB
    1   1      F    H             0             0             0             0
    2   2      M    H             0             0             0             0
    3   3      M    W             0             0             0             0
    4   4      F    H             0             0             0             0
    5   5      M    H             0             0             0             0
    [etc]
    

    If you care about the exact names, they are easy to fix with a little string processing.

    > names(dm)[-(1:3)] = sub("Gender","",sub("Race","",sub(":","_",names(dm)[-(1:3)])))
    > dm
       id Gender Race F_A M_A F_B M_B F_H M_H F_W M_W
    1   1      F    H   0   0   0   0   1   0   0   0
    2   2      M    H   0   0   0   0   0   1   0   0
    3   3      M    W   0   0   0   0   0   0   0   1
    4   4      F    H   0   0   0   0   1   0   0   0
    5   5      M    H   0   0   0   0   0   1   0   0
    6   6      F    H   0   0   0   0   1   0   0   0
    7   7      F    H   0   0   0   0   1   0   0   0
    8   8      M    A   0   1   0   0   0   0   0   0
    9   9      M    W   0   0   0   0   0   0   0   1
    10 10      F    B   0   0   1   0   0   0   0   0
    

    If you care about the column order....
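
The remark about column order can be made concrete on the question's own seven rows. The following is a minimal sketch, not the answerer's code: the names `df1`, `mm`, `wanted`, and `dm` are chosen here for illustration, and the column order in `wanted` is simply copied from the desired output in the question. Because the interaction term yields one column per pair of levels, pairs that never occur in the data (F_A and M_H) come out as all-zero columns.

```r
# Example data from the question
df1 <- data.frame(
  id = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race = c("W", "B", "A", "B", "W", "B", "H")
)

# Convert to factors so model.matrix codes them as categorical variables
d <- transform(df1, Gender = factor(Gender), Race = factor(Race))

# Interaction-only design matrix without intercept: one column per level pair
mm <- model.matrix(~ Gender:Race - 1, data = d)

# "GenderF:RaceW" -> "F_W"
colnames(mm) <- sub("Gender", "", sub("Race", "", sub(":", "_", colnames(mm))))

# Column order as in the question's desired output (an assumption of this sketch)
wanted <- c("F_W", "F_B", "F_A", "F_H", "M_W", "M_B", "M_A", "M_H")
dm <- cbind(df1, mm[, wanted])
dm
```

Indexing the matrix by name with `mm[, wanted]` both reorders the columns and fails loudly if an expected name is missing, which is a cheap check when the real data has many more categories.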

    【Comments】:

      【Solution 2】:

      Another base R option, using xtabs

      cbind(
          df,
          as.data.frame.matrix(
              xtabs(
                  ~ id + q,
                  transform(
                      df,
                      q = paste0(Gender, "_", Race)
                  )
              )
          )
      )
      

      gives

        id Gender Race F_B F_H F_W M_A M_B M_W
      1  1      F    W   0   0   1   0   0   0
      2  2      F    B   1   0   0   0   0   0
      3  3      M    A   0   0   0   1   0   0
      4  4      F    B   1   0   0   0   0   0
      5  5      M    W   0   0   0   0   0   1
      6  6      M    B   0   0   0   0   1   0
      7  7      F    H   0   1   0   0   0   0
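
The output above has six dummy columns because only the combinations that occur in the data are tabulated, whereas the question asks for all eight. One possible adjustment, sketched here under the assumption that the full sets of Gender and Race codes are known in advance, is to turn the pasted key into a factor with every combination as a level. `xtabs` keeps unused factor levels by default (`drop.unused.levels = FALSE`), so the absent pairs appear as zero columns. The names `combos`, `q`, and `res` are illustrative only.

```r
# Example data from the question
df <- data.frame(
  id = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race = c("W", "B", "A", "B", "W", "B", "H")
)

# Every Gender_Race combination, in the column order wanted:
# F_W F_B F_A F_H M_W M_B M_A M_H
combos <- paste(rep(c("F", "M"), each = 4), c("W", "B", "A", "H"), sep = "_")

# A factor key whose levels include combinations that never occur
q <- factor(paste0(df$Gender, "_", df$Race), levels = combos)

# Cross-tabulate id against the key; unused levels are kept as zero columns
res <- cbind(
  df,
  as.data.frame.matrix(xtabs(~ id + q, data.frame(id = df$id, q = q)))
)
res
```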
      

      【Comments】:

        【Solution 3】:

        A base R option with table

         cbind(df1, as.data.frame.matrix(table(transform(df1, 
            GenderRace = paste(Gender, Race, sep = "_"))[c("id", "GenderRace")])))
          id Gender Race F_B F_H F_W M_A M_B M_W
        1  1      F    W   0   0   1   0   0   0
        2  2      F    B   1   0   0   0   0   0
        3  3      M    A   0   0   0   1   0   0
        4  4      F    B   1   0   0   0   0   0
        5  5      M    W   0   0   0   0   0   1
        6  6      M    B   0   0   0   0   1   0
        7  7      F    H   0   1   0   0   0   0
        

        Data

        df1 <- structure(list(id = 1:7, Gender = c("F", "F", "M", "F", "M", 
        "M", "F"), Race = c("W", "B", "A", "B", "W", "B", "H")), 
        class = "data.frame", row.names = c(NA, 
        -7L))
        

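        【Comments】:

          【Solution 4】:

          I think you can use the following solution. It produces two fewer columns than your desired output, but those columns would have been all zeros anyway, because pivot_wider only spreads the combinations that actually occur in the data set.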

          library(dplyr)
          library(tidyr)
          
          df %>%
            mutate(grp = 1) %>%
            pivot_wider(names_from = c(Gender, Race), values_from = grp, 
                        values_fill = 0, names_glue = "{Gender}_{Race}") %>%
            right_join(df, by = "id") %>%
            relocate(id, Gender, Race)
          
          # A tibble: 7 x 9
               id Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
            <int> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
          1     1 F      W         1     0     0     0     0     0
          2     2 F      B         0     1     0     0     0     0
          3     3 M      A         0     0     1     0     0     0
          4     4 F      B         0     1     0     0     0     0
          5     5 M      W         0     0     0     1     0     0
          6     6 M      B         0     0     0     0     1     0
          7     7 F      H         0     0     0     0     0     1
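
If the two all-zero columns are needed after all, they can be appended once the dummies exist. The sketch below is in base R rather than the tidyverse so that it stands alone, and it is only one way to do this: it rebuilds observed-only dummies with `table`, then adds whatever is missing from a hypothetical `wanted` vector listing every combination in the order shown in the question.

```r
# Example data from the question
df <- data.frame(
  id = 1:7,
  Gender = c("F", "F", "M", "F", "M", "M", "F"),
  Race = c("W", "B", "A", "B", "W", "B", "H")
)

# Dummies for the observed combinations only (six columns here)
key <- paste(df$Gender, df$Race, sep = "_")
res <- cbind(df, as.data.frame.matrix(table(df$id, key)))

# Full set of combinations in the desired order (assumed known up front)
wanted <- paste(rep(c("F", "M"), each = 4), c("W", "B", "A", "H"), sep = "_")

# Add an all-zero column for each combination that never occurred
missing <- setdiff(wanted, names(res))
res[missing] <- 0L

# Put the columns in the desired order
res <- res[c("id", "Gender", "Race", wanted)]
res
```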
          

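          【Comments】:

          • Glad you didn't delete your solution!
          • Yes, I had a bad feeling about it. It isn't a wrong answer, though.
          • Thank you!
          【Solution 5】:

          In addition to Anoushiravan R's tidyverse solution, here is another option with unite, pivot_wider, across, and case_when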

          library(tidyverse)
            df %>% 
              unite(comb, Gender:Race, remove = FALSE) %>% 
              pivot_wider(
                names_from = comb,
                values_from = comb
              ) %>% 
              mutate(across(c(F_W, F_B, M_A, M_W, M_B, F_H), 
                            ~ case_when(is.na(.) ~ 0, 
                                        TRUE ~ 1)))
          

          Output:

            id    Gender Race    F_W   F_B   M_A   M_W   M_B   F_H
            <chr> <chr>  <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
          1 1     F      W         1     0     0     0     0     0
          2 2     F      B         0     1     0     0     0     0
          3 3     M      A         0     0     1     0     0     0
          4 4     F      B         0     1     0     0     0     0
          5 5     M      W         0     0     0     1     0     0
          6 6     M      B         0     0     0     0     1     0
          7 7     F      H         0     0     0     0     0     1
          

          【Comments】:
