Convert multiple dummy/logical variables into a single categorical variable in R dplyr: answers

【Question title】: Convert multiple dummy/logical variables into a single categorical variable in R dplyr
【Posted】: 2022-01-04 14:07:31
【Question】:

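I have a question similar to this one. I want to convert various dummy/logical variables into a single categorical variable/factor based on their names in R. My question differs in that there may be many groups of variables to encode, for example age and chol_test here. This is only a subset of my data frame. Other variables such as diabetes_test also need converting, so I can't just use starts_with("condition").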

I want to code low as 1, medium as 2, and high as 3. If all the dummy variables in a group are 0, the result should be left as N/A.

list(low = 1, medium = 2, high = 3)

The data basically look like this:

Input

  race  gender age.low_tm1 age.medium_tm1 age.high_tm1 chol_test.low_tm1 chol_test.high_tm1
  <chr>  <int>       <int>          <int>        <int>             <int>              <int>
1 white      0           1              0            0                 0                  0
2 white      0           1              0            0                 0                  0
3 white      1           1              0            0                 0                  0
4 black      1           0              1            0                 0                  0
5 white      0           0              0            1                 0                  1
6 black      0           0              1            0                 1                  0

I would like the output to look like this:

Expected output:

  race  gender   age  chol_test
1 white      0     1        n/a  
2 white      0     1        n/a
3 white      1     1        n/a
4 black      1     2        n/a
5 white      0     3          3
6 black      0     2          1

How can I do this? If possible, I'm looking for a dplyr solution similar to the ones posted in the question I linked. Apologies for any redundancy.

Data

df <- structure(list(race = c("white", "white", "white", "black", "white", 
"black"), gender = c(0L, 0L, 1L, 1L, 0L, 0L), age.low_tm1 = c(1L, 
1L, 1L, 0L, 0L, 0L), age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L
), age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L), chol_test.low_tm1 = c(0L, 
0L, 0L, 0L, 0L, 1L), chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 
0L)), class = "data.frame", row.names = c("1", "2", "3", "4", 
"5", "6"))

【Question comments】:

Tags: r dplyr

【Solution 1】:

I would do it like this:

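library(dplyr)
library(tidyr)

df %>% 
  # row id, so each original row can be rebuilt after reshaping
  mutate(id = row_number()) %>%
  # one row per (id, dummy column)
  pivot_longer(cols = -c(race, gender, id)) %>%
  # keep only the dummies that are set
  filter(value > 0) %>%
  # "age.low_tm1" becomes var = "age", range1 = "low_tm1"
  separate(name, c("var", "range1"), sep = '\\.') %>%
  mutate(
    value = case_when(
      range1 == 'low_tm1' ~ 1, 
      range1 == 'medium_tm1' ~ 2, 
      range1 == 'high_tm1' ~ 3
    )
  ) %>%
  select(-range1) %>%
  # back to one column per variable group
  pivot_wider(names_from = var, values_from = value) %>%
  select(-id)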

  race  gender   age chol_test
  <chr>  <int> <dbl>     <dbl>
1 white      0     1        NA
2 white      0     1        NA
3 white      1     1        NA
4 black      1     2        NA
5 white      0     3         3
6 black      0     2         1
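
For comparison, the same table can be reached without reshaping. The sketch below is not part of the answer above: `codes` and `collapse_dummies()` are hypothetical names introduced here. It assumes every dummy column is named `<group>.<level>_tm1` and that at most one level is set per row (if several are set, the highest code wins).

```r
df <- structure(list(race = c("white", "white", "white", "black", "white",
"black"), gender = c(0L, 0L, 1L, 1L, 0L, 0L), age.low_tm1 = c(1L,
1L, 1L, 0L, 0L, 0L), age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L
), age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L), chol_test.low_tm1 = c(0L,
0L, 0L, 0L, 0L, 1L), chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L,
0L)), class = "data.frame", row.names = c("1", "2", "3", "4",
"5", "6"))

# Mapping from the question: low = 1, medium = 2, high = 3
codes <- c(low = 1, medium = 2, high = 3)

# Hypothetical helper: collapse one group of dummy columns into one code.
# Rows where no dummy in the group is set stay NA.
collapse_dummies <- function(data, prefix, suffix = "_tm1") {
  out <- rep(NA_real_, nrow(data))
  for (level in names(codes)) {
    col <- paste0(prefix, ".", level, suffix)
    # A group may lack some levels (chol_test has no medium column)
    if (col %in% names(data)) {
      out[which(data[[col]] > 0)] <- codes[[level]]
    }
  }
  out
}

groups <- c("age", "chol_test")   # assumed list of group prefixes
result <- df[c("race", "gender")]
for (g in groups) result[[g]] <- collapse_dummies(df, g)
result
```

Adding another group such as diabetes_test would only mean appending its prefix to `groups`.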

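【Comments】:

This is great! However, I get a warning: "Values are not uniquely identified; output will contain list-cols." Do you know why that happens? With the full dataset, I get c(1,3,NA) values for chol_test in every row.
Probably in pivot_wider. Does it run up to that point? If so, are there any rows where chol_test.low_tm1 = 1 and chol_test.medium_tm1 = 1?
No, but in some cases all of the chol_test variables can be 0, so I think your filter (value > 0) would drop some ids..
For example, there is a row where age and chol_test are all 0; with the filter condition removed, the warning appears.
The filter needs to be included. Does it run with it?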
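
The thread points at two separate effects. With `filter(value > 0)` in place, a row whose dummies are all 0 loses every one of its long-format rows and so disappears from the result. With the filter removed, each (id, var) pair has several rows, which is what makes `pivot_wider` warn and return list-columns. A minimal sketch of one way to avoid both follows. It turns unset dummies into NA instead of filtering them out, then lets `pivot_wider` reduce each group to a single value. The seventh data row, `codes`, and `pick_code()` are assumptions added for illustration, and the sketch assumes a tidyr version in which `values_fn` accepts a single function.

```r
library(dplyr)
library(tidyr)

# The question's data plus one hypothetical extra row (row 7) with every dummy at 0
df <- data.frame(
  race = c("white", "white", "white", "black", "white", "black", "black"),
  gender = c(0L, 0L, 1L, 1L, 0L, 0L, 1L),
  age.low_tm1 = c(1L, 1L, 1L, 0L, 0L, 0L, 0L),
  age.medium_tm1 = c(0L, 0L, 0L, 1L, 0L, 1L, 0L),
  age.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L, 0L),
  chol_test.low_tm1 = c(0L, 0L, 0L, 0L, 0L, 1L, 0L),
  chol_test.high_tm1 = c(0L, 0L, 0L, 0L, 1L, 0L, 0L)
)

codes <- c(low = 1, medium = 2, high = 3)

# Highest non-missing code in a group, or NA when no dummy was set
pick_code <- function(x) if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)

out <- df %>%
  mutate(id = row_number()) %>%
  # Split "age.low_tm1" into var = "age" and level = "low" while reshaping
  pivot_longer(
    cols = matches("\\.(low|medium|high)_tm1$"),
    names_to = c("var", "level"),
    names_pattern = "^(.*)\\.(low|medium|high)_tm1$"
  ) %>%
  # Unset dummies become NA rather than being filtered away
  mutate(code = if_else(value > 0, unname(codes[level]), NA_real_)) %>%
  select(-level, -value) %>%
  # One value per (id, var), so no list-columns and no dropped ids
  pivot_wider(names_from = var, values_from = code, values_fn = pick_code) %>%
  select(-id)

out
```

Selecting the dummy columns by pattern rather than by excluding race and gender means further groups such as diabetes_test would be picked up as long as they follow the same naming scheme.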