Recode based on lookup table across multiple variables (answers)

[Question title]: Recode based on lookup table across multiple variables
[Posted]: 2019-03-20 09:14:40
[Question]:

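I am trying to recode several columns, each of which has its own recoding rules. As far as I can tell, dplyr::recode() does not accept vectors. Ideally the solution would use the tidyverse rather than a pile of nested loops!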

Here are the sample data and the lookup table:

x <- structure(list(
  MAIN = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
           1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L),
  PREDDEG = c(3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L,
              3L, 2L, 3L, 3L, 2L, 2L, 3L, 1L, 1L, 2L),
  HIGHDEG = c(4L, 4L, 4L, 4L, 4L, 4L, 2L, 3L, 4L, 4L,
              3L, 2L, 3L, 4L, 2L, 2L, 4L, 2L, 1L, 2L),
  CONTROL = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
              2L, 1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L),
  not_to_recode = c("asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf",
                    "asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf")),
  row.names = c(NA, -20L), class = c("tbl_df", "tbl", "data.frame"))
x
#>    MAIN PREDDEG HIGHDEG CONTROL not_to_recode
#> 1     1       3       4       1          asdf
#> 2     1       3       4       1          asdf
#> 3     1       3       4       2          asdf
#> 4     1       3       4       1          asdf
#> 5     1       3       4       1          asdf
#> 6     1       3       4       1          asdf
#> 7     1       2       2       1          asdf
#> 8     1       3       3       1          asdf
#> 9     1       3       4       1          asdf
#> 10    1       3       4       1          asdf
#> 11    1       3       3       2          asdf
#> 12    1       2       2       1          asdf
#> 13    1       3       3       2          asdf
#> 14    0       3       4       3          asdf
#> 15    1       2       2       1          asdf
#> 16    1       2       2       1          asdf
#> 17    1       3       4       2          asdf
#> 18    1       1       2       1          asdf
#> 19    1       1       1       3          asdf
#> 20    1       2       2       1          asdf


lookup <- structure(list(variable_name = c("MAIN", "MAIN", "PREDDEG", "PREDDEG", "PREDDEG", "PREDDEG", "PREDDEG", "HIGHDEG", "HIGHDEG", "HIGHDEG","HIGHDEG", "HIGHDEG", "CONTROL", "CONTROL", "CONTROL"), 
                         value = c(0, 1, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3), 
                         label = c("Not main campus", "Main campus", "Not classified", "Predominantly certificate-degree granting", "Predominantly associate's-degree granting", "Predominantly bachelor's-degree granting", "Entirely graduate-degree granting", "Non-degree-granting", "Certificate degree", "Associate degree", "Bachelor's degree", "Graduate degree", "Public", "Private nonprofit", "Private for-profit")), 
                    row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))

lookup
#>    variable_name value                                     label
#> 1           MAIN     0                           Not main campus
#> 2           MAIN     1                               Main campus
#> 3        PREDDEG     0                            Not classified
#> 4        PREDDEG     1 Predominantly certificate-degree granting
#> 5        PREDDEG     2 Predominantly associate's-degree granting
#> 6        PREDDEG     3  Predominantly bachelor's-degree granting
#> 7        PREDDEG     4         Entirely graduate-degree granting
#> 8        HIGHDEG     0                       Non-degree-granting
#> 9        HIGHDEG     1                        Certificate degree
#> 10       HIGHDEG     2                          Associate degree
#> 11       HIGHDEG     3                         Bachelor's degree
#> 12       HIGHDEG     4                           Graduate degree
#> 13       CONTROL     1                                    Public
#> 14       CONTROL     2                         Private nonprofit
#> 15       CONTROL     3                        Private for-profit

Created on 2018-10-15 by the reprex package (v0.2.1)

[Question comments]:

Tags: r dplyr

[Solution 1]:

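Split the long-form lookup by variable_name, then keep only the tables for columns that exist in x, in the column order of x. (Indexing by names(x) directly would add an empty, NA-named entry for not_to_recode.)

slook <- split(lookup[-1], lookup$variable_name)
slook <- slook[intersect(names(x), names(slook))]

Then use mapply to do the table lookup, with each variable restricted to its own values:

 mapply(function(a, b) b[["label"]][match(a, b$value)], x[names(slook)], slook)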
      MAIN              PREDDEG                                     HIGHDEG              CONTROL             
 [1,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
 [2,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
 [3,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Private nonprofit" 
 [4,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
 [5,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
 [6,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
 [7,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"            
 [8,] "Main campus"     "Predominantly bachelor's-degree granting"  "Bachelor's degree"  "Public"            
 [9,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
[10,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
[11,] "Main campus"     "Predominantly bachelor's-degree granting"  "Bachelor's degree"  "Private nonprofit" 
[12,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"            
[13,] "Main campus"     "Predominantly bachelor's-degree granting"  "Bachelor's degree"  "Private nonprofit" 
[14,] "Not main campus" "Predominantly bachelor's-degree granting"  "Graduate degree"    "Private for-profit"
[15,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"            
[16,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"            
[17,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Private nonprofit" 
[18,] "Main campus"     "Predominantly certificate-degree granting" "Associate degree"   "Public"            
[19,] "Main campus"     "Predominantly certificate-degree granting" "Certificate degree" "Private for-profit"
[20,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"
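
As a minimal sketch of what each mapply step does, the toy example below (hypothetical `codes` and `tbl` objects, not taken from the question) shows how match() finds the row position of each code in a lookup table and how that position then indexes the label column. Codes that are missing from the table come back as NA.

```r
# Hypothetical toy data, for illustration only
codes <- c(2L, 1L, 3L, 9L)
tbl <- data.frame(
  value = c(1, 2, 3),
  label = c("Public", "Private nonprofit", "Private for-profit"),
  stringsAsFactors = FALSE
)

pos <- match(codes, tbl$value)  # row position of each code; NA when absent
labels <- tbl$label[pos]        # labels in the same order as `codes`

print(pos)
print(labels)
```

Because the result has the same length and order as the input, it can be assigned straight back over the original column.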

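To address the comment that the example did not match the real task: the assignment can be restricted to only those columns whose names also appear in the lookup object: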

 x[ , names(slook)] <- mapply(
   function(a, b) {
     b[["label"]][       # the character label column
       match(a, b$value) # row position of each value of `a` in the lookup
     ]
   },
   x[names(slook)],      # matched to `a`: the columns to recode
   slook,                # matched to `b`: the per-variable lookup tables
   SIMPLIFY = FALSE)     # keep a list rather than simplifying to a matrix
> x
# A tibble: 20 x 5
   MAIN            PREDDEG                                   HIGHDEG            CONTROL            not_to_recode
   <chr>           <chr>                                     <chr>              <chr>              <chr>        
 1 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
 2 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
 3 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Private nonprofit  asdf         
 4 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
 5 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
 6 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
 7 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf         
 8 Main campus     Predominantly bachelor's-degree granting  Bachelor's degree  Public             asdf         
 9 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
10 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
11 Main campus     Predominantly bachelor's-degree granting  Bachelor's degree  Private nonprofit  asdf         
12 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf         
13 Main campus     Predominantly bachelor's-degree granting  Bachelor's degree  Private nonprofit  asdf         
14 Not main campus Predominantly bachelor's-degree granting  Graduate degree    Private for-profit asdf         
15 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf         
16 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf         
17 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Private nonprofit  asdf         
18 Main campus     Predominantly certificate-degree granting Associate degree   Public             asdf         
19 Main campus     Predominantly certificate-degree granting Certificate degree Private for-profit asdf         
20 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf

If you want to mimic what mapply does, I believe the purrr package, on the tidyverse track, offers similar functionality. Specifically, you should look at map2:

 help(map2, package = "purrr")  # see also `pmap`
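
A rough sketch of the map2 route, assuming purrr is installed. The data are a cut-down, hypothetical stand-in for x and lookup (the `_small` names are made up for this example), and the intersect() call keeps only the columns that have an entry in the lookup.

```r
library(purrr)

# Cut-down stand-ins for `x` and `lookup` (illustrative values only)
x_small <- data.frame(
  MAIN = c(1L, 0L, 1L),
  CONTROL = c(1L, 3L, 2L),
  not_to_recode = c("asdf", "asdf", "asdf"),
  stringsAsFactors = FALSE
)
lookup_small <- data.frame(
  variable_name = c("MAIN", "MAIN", "CONTROL", "CONTROL", "CONTROL"),
  value = c(0, 1, 1, 2, 3),
  label = c("Not main campus", "Main campus",
            "Public", "Private nonprofit", "Private for-profit"),
  stringsAsFactors = FALSE
)

# One lookup table per variable, restricted to columns present in the data
slook_small <- split(lookup_small[-1], lookup_small$variable_name)
cols <- intersect(names(x_small), names(slook_small))

# map2() walks the columns and their lookup tables in parallel and
# returns a list; as.list() makes the column-wise iteration explicit
x_small[cols] <- map2(
  as.list(x_small[cols]), slook_small[cols],
  function(a, b) b$label[match(a, b$value)]
)
print(x_small)
```

The anonymous function is the same one passed to mapply, so the two approaches should differ only in which package supplies the iteration.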

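[Comments]:

Thanks. This fails when used on the full dataset, which includes columns that are not recoded. I have updated my sample data with a non-recoded column. Also, for the benefit of others, I would like to be able to do this in the tidyverse rather than base R, because it is hard to untangle what the base R apply functions and indexing are doing.
You can use Map instead of mapply, which also lets you drop SIMPLIFY=FALSE.
Likewise, replacing b[[2]] with b$label might make it more explicit what the subsetting returns.
I went halfway in the direction of your last suggestion.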
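
The Map and b$label suggestions from the comments can be combined as in the sketch below. The data are a small, hypothetical stand-in for x and lookup (the `_demo` names are made up for this example).

```r
# Hypothetical stand-ins for `x` and `lookup`
x_demo <- data.frame(
  PREDDEG = c(3L, 2L, 1L),
  HIGHDEG = c(4L, 2L, 1L),
  not_to_recode = c("asdf", "asdf", "asdf"),
  stringsAsFactors = FALSE
)
lookup_demo <- data.frame(
  variable_name = rep(c("PREDDEG", "HIGHDEG"), each = 3),
  value = c(1, 2, 3, 1, 2, 4),
  label = c("Predominantly certificate-degree granting",
            "Predominantly associate's-degree granting",
            "Predominantly bachelor's-degree granting",
            "Certificate degree", "Associate degree", "Graduate degree"),
  stringsAsFactors = FALSE
)

# One lookup table per variable; every name here is also a column of x_demo
slook_demo <- split(lookup_demo[-1], lookup_demo$variable_name)

# Map() is mapply() with SIMPLIFY = FALSE, so the result is already a list
x_demo[names(slook_demo)] <- Map(
  function(a, b) b$label[match(a, b$value)],
  x_demo[names(slook_demo)],
  slook_demo
)
print(x_demo)
```

Columns without a lookup table, such as not_to_recode, are never passed to Map() and so stay unchanged.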