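【Question title】: Recode based on lookup table across multiple variables
【Posted】: 2019-03-20 09:14:40
【Question】:

I am trying to recode a number of columns, each of which has its own recoding rule. As far as I can tell, dplyr::recode() does not accept vectors. Ideally the solution would use the tidyverse rather than a bunch of nested loops!

Here are the sample data and the lookup table: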

x <- structure(
  list(
    MAIN = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
             1L, 1L, 1L, 0L, 1L, 1L, 1L, 1L, 1L, 1L),
    PREDDEG = c(3L, 3L, 3L, 3L, 3L, 3L, 2L, 3L, 3L, 3L,
                3L, 2L, 3L, 3L, 2L, 2L, 3L, 1L, 1L, 2L),
    HIGHDEG = c(4L, 4L, 4L, 4L, 4L, 4L, 2L, 3L, 4L, 4L,
                3L, 2L, 3L, 4L, 2L, 2L, 4L, 2L, 1L, 2L),
    CONTROL = c(1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
                2L, 1L, 2L, 3L, 1L, 1L, 2L, 1L, 3L, 1L),
    not_to_recode = c("asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf",
                      "asdf", "asdf", "asdf", "asdf", "asdf", "asdf", "asdf",
                      "asdf", "asdf", "asdf", "asdf", "asdf", "asdf")
  ),
  row.names = c(NA, -20L),
  class = c("tbl_df", "tbl", "data.frame")
)
x
#>    MAIN PREDDEG HIGHDEG CONTROL not_to_recode
#> 1     1       3       4       1          asdf
#> 2     1       3       4       1          asdf
#> 3     1       3       4       2          asdf
#> 4     1       3       4       1          asdf
#> 5     1       3       4       1          asdf
#> 6     1       3       4       1          asdf
#> 7     1       2       2       1          asdf
#> 8     1       3       3       1          asdf
#> 9     1       3       4       1          asdf
#> 10    1       3       4       1          asdf
#> 11    1       3       3       2          asdf
#> 12    1       2       2       1          asdf
#> 13    1       3       3       2          asdf
#> 14    0       3       4       3          asdf
#> 15    1       2       2       1          asdf
#> 16    1       2       2       1          asdf
#> 17    1       3       4       2          asdf
#> 18    1       1       2       1          asdf
#> 19    1       1       1       3          asdf
#> 20    1       2       2       1          asdf


lookup <- structure(list(variable_name = c("MAIN", "MAIN", "PREDDEG", "PREDDEG", "PREDDEG", "PREDDEG", "PREDDEG", "HIGHDEG", "HIGHDEG", "HIGHDEG","HIGHDEG", "HIGHDEG", "CONTROL", "CONTROL", "CONTROL"), 
                         value = c(0, 1, 0, 1, 2, 3, 4, 0, 1, 2, 3, 4, 1, 2, 3), 
                         label = c("Not main campus", "Main campus", "Not classified", "Predominantly certificate-degree granting", "Predominantly associate's-degree granting", "Predominantly bachelor's-degree granting", "Entirely graduate-degree granting", "Non-degree-granting", "Certificate degree", "Associate degree", "Bachelor's degree", "Graduate degree", "Public", "Private nonprofit", "Private for-profit")), 
                    row.names = c(NA, -15L), class = c("tbl_df", "tbl", "data.frame"))

lookup
#>    variable_name value                                     label
#> 1           MAIN     0                           Not main campus
#> 2           MAIN     1                               Main campus
#> 3        PREDDEG     0                            Not classified
#> 4        PREDDEG     1 Predominantly certificate-degree granting
#> 5        PREDDEG     2 Predominantly associate's-degree granting
#> 6        PREDDEG     3  Predominantly bachelor's-degree granting
#> 7        PREDDEG     4         Entirely graduate-degree granting
#> 8        HIGHDEG     0                       Non-degree-granting
#> 9        HIGHDEG     1                        Certificate degree
#> 10       HIGHDEG     2                          Associate degree
#> 11       HIGHDEG     3                         Bachelor's degree
#> 12       HIGHDEG     4                           Graduate degree
#> 13       CONTROL     1                                    Public
#> 14       CONTROL     2                         Private nonprofit
#> 15       CONTROL     3                        Private for-profit

Created on 2018-10-15 by the reprex package (v0.2.1)

【Discussion】:

    Tags: r dplyr


    【Solution 1】:

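    Split the long-form lookup by variable_name, then keep only the variables that are also columns of x, in the order they appear in x:

    slook <- split(lookup[-1], lookup$variable_name)
    slook <- slook[intersect(names(x), names(slook))]
    

    Then use mapply to do the table lookup, with each column matched only against the values for its own variable:

     mapply(function(a, b) { b[['label']][match(a, b$value)] }, x[names(slook)], slook)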
          MAIN              PREDDEG                                     HIGHDEG              CONTROL             
     [1,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
     [2,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
     [3,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Private nonprofit" 
     [4,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
     [5,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
     [6,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
     [7,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"            
     [8,] "Main campus"     "Predominantly bachelor's-degree granting"  "Bachelor's degree"  "Public"            
     [9,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
    [10,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Public"            
    [11,] "Main campus"     "Predominantly bachelor's-degree granting"  "Bachelor's degree"  "Private nonprofit" 
    [12,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"            
    [13,] "Main campus"     "Predominantly bachelor's-degree granting"  "Bachelor's degree"  "Private nonprofit" 
    [14,] "Not main campus" "Predominantly bachelor's-degree granting"  "Graduate degree"    "Private for-profit"
    [15,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"            
    [16,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"            
    [17,] "Main campus"     "Predominantly bachelor's-degree granting"  "Graduate degree"    "Private nonprofit" 
    [18,] "Main campus"     "Predominantly certificate-degree granting" "Associate degree"   "Public"            
    [19,] "Main campus"     "Predominantly certificate-degree granting" "Certificate degree" "Private for-profit"
    [20,] "Main campus"     "Predominantly associate's-degree granting" "Associate degree"   "Public"            
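
To make the step inside the anonymous function easier to follow, here is a minimal sketch of the same lookup written out for a single column. The objects `control_lookup` and `codes` are hypothetical, cut-down stand-ins for one element of `slook` and one column of `x`, not the full data above:

```r
# Hypothetical miniature of one element of `slook` (the CONTROL lookup)
control_lookup <- data.frame(
  value = c(1, 2, 3),
  label = c("Public", "Private nonprofit", "Private for-profit"),
  stringsAsFactors = FALSE
)

# A few CONTROL codes, as they might appear in a column of `x`
codes <- c(1L, 2L, 3L, 1L)

# match() returns, for each code, the row of the lookup holding that value
idx <- match(codes, control_lookup$value)

# Indexing the label column by those rows gives the recoded vector;
# a code that is absent from the lookup would come back as NA
labels <- control_lookup[["label"]][idx]
labels
```

mapply simply repeats this pairing of one column with one lookup table for every variable.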
    

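    To address the problem raised about the example not matching the task (the data now contain a column that should not be recoded): the assignment can be restricted to only those columns whose names also exist in the lookup object: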

     x[, names(slook)] <- mapply(
       function(a, b) {
         b[['label']][          # the character label column
           match(a, b$value)    # row of each x value in that variable's lookup
         ]
       },
       x[names(slook)],   # matched to `a`
       slook,             # matched to `b`
       SIMPLIFY = FALSE   # keep a list rather than simplifying to a matrix
     )
    > x
    # A tibble: 20 x 5
       MAIN            PREDDEG                                   HIGHDEG            CONTROL            not_to_recode
       <chr>           <chr>                                     <chr>              <chr>              <chr>        
     1 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
     2 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
     3 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Private nonprofit  asdf         
     4 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
     5 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
     6 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
     7 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf         
     8 Main campus     Predominantly bachelor's-degree granting  Bachelor's degree  Public             asdf         
     9 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
    10 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Public             asdf         
    11 Main campus     Predominantly bachelor's-degree granting  Bachelor's degree  Private nonprofit  asdf         
    12 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf         
    13 Main campus     Predominantly bachelor's-degree granting  Bachelor's degree  Private nonprofit  asdf         
    14 Not main campus Predominantly bachelor's-degree granting  Graduate degree    Private for-profit asdf         
    15 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf         
    16 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf         
    17 Main campus     Predominantly bachelor's-degree granting  Graduate degree    Private nonprofit  asdf         
    18 Main campus     Predominantly certificate-degree granting Associate degree   Public             asdf         
    19 Main campus     Predominantly certificate-degree granting Certificate degree Private for-profit asdf         
    20 Main campus     Predominantly associate's-degree granting Associate degree   Public             asdf         
    

    If you want to mimic what mapply does, I believe the purrr package in the tidyverse offers similar functionality. Specifically, take a look at map2:

     help("map2", package = "purrr")  # see also `pmap`
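
As a rough sketch of what that could look like, the following uses purrr::map2 in place of mapply. The objects `x_small`, `lookup_small`, and `slook_small` are hypothetical miniatures of the data above, and the approach is one possible translation rather than a tested solution for the full dataset:

```r
library(purrr)

# Hypothetical cut-down stand-ins for `x` and `lookup`
x_small <- data.frame(
  MAIN = c(1L, 0L, 1L),
  CONTROL = c(1L, 3L, 2L),
  not_to_recode = c("asdf", "asdf", "asdf"),
  stringsAsFactors = FALSE
)
lookup_small <- data.frame(
  variable_name = c("MAIN", "MAIN", "CONTROL", "CONTROL", "CONTROL"),
  value = c(0, 1, 1, 2, 3),
  label = c("Not main campus", "Main campus",
            "Public", "Private nonprofit", "Private for-profit"),
  stringsAsFactors = FALSE
)

# One lookup table per variable, restricted to columns present in the data
slook_small <- split(lookup_small[-1], lookup_small$variable_name)
slook_small <- slook_small[intersect(names(x_small), names(slook_small))]

# map2() walks the columns and their lookup tables in parallel, like mapply();
# as.list() makes it explicit that iteration is over columns
x_small[names(slook_small)] <- map2(
  as.list(x_small[names(slook_small)]),
  slook_small,
  function(col, tbl) tbl$label[match(col, tbl$value)]
)
x_small
```

Columns without a lookup entry, such as not_to_recode here, are never passed to map2 and so are left unchanged.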
    

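    【Comments】:

    • Thanks. This fails when used on the full dataset, which includes columns that are not recoded. I have updated my example data with a non-recoded column. Also, for the benefit of others, I would like to be able to do this in the tidyverse rather than base, because it is hard to untangle what is going on with the base R apply functions and indexing.
    • You can use Map instead of mapply, which also lets you skip SIMPLIFY=FALSE.
    • Likewise, replacing b[[2]] with b$label might make it more explicit what the subsetting returns.
    • I went halfway in the direction of your last suggestion.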
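
The two suggestions in the comments (Map instead of mapply, and b$label for the label column) can be combined as in the sketch below. The data are a hypothetical miniature, with `x_small` and `slook_small` standing in for `x` and `slook`:

```r
# Hypothetical miniature data, not the full example
x_small <- data.frame(
  HIGHDEG = c(4L, 2L, 1L),
  not_to_recode = c("asdf", "asdf", "asdf"),
  stringsAsFactors = FALSE
)
slook_small <- list(
  HIGHDEG = data.frame(
    value = c(0, 1, 2, 3, 4),
    label = c("Non-degree-granting", "Certificate degree", "Associate degree",
              "Bachelor's degree", "Graduate degree"),
    stringsAsFactors = FALSE
  )
)

# Map() always returns a list, so SIMPLIFY = FALSE is not needed
x_small[names(slook_small)] <- Map(
  function(a, b) b$label[match(a, b$value)],
  x_small[names(slook_small)],
  slook_small
)
x_small
```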