How to match values of several variables to a variable in a look up table? Answers

【Question title】: How to match values of several variables to a variable in a look up table?
【Posted】: 2018-10-25 09:30:27
【Question】:

I have two datasets:

loc <- c("a","b","c","d","e")
id1 <- c(NA,9,3,4,5)
id2 <- c(2,3,7,5,6)
id3 <- c(2,NA,5,NA,7)
cost1 <- c(10,20,30,40,50)
cost2 <- c(50,20,30,30,50)
cost3 <- c(40,20,30,10,20)
dt <- data.frame(loc,id1,id2,id3,cost1,cost2,cost3)


id <- c(1,2,3,4,5,6,7)
rate <- c(0.9,0.8,0.7,0.6,0.5,0.4,0.3)
lookupd_tb <- data.frame(id,rate)

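What I want to do is match the values of id1, id2 and id3 in dt against the id column of lookupd_tb and, where there is a match, multiply the associated cost by that id's rate.

Here is my approach: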

library(dplyr)

dt <- dt %>%
  left_join(lookupd_tb, by = c("id1" = "id")) %>%
  mutate(cost1 = ifelse(!is.na(rate), cost1 * rate, cost1)) %>%
  select(-rate)

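What I am doing now works fine, but I have to repeat it three times, once per variable, and I was wondering whether there is a more efficient way to do this (perhaps with the apply family?).

I tried joining all three id variables to my lookup table in one go, but once rate is joined to dt, all the costs (cost1, cost2 and cost3) get multiplied by the same rate, which is not what I want.

Thanks for your help!

【Comments】:

Tags: r dplyr apply lookup

【Solution 1】:

A base R approach is to loop over the 'id' columns with sapply/lapply, use match to find the position of each value in the 'id' column of 'lookupd_tb', pull the corresponding 'rate', replace the NA elements with 1, multiply by the 'cost' columns, and update the 'cost' columns.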

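# column positions of the id and cost variables
nmid <- grep("id", names(dt))
nmcost <- grep("cost", names(dt))

# look up the rate for each id column; ids that are NA or unmatched get a rate of 1
dt[nmcost] <- dt[nmcost] * sapply(dt[nmid], function(x) {
  x1 <- lookupd_tb$rate[match(x, lookupd_tb$id)]
  replace(x1, is.na(x1), 1)
})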
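To see what the function inside sapply does for a single column, here is a minimal sketch that walks through the same three steps (match, index, replace) for id1 and cost1 only. The data are copied from the question; the intermediate names idx, r and new_cost1 are introduced here purely for illustration.

```r
# Lookup table and one id/cost pair, copied from the question
id    <- c(1, 2, 3, 4, 5, 6, 7)
rate  <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3)
id1   <- c(NA, 9, 3, 4, 5)
cost1 <- c(10, 20, 30, 40, 50)

idx <- match(id1, id)           # position of each id1 value in id; NA where absent
r   <- rate[idx]                # rate for matched ids, NA otherwise
r   <- replace(r, is.na(r), 1)  # NA or unmatched ids keep their original cost
new_cost1 <- cost1 * r
print(new_cost1)
```

The first two elements (NA and 9) have no entry in the lookup table, so their costs should come out unchanged, while the remaining three are scaled by 0.7, 0.6 and 0.5.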

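Alternatively, with the tidyverse, we can use purrr::map2 to loop over the two sets of columns, i.e. 'id' and 'cost', and apply the same approach as above. The only difference is that here we create new columns instead of updating the 'cost' columns.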

library(tidyverse)
dt %>%
  select(all_of(nmid)) %>%
  map2_df(., dt %>% select(all_of(nmcost)),
          ~ .x %>%
              match(., lookupd_tb$id) %>%
              lookupd_tb$rate[.] %>%
              replace(., is.na(.), 1) * .y) %>%
  rename_all(~ paste0("costnew", seq_along(.))) %>%
  bind_cols(dt, .)

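【Comments】:

@EmmaNej That was based on your ifelse statement.
I wanted the original values, and your approach works great! Thanks!
Could you explain the placeholder (.), ~, and .x/.y in the code? I am having a hard time understanding these!
@EmmaNej map2 takes two arguments, and when we use it with ~, we refer to the values of those arguments as .x and .y, in the same order. That is, .x will be dt %>% select(nmid) and .y will be dt %>% select(nmcost). Both are looped over column by column, so each column of dt[nmid] is paired with the corresponding column of dt[nmcost].
@EmmaNej I think it is because we chained too much into one pipe. To understand it better, try map2(dt[nmid], dt[nmcost], ~ .x) and map2(dt[nmid], dt[nmcost], ~ .y); that should make it clearer.

【Solution 2】:

In the tidyverse, you can also try another approach: reshape the data from wide to long.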
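As a small illustration of the placeholders discussed in these comments, the sketch below uses two toy lists (ids and costs, made up here and not part of the original data) to show what .x, .y and the magrittr dot refer to.

```r
library(purrr)  # also re-exports the %>% pipe

# Hypothetical toy inputs: two "id" columns and two "cost" columns
ids   <- list(id1 = c(1, 2), id2 = c(3, 4))
costs <- list(cost1 = c(10, 20), cost2 = c(30, 40))

# In a ~ formula, .x is the current element of the first argument
# and .y is the current element of the second argument
first_args  <- map2(ids, costs, ~ .x)
second_args <- map2(ids, costs, ~ .y)
products    <- map2(ids, costs, ~ .x * .y)

# In a pipe, the dot stands for the value coming from the left-hand side
filled <- c(0.7, NA) %>% replace(., is.na(.), 1)

str(first_args)
str(second_args)
str(products)
print(filled)
```

Note that map2 names its result after the first argument, so all three lists carry the names id1 and id2.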

library(tidyverse)
dt %>%
  # data transformation to long
  gather(k, v, -loc) %>%
  mutate(ID = paste0("costnew", str_extract(k, "[:digit:]")),
         k = str_remove(k, "[:digit:]")) %>%
  spread(k, v) %>%
  # left_join and calculations of new costs
  left_join(lookupd_tb, by = "id") %>%
  mutate(cost_new = ifelse(is.na(rate), cost, rate * cost)) %>%
  # clean up and expected output
  select(loc, ID, cost_new) %>%
  spread(ID, cost_new) %>%
  left_join(dt, ., by = "loc")  # or %>% bind_cols(dt, .)

  loc id1 id2 id3 cost1 cost2 cost3 costnew1 costnew2 costnew3
1   a  NA   2   2    10    50    40       10       40       32
2   b   9   3  NA    20    20    20       20       14       20
3   c   3   7   5    30    30    30       21        9       15
4   d   4   5  NA    40    30    10       24       15       10
5   e   5   6   7    50    50    20       25       20        6

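The idea is to use gather and spread, together with the new index columns k and ID, to bring the data into a long format suitable for the left_join. After the calculation, a second spread converts the result back to the expected wide output, which is then joined to dt.

【Comments】:

Thanks for your reply, but the result is not correct. I want to change cost2 and cost3 according to the new rates; attaching the rate like this leads to wrong results.
Yes, that does it! Thanks!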
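The same wide-to-long idea can also be sketched with tidyr's newer pivot_longer and pivot_wider, which supersede gather and spread. This is not part of the original answer: it assumes tidyr 1.0.0 or later, uses coalesce in place of the ifelse, and the names n, long, wide and result are chosen here only for illustration.

```r
library(dplyr)
library(tidyr)

# Data copied from the question
dt <- data.frame(
  loc = c("a", "b", "c", "d", "e"),
  id1 = c(NA, 9, 3, 4, 5), id2 = c(2, 3, 7, 5, 6), id3 = c(2, NA, 5, NA, 7),
  cost1 = c(10, 20, 30, 40, 50),
  cost2 = c(50, 20, 30, 30, 50),
  cost3 = c(40, 20, 30, 10, 20)
)
lookupd_tb <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7),
                         rate = c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3))

# One row per (loc, n) with columns id and cost; n is the numeric suffix
long <- dt %>%
  pivot_longer(-loc,
               names_to = c(".value", "n"),
               names_pattern = "([a-z]+)([0-9]+)") %>%
  left_join(lookupd_tb, by = "id") %>%
  mutate(costnew = cost * coalesce(rate, 1))  # NA or unmatched ids use a rate of 1

# Back to wide: costnew1, costnew2, costnew3
wide <- long %>%
  select(loc, n, costnew) %>%
  pivot_wider(names_from = n, values_from = costnew, names_prefix = "costnew")

result <- left_join(dt, wide, by = "loc")
print(result)
```

Under these assumptions the costnew columns should agree with the output shown for this solution.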