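Replace values in a data frame based on a lookup table: answers

【Question title】: Replace values in a dataframe based on lookup table
【Posted】: 2016-06-08 18:13:34
【Question】:

I'm having some trouble replacing values in a data frame. I would like to replace values based on a separate table. Below is an example of what I am trying to do.

I have a table where every row is a customer and every column is an animal they purchased. Let's call this data frame table.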

> table
#       P1     P2     P3
# 1    cat lizard parrot
# 2 lizard parrot    cat
# 3 parrot    cat lizard

I also have a table called lookUp that I will reference.

> lookUp
#      pet   class
# 1    cat  mammal
# 2 lizard reptile
# 3 parrot    bird

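What I want to do is create a new table called new, in which every value in table is replaced by the matching value from the class column of lookUp. I tried it myself using lapply, but I got the following warnings.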

new <- as.data.frame(lapply(table, function(x) {
  gsub('.*', lookUp[match(x, lookUp$pet) ,2], x)}), stringsAsFactors = FALSE)

Warning messages:
1: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used
2: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used
3: In gsub(".*", lookUp[match(x, lookUp$pet), 2], x) :
  argument 'replacement' has length > 1 and only the first element will be used

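Any ideas on how to make this work?

【Comments on the question】:

You should be able to do this by using cbind on two columns of row and column indices. See ?"["

Tags: r dataframe lookup

【Solution 1】:

I would use the built-in factor for this.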

table$P1 <- factor(table$P1, levels=lookUp$pet, labels=lookUp$class)
table$P2 <- factor(table$P2, levels=lookUp$pet, labels=lookUp$class)
table$P3 <- factor(table$P3, levels=lookUp$pet, labels=lookUp$class)
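
The three assignments above can be collapsed into a single call. The following is a minimal sketch, assuming every column of table should be recoded; the example data from the question is rebuilt here so the block runs on its own. Note that the resulting columns are factors, and any value missing from lookUp$pet becomes NA.

```r
# Example data from the question (rebuilt so the sketch is self-contained)
table <- data.frame(P1 = c("cat", "lizard", "parrot"),
                    P2 = c("lizard", "parrot", "cat"),
                    P3 = c("parrot", "cat", "lizard"),
                    stringsAsFactors = FALSE)
lookUp <- data.frame(pet = c("cat", "lizard", "parrot"),
                     class = c("mammal", "reptile", "bird"),
                     stringsAsFactors = FALSE)

# Recode every column at once; table[] keeps the data.frame structure
table[] <- lapply(table, factor, levels = lookUp$pet, labels = lookUp$class)
```

If plain character columns are preferred, a follow-up `table[] <- lapply(table, as.character)` should convert them back.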

【Discussion】:

【Solution 2】:

I tried the other approaches, but they took a very long time on my very large dataset. I used the following instead:

    # make table "new" using ifelse. See data below to avoid re-typing it
    new <- ifelse(table1 =="cat", "mammal",
                        ifelse(table1 == "lizard", "reptile",
                               ifelse(table1 =="parrot", "bird", NA)))

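This approach requires you to write more code, but the vectorization of ifelse makes it run faster. You have to decide, based on your data, whether you would rather spend more time writing code or waiting for the computer to run. If you want to make sure it worked (that there are no typos in your ifelse calls), you can use apply(new, 2, function(x) mean(is.na(x))).

Data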

    # create the data table
    table1 <- read.table(text = "
       P1     P2     P3
     1    cat lizard parrot
     2 lizard parrot    cat
     3 parrot    cat lizard", header = TRUE)
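
Putting the snippet and the data together, the check described above can be sketched as follows. This is only an illustration on the toy data; colMeans(is.na(new)) is used here as an equivalent of the apply call, and the result of ifelse on a data frame comparison is a character matrix rather than a data.frame.

```r
# Toy data, as in the answer
table1 <- read.table(text = "
       P1     P2     P3
     1    cat lizard parrot
     2 lizard parrot    cat
     3 parrot    cat lizard", header = TRUE)

# Nested ifelse: anything that matches none of the three pets becomes NA
new <- ifelse(table1 == "cat", "mammal",
              ifelse(table1 == "lizard", "reptile",
                     ifelse(table1 == "parrot", "bird", NA)))

# Share of NA per column; a non-zero value points to a typo or an unmapped pet
na_share <- colMeans(is.na(new))
```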

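【Discussion】:

【Solution 3】:

Any time you have two separate data.frames and are trying to bring information from one to the other, the answer is to merge.

Everyone has their own favorite merge method in R. Mine is data.table.

Also, since you want to do this for many columns, it will be faster to melt and dcast: rather than looping over the columns, apply the merge once to the reshaped table, then reshape it back.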

library(data.table)

#the row names will be our ID variable for melting
setDT(table, keep.rownames = TRUE) 
setDT(lookUp)

#now melt, merge, recast
# melting (reshape wide to long)
table[ , melt(.SD, id.vars = 'rn')     
       # merging
       ][lookUp, new_value := i.class, on = c(value = 'pet') 
         #reform back to original shape
         ][ , dcast(.SD, rn ~ variable, value.var = 'new_value')]
#    rn      P1      P2      P3
# 1:  1  mammal reptile    bird
# 2:  2 reptile    bird  mammal
# 3:  3    bird  mammal reptile

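If you find the dcast/melt part a bit intimidating, here is an approach that just loops over the columns; dcast/melt simply sidesteps the loop for this problem.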

setDT(table) #don't need row names this time
setDT(lookUp)

sapply(names(table), #(or to whichever are the relevant columns)
       function(cc) table[lookUp, (cc) := #merge, replace
                            #need to pass a _named_ vector to 'on', so use setNames
                            i.class, on = setNames("pet", cc)])

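【Discussion】:

I really like this approach and used it in an application, but I was surprised by how the result was sorted (this is not obvious in a toy example with fewer than 10 rows). It seems that because the row names stored in rn are strings, they are sorted like strings, i.e. '10' comes before '2', and the result is ordered accordingly. My ugly workaround was to add a final sorting step in which I coerce rn to numeric and sort on it, but I wonder whether there is a more canonical data.table-y way to handle this.
@DanielKessler data.tables don't keep row names; maybe you want to convert rn to numeric before attempting the merge? table[ , rn := type.convert(rn)], then proceed with the melt/merge/recast.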
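
The ordering issue raised in this discussion can be avoided by using an integer row id from the start. The sketch below is one possible way to do that, not the answerer's code: it builds the id with .I instead of keep.rownames, and the 11-row data set (dt) is made up purely to have more than 9 rows.

```r
library(data.table)

pets <- c("cat", "lizard", "parrot")
# Hypothetical 11-row table, so that string sorting ("10" < "2") would matter
dt <- data.table(P1 = rep(pets, length.out = 11),
                 P2 = rep(pets[c(2, 3, 1)], length.out = 11))
lookUp <- data.table(pet = pets, class = c("mammal", "reptile", "bird"))

dt[, rn := .I]                                             # integer id, not character row names
long <- melt(dt, id.vars = "rn")                           # wide to long
long[lookUp, new_value := i.class, on = c(value = "pet")]  # update join against the lookup
res <- dcast(long, rn ~ variable, value.var = "new_value") # back to wide, sorted by integer rn
```

Because rn is an integer here, dcast orders the rows numerically and no extra sorting step should be needed.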

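【Solution 4】:

The answer above showing how to do this in dplyr does not answer the question: the table is filled with NAs. This works, and I would appreciate any comments showing a better way: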

library(dplyr)
library(tidyr)

# Add a customer column so that we can put things back in the right order
table$customer = seq(nrow(table))
classTable <- table %>% 
    # put in long format, naming column filled with P1, P2, P3 "petCount"
    gather(key="petCount", value="pet", -customer) %>% 
    # add a new column based on the pet's class in data frame "lookUp"
    left_join(lookUp, by="pet") %>%
    # since you wanted to replace the values in "table" with their
    # "class", remove the pet column
    select(-pet) %>% 
    # put data back into wide format
    spread(key="petCount", value="class")

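Note that it may be useful to keep the long table containing the customer, the pet, the pet's species(?) and their class. This example simply adds an intermediate save to a variable: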

table$customer = seq(nrow(table))
petClasses <- table %>% 
    gather(key="petCount", value="pet", -customer) %>% 
    left_join(lookUp, by="pet")

custPetClasses <- petClasses %>%
    select(-pet) %>% 
    spread(key="petCount", value="class")

【Discussion】:

【Solution 5】:

Make a named vector, loop through each column, and match; see:

# make lookup vector with names
lookUp1 <- setNames(as.character(lookUp$class), lookUp$pet)
lookUp1    
#      cat    lizard    parrot 
# "mammal" "reptile"    "bird" 

# match on names get values from lookup vector
res <- data.frame(lapply(df1, function(i) lookUp1[i]))
# reset rownames
rownames(res) <- NULL

# res
#        P1      P2      P3
# 1  mammal reptile    bird
# 2 reptile    bird  mammal
# 3    bird  mammal reptile

Data

df1 <- read.table(text = "
       P1     P2     P3
 1    cat lizard parrot
 2 lizard parrot    cat
 3 parrot    cat lizard", header = TRUE)

lookUp <- read.table(text = "
      pet   class
 1    cat  mammal
 2 lizard reptile
 3 parrot    bird", header = TRUE)

【Discussion】:

The same idea implemented with the purrr package saves a few keystrokes: res <- purrr::map_df(df1, ~ lookUp1[.x])
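
One caveat with indexing a named vector by column values: if the columns are factors (the default for read.table before R 4.0), `[` uses the factor's integer codes rather than its labels. The sketch below shows a defensive variant with as.character; the lookup vector is deliberately written in a non-alphabetical order, which is an assumption made only to expose the difference.

```r
# Columns stored as factors, as older R versions would produce by default
df1 <- data.frame(P1 = c("cat", "lizard", "parrot"),
                  P2 = c("lizard", "parrot", "cat"),
                  stringsAsFactors = TRUE)

# Named lookup vector, intentionally not in alphabetical order
lookUp1 <- c(parrot = "bird", cat = "mammal", lizard = "reptile")

# as.character() forces lookup by name; unname() drops the pet names again
res <- df1
res[] <- lapply(df1, function(col) unname(lookUp1[as.character(col)]))

# For comparison: indexing by the factor itself uses integer codes 1, 2, 3
by_code <- unname(lookUp1[df1$P1])
```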

【Solution 6】:

The approach you posted in your question was not bad. Here is a similar one:

new <- df  # create a copy of df
# using lapply, loop over columns and match values to the look up table. store in "new".
new[] <- lapply(df, function(x) look$class[match(x, look$pet)])

Another approach, which will be faster, is:

new <- df
new[] <- look$class[match(unlist(df), look$pet)]

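Note that in both cases I use empty brackets ([]) to keep the structure of new as it was (a data.frame).

(In my answer I use df instead of table and look instead of lookup.)

【Discussion】:

Why does this make all fields in new disappear except those in the changed columns?
match produces NAs, which I found to be a problem: see this example, match(1:6,c(1,3,4,2,5))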
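
The NA behaviour mentioned in this discussion comes from values that have no entry in the lookup table. One way to handle it, sketched here under the assumption that unmatched values should simply be left unchanged, is to fall back to the original value wherever match returns NA. The "dog" entry is made up for illustration.

```r
# "dog" has no entry in the lookup table
df <- data.frame(P1 = c("cat", "dog"),
                 P2 = c("parrot", "cat"),
                 stringsAsFactors = FALSE)
look <- data.frame(pet = c("cat", "lizard", "parrot"),
                   class = c("mammal", "reptile", "bird"),
                   stringsAsFactors = FALSE)

new <- df
new[] <- lapply(df, function(x) {
  idx <- match(x, look$pet)                # NA where x is not in look$pet
  ifelse(is.na(idx), x, look$class[idx])   # keep the original value if unmatched
})
```

Whether to keep the original value, insert a placeholder, or leave NA depends on the data; this only shows the first option.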

【Solution 7】:

Another option is a combination of tidyr and dplyr

library(dplyr)
library(tidyr)
table %>%
   gather(key = "pet") %>%
   left_join(lookup, by = "pet") %>%
   spread(key = pet, value = class)

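【Discussion】:

I get all NAs with this solution. It must be my setup: 'table
I think an updated version of this solution would look something like: table %>% gather(key = "pet") %>% left_join(lookup, by = "pet") %>% spread(key = pet, value = class), since gather and spread have been superseded by pivot_longer and pivot_wider: tidyverse.org/blog/2019/09/tidyr-1-0-0.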
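
For reference, a version built on pivot_longer and pivot_wider might look like the sketch below. It is not taken from the thread: the customer id column and the name slot are assumptions added so that rows can be reassembled, and the join is done on the pet values rather than on the column names, which is what appears to produce the NAs reported above.

```r
library(dplyr)
library(tidyr)

table <- data.frame(P1 = c("cat", "lizard", "parrot"),
                    P2 = c("lizard", "parrot", "cat"),
                    P3 = c("parrot", "cat", "lizard"),
                    stringsAsFactors = FALSE)
lookUp <- data.frame(pet = c("cat", "lizard", "parrot"),
                     class = c("mammal", "reptile", "bird"),
                     stringsAsFactors = FALSE)

new <- table %>%
  mutate(customer = row_number()) %>%                            # id to rebuild rows later
  pivot_longer(-customer, names_to = "slot", values_to = "pet") %>%
  left_join(lookUp, by = "pet") %>%                              # join on the pet values
  select(-pet) %>%
  pivot_wider(names_from = slot, values_from = class) %>%
  select(-customer)
```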