【Question title】: How to replace factor levels in multiple columns of a data frame based on a lookup data frame using R
【Posted】: 2018-12-14 11:22:58
【Question description】:

I want to replace the levels in df1 that match lab_pt in the data frame lookup_df with the corresponding levels from the second column of lookup_df (here: lab_en), while leaving all other levels as they are. Thank you very much!

--

Main data frame

df1 <- data.frame(
            num_var = sample(200, 15),
            col1 = rep(c("onda","estrela","rato","caneta","ceu"), 3),
            col2 = rep(c("muro","gato","pa","rato","ceu"), 3),
            col3 = rep(c("surf","onda","dente","onda","sei"), 3),
            col3 = rep(c("onda","casa",NA,"nao","net"), 3))

Lookup data frame

lookup_df <- data.frame(
            lab_pt = c("onda","estrela","rato","caneta","ceu"),
            lab_en = c("wave","star","rat","pen","sky"))

I tried the following. It does the job, but the values with no match are converted to NA, which I don't want.

rownames(lookup_df) <- lookup_df$lab_pt
apply(df1[,2:ncol(df1)], 2, function(x) lookup_df[as.character(x),]$lab_en)

This post is very similar, but in that case all levels are matchable, unlike here. Thank you very much! Replace values in a dataframe based on lookup table

【Question comments】:

    Tags: r


    【Solution 1】:

    I think this should do it, using the data.table package. It does reorder the ids; is that a problem?

    # added seed
    # changed col3 to col4
    set.seed(1)
    df1 <- data.frame(
      num_var = sample(200, 15),
      col1 = rep(c("onda","estrela","rato","caneta","ceu"), 3),
      col2 = rep(c("muro","gato","pa","rato","ceu"), 3),
      col3 = rep(c("surf","onda","dente","onda","sei"), 3),
      col4 = rep(c("onda","casa",NA,"nao","net"), 3))
    
    lookup_df <- data.frame(
      lab_pt = c("onda","estrela","rato","caneta","ceu"),
      lab_en = c("wave","star","rat","pen","sky"))
    
    # data.table solution
    library(data.table)
    
    # change from wide to long, to make merge easier
    dt <- melt(as.data.table(df1), id.vars="num_var")
    
    # merge in the new values to original data
    dt2 <- merge(dt, lookup_df, by.x="value", by.y="lab_pt",
                 all.x=TRUE)
    
    # if it's missing (no match in the lookup), keep the original value
    dt2[is.na(lab_en), lab_en := value]
    
    # convert back from long to wide
    dt3 <- dcast(dt2[, .(num_var, variable, lab_en)], num_var~variable,
                value.var="lab_en")
    
    # back to data.frame
    output <- as.data.frame(dt3)
    

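    Whenever you merge tables, it is usually better to work with the data in long format, where you have one group column and one value column. That way you don't need to run the same operation (the merge) multiple times.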
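
    The reordering comes from dcast(), which returns its rows sorted by the left-hand side of the formula (num_var). A minimal sketch of one way to keep the original row order, assuming a helper column is added before melting (the name row_id is introduced here for illustration only), on a small made-up data set:

```r
library(data.table)

# Small made-up data; num_var is deliberately not sorted
df1 <- data.frame(
  num_var = c(50, 10, 30),
  col1 = c("onda", "muro", "ceu"),
  col2 = c("gato", "rato", NA),
  stringsAsFactors = FALSE)

lookup_df <- data.frame(
  lab_pt = c("onda", "rato", "ceu"),
  lab_en = c("wave", "rat", "sky"),
  stringsAsFactors = FALSE)

dt <- as.data.table(df1)
dt[, row_id := .I]  # hypothetical helper column: original row position

# wide -> long, keeping row_id alongside num_var
long <- melt(dt, id.vars = c("row_id", "num_var"))

# one merge for all columns; unmatched values get NA in lab_en
long <- merge(long, lookup_df, by.x = "value", by.y = "lab_pt", all.x = TRUE)
long[is.na(lab_en), lab_en := value]

# long -> wide; rows come back sorted by row_id, i.e. in the original order
wide <- dcast(long, row_id + num_var ~ variable, value.var = "lab_en")
wide[, row_id := NULL]

output <- as.data.frame(wide)
```

    Putting row_id first in the formula means the sort done by dcast() restores the input order, so no separate reordering step is needed afterwards.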

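    【Comments】:

    • Thank you very much! Your approach is very nice. I will use it! I'll add the following line here to sort the data frame back into its original order: output[ order(match(output$num_var, df1$num_var)), ]
    【Solution 2】:

    I think this may help you. It creates a new column, but it gets the job done.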

    df1$new <- lookup_df[match(df1$col1, lookup_df$lab_pt),2]
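
    Because match() returns NA for values that are not in the lookup, the line above also gives NA for unmatched entries. A minimal sketch, assuming the same lookup layout and a helper named translate_col (hypothetical, introduced for this illustration), that falls back to the original value and is applied to every column except num_var:

```r
# Small made-up data in the same shape as the question's
df1 <- data.frame(
  num_var = 1:5,
  col1 = c("onda", "estrela", "rato", "caneta", "ceu"),
  col2 = c("muro", "gato", "pa", "rato", "ceu"),
  stringsAsFactors = FALSE)

lookup_df <- data.frame(
  lab_pt = c("onda", "estrela", "rato", "caneta", "ceu"),
  lab_en = c("wave", "star", "rat", "pen", "sky"),
  stringsAsFactors = FALSE)

# Hypothetical helper: look up each value, keep the original where nothing matches
translate_col <- function(x, from, to) {
  x <- as.character(x)
  idx <- match(x, as.character(from))
  ifelse(is.na(idx), x, as.character(to)[idx])
}

# Apply to every column except the numeric id
cols <- setdiff(names(df1), "num_var")
df1[cols] <- lapply(df1[cols], translate_col,
                    from = lookup_df$lab_pt, to = lookup_df$lab_en)
```

    The columns come back as character vectors; wrap the result in factor() if factors are needed downstream.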
    

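    【Comments】:

    • OK, thanks! This does work nicely, but it only replaces one column. Maybe I will loop over all the columns of the data frame using your approach.
    【Solution 3】:

    You can do the following: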

    lookup_vec <- setNames(as.character(lookup_df[["lab_en"]]), lookup_df[["lab_pt"]])
    #   onda estrela    rato  caneta     ceu 
    # "wave"  "star"   "rat"   "pen"   "sky" 
    factors_vars <- names(df1)[sapply(df1, is.factor)]
    for (var in factors_vars) {
      w <- which(levels(df1[[var]]) %in% names(lookup_vec)) # Get only those that are "matchable"
      levels(df1[[var]])[w] <- lookup_vec[levels(df1[[var]])[w]]
    }
    df1
    
       num_var col1 col2  col3 col3.1
    1       21 wave muro  surf   wave
    2      104 star gato  wave   casa
    3       60  rat   pa dente   <NA>
    4      183  pen  rat  wave    nao
    5      123  sky  sky   sei    net
    6       17 wave muro  surf   wave
    7       34 star gato  wave   casa
    8      126  rat   pa dente   <NA>
    9      139  pen  rat  wave    nao
    10      35  sky  sky   sei    net
    11     149 wave muro  surf   wave
    12       8 star gato  wave   casa
    13      46  rat   pa dente   <NA>
    14      32  pen  rat  wave    nao
    15     162  sky  sky   sei    net
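
    The loop above only touches factor columns. Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so with the data defined as in the question factors_vars would be empty and nothing would be replaced. A minimal sketch of the same level-based replacement on a current R version, assuming the factors are created explicitly (small made-up data):

```r
df1 <- data.frame(
  num_var = 1:5,
  col1 = c("onda", "estrela", "rato", "caneta", "ceu"),
  col2 = c("muro", "gato", "pa", "rato", "ceu"),
  stringsAsFactors = TRUE)  # explicit: the default is FALSE since R 4.0.0

lookup_df <- data.frame(
  lab_pt = c("onda", "estrela", "rato", "caneta", "ceu"),
  lab_en = c("wave", "star", "rat", "pen", "sky"),
  stringsAsFactors = FALSE)

# Named vector used as a dictionary: names are lab_pt, values are lab_en
lookup_vec <- setNames(lookup_df$lab_en, lookup_df$lab_pt)

factor_vars <- names(df1)[sapply(df1, is.factor)]
for (var in factor_vars) {
  lev <- levels(df1[[var]])
  hit <- lev %in% names(lookup_vec)
  # Only the level labels are rewritten; the integer codes stay untouched
  levels(df1[[var]])[hit] <- lookup_vec[lev[hit]]
}
```

    Renaming levels rather than values means each distinct word is looked up once per column, however many rows the data frame has.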
    

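    【Comments】:

    • Man! Thank you very much! I put your very simple and effective idea into a function and it works really well! This function should actually be developed further and included in a proper data-wrangling package! Very useful.
    【Solution 4】:

    Here is a solution using the dplyr package. Note the argument stringsAsFactors = F, which keeps the words as character strings.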

       df1 <- data.frame(
          num_var = sample(200, 15),
          col1 = rep(c("onda","estrela","rato","caneta","ceu"), 3),
          col2 = rep(c("muro","gato","pa","rato","ceu"), 3),
          col3 = rep(c("surf","onda","dente","onda","sei"), 3),
          col3 = rep(c("onda","casa",NA,"nao","net"), 3), stringsAsFactors = F)
    
        lookup_df <- data.frame(
          lab_pt = c("onda","estrela","rato","caneta","ceu"),
          lab_en = c("wave","star","rat","pen","sky"), stringsAsFactors = F)
    
    
        library(dplyr)
    
        # match() looks up each value individually; coalesce() keeps the
        # original value wherever there is no match in the lookup
        translate <- function(x) coalesce(lookup_df$lab_en[match(x, lookup_df$lab_pt)], x)

        df1 %>% mutate(col1 = translate(col1)) %>%
          mutate(col2 = translate(col2)) %>%
          mutate(col3 = translate(col3)) %>%
          mutate(col3.1 = translate(col3.1))
    

    I admit that using one line per column of the data frame is a bit tedious. I could not find a way to do this for all columns at once.

       num_var col1 col2  col3 col3.1
    1        6 wave muro  surf   wave
    2       84 star gato  wave   casa
    3      146  rat   pa dente   <NA>
    4      133  pen  rat  wave    nao
    5       47  sky  sky   sei    net
    6      116 wave muro  surf   wave
    7       81 star gato  wave   casa
    8      118  rat   pa dente   <NA>
    9      186  pen  rat  wave    nao
    10     161  sky  sky   sei    net
    11     135 wave muro  surf   wave
    12      31 star gato  wave   casa
    13     174  rat   pa dente   <NA>
    14     187  pen  rat  wave    nao
    15     178  sky  sky   sei    net
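
    For applying the replacement to all columns at once, one option is across(), available in dplyr 1.0.0 and later. A minimal sketch, assuming character columns and that every column except num_var should be translated (small made-up data):

```r
library(dplyr)

df1 <- data.frame(
  num_var = 1:5,
  col1 = c("onda", "estrela", "rato", "caneta", "ceu"),
  col2 = c("muro", "gato", NA, "rato", "ceu"),
  stringsAsFactors = FALSE)

lookup_df <- data.frame(
  lab_pt = c("onda", "estrela", "rato", "caneta", "ceu"),
  lab_en = c("wave", "star", "rat", "pen", "sky"),
  stringsAsFactors = FALSE)

# For each selected column: look up every value with match(), then use
# coalesce() to fall back to the original value where there is no match
result <- df1 %>%
  mutate(across(-num_var,
                ~ coalesce(lookup_df$lab_en[match(.x, lookup_df$lab_pt)], .x)))
```

    Values that are NA in the input stay NA, since both the lookup result and the fallback are NA for them.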
    

    【Comments】:

      【Solution 5】:
      library(tibble)

      # Fake data frame
      df1 <- tibble(
        num_var = sample(200, 15),
        col1 = rep(c("onda","estrela","rato","caneta","ceu"), 3),
        col2 = rep(c("muro","gato","pa","rato","ceu"), 3),
        col3 = rep(c("surf","onda","dente","onda","sei"), 3),
        col4 = rep(c("onda","casa",NA,"nao","net"), 3))

      # Lookup dictionary data frame
      lookup_dat <- tibble(
        lab_pt = c("onda","estrela","rato","caneta","ceu"),
        lab_en = c("wave","star","rat","pen","sky"))

      #******************************************************************
      #
      # Translation by replacement from a lookup dictionary
      # Developed to generate Rmd reports with plot labels in different languages
      #
      # Replaces the levels of df that match col_langu_in of lookup_df with the
      # level on the same row of col_langu_out; unmatched levels are left as is.
      # !!!! col_langu_in and col_langu_out must be quoted column names
      replace_level <- function(df, lookup_df, col_langu_in, col_langu_out) {
        # 1) Build a dictionary-style named vector from the lookup df (2 cols)
        lookup_vec <- setNames(as.character(lookup_df[[col_langu_out]]),
                               as.character(lookup_df[[col_langu_in]]))
        # 2) Iterate over the column names of the main df
        for (i in names(df)) { # select cols?: names(df)[sapply(df, is.factor)]
          # Convert character columns to factor before the translation
          if (is.character(df[[i]])) df[[i]] <- as.factor(df[[i]])
          # Skip columns that have no levels (e.g. numeric ones)
          if (!is.factor(df[[i]])) next
          # 3) Index of the levels of this column that appear in the dictionary
          index_match <- which(levels(df[[i]]) %in% names(lookup_vec))
          # 4) Replace the matchable levels with their translation
          levels(df[[i]])[index_match] <-
            lookup_vec[levels(df[[i]])[index_match]]
        }
        return(df)
      }

      # test here
      replace_level(df1, lookup_dat, "lab_pt", "lab_en")
      

      【Comments】:

      • It is working now. It might be of interest to someone. It works for me ;-).