【Question title】: Replace values in one column based on part of text in another dataframe in R
【Posted】: 2021-08-20 10:55:59
【Question】:

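This is my first post on Stack Overflow and English is not my first language, so I apologize in advance for any grammar or programming mistakes.

I need to replace the values in one column of my data frame based on partial matches against values in another data frame. My problem is similar to the post here, but in that example all the possible misspellings are listed. In my case, I only need part of the string to know whether a value needs to be replaced.

I have already tried `if_else` (from dplyr) together with `grepl`. `grepl` works as long as the second data frame has only one row; as soon as I add another row, it fails.

My real data frame has about 30k rows and 33 variables, and the second data frame with the correct values will probably grow every month, so I am trying to avoid loops.

I made a mock table with random data to simulate what I need: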

library(dplyr)


df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"), 
                  Value = c(100,200,300,400,200, 100,200,40,150,70))
                  
                  
df2 <- data.frame(Supplier =c("CC","EE","GG"), 
                  New_Supplier = c("Red","Blue","Green"))


# Example 1: unfortunately this won't work unless I have an exact match:
df1$Supplier <- if_else(df1$Supplier %in% df2$Supplier, df2$New_Supplier, df1$Supplier)

# Example 2: only works if df2 has a single row:
df1$Supplier <- if_else(grepl(df2$Supplier, df1$Supplier), df2$New_Supplier, df1$Supplier)

So I have this in the first data frame:

   Supplier Value
1       AAA   100
2       CCC   200
3       CCE   300
4       DDD   400
5       EEE   200
6       EED   100
7       GGG   200
8       HHH    40
9       III   150
10      JJJ    70

And this in the second data frame:

  Supplier New_Supplier
1       CC          Red
2       EE         Blue
3       GG        Green

My final goal is to have something like this:

   Supplier Value
1       AAA   100
2       Red   200
3       Red   300
4       DDD   400
5      Blue   200
6      Blue   100
7     Green   200
8       HHH    40
9       III   150
10      JJJ    70

Thanks in advance!

【Comments】:

    Tags: r dataframe replace


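    【Solution 1】:

    This looks like a case for `regex_left_join` from the fuzzyjoin package. After the `regex_left_join`, `coalesce` the two columns so that the first non-NA element is returned for each row.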

    library(fuzzyjoin)
    library(dplyr)
    regex_left_join(df1, df2, by = 'Supplier') %>% 
        transmute(Supplier = coalesce(New_Supplier, Supplier.x), Value)
    

    -Output

       Supplier Value
    1       AAA   100
    2       Red   200
    3       Red   300
    4       DDD   400
    5      Blue   200
    6      Blue   100
    7     Green   200
    8       HHH    40
    9       III   150
    10      JJJ    70
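
To make the `coalesce` step concrete, here is a minimal sketch on hand-written vectors rather than on the actual join output. The names `new_name` and `old_name` are illustrative stand-ins for the `New_Supplier` and `Supplier.x` columns; the assumption is that rows without a match carry `NA` in the column that comes from `df2`, as a left join would leave them.

```r
library(dplyr)

# Hypothetical stand-ins for the joined columns: rows without a match
# have NA in the column that comes from df2.
new_name <- c(NA, "Red", "Red", NA, "Blue")
old_name <- c("AAA", "CCC", "CCE", "DDD", "EEE")

# coalesce() returns, position by position, the first non-NA value,
# so matched rows take the new name and the rest keep the old one.
result <- coalesce(new_name, old_name)
print(result)
```

Putting `new_name` first is what gives the replacement priority over the original value.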
    

    【Comments】:

      【Solution 2】:

      A base R approach:

      # Coerce 0 length vectors to na values of the appropriate type: 
      # zero_to_nas => function()
      zero_to_nas <- function(x){
        if(identical(x, character(0))){
          res <- NA_character_ 
        }else if(identical(x, integer(0))){
          res <- NA_integer_
        }else if(identical(x, numeric(0))){
          res <- NA_real_
        }else if(identical(x, complex(0))){
          res <- NA_complex_
        }else if(identical(x, logical(0))){
          res <- NA
        }else{
          res <- x
        }
        
        # If the result is Null return the vector:
        if(is.null(res)){
          res <- x
        }else{
          invisible() 
        }
        
        # Explicitly define returned object: vector => Global Env
        return(res)
        
      }
      
      # Unlist handling 0 length vectors: list_2_vec => function()
      list_2_vec <- function(lst){
        # Unlist cleaned list: res => vector
        res <- unlist(lapply(lst, zero_to_nas))
        # Explicitly define return object: vector => GlobalEnv()
        return(res)
      }
      
      # Function to perform a fuzzy match: 
      # fuzzy_match => function()
      fuzzy_match <- function(vec_to_match_to, vec_to_match_on){
        # Perform a fuzzy match: res => character vector:
        res <- list_2_vec(
          regmatches(
            vec_to_match_to, 
            gregexpr(
              paste0(
                vec_to_match_on, 
                collapse = "|"
              ),
              vec_to_match_to
            )
          )
        )
        # Explicitly define returned object: 
        # character vector => Global Env
        return(res)
      }
      
      # Function to coalesce vectors: br_coalesce => function()
      br_coalesce <- function(vec, ..., to_vec = TRUE){
        
        # Coalesce the vectors: res_ir => list
        res_ir <- apply(
          cbind(
              as.list(...), 
              as.list(vec)
            ),
          1,
          function(x){
            head(zero_to_nas(x[!(is.na(x))]), 1)
          }
        )
        
        # If the result is null return the original vector:
        if(is.null(unlist(res_ir))){
          res_ir <- vec
        }else{
          invisible() 
        }
      
        # If we want the result to be a vector not a list then:
        if(isTRUE(to_vec)){
          # Unlist the resultant list: res => vector
          res <- unlist(res_ir)
          # Otherwise
        }else{
          # Deep copy the list: res => list
          res <- res_ir
        }
        
        # Explicitly define returned object: 
        # list or vector => Global Env
        return(res)
        
      }
      
      # Apply the fuzzy match and coalesce functions: 
      # clean_df => data.frame
      clean_df <- transform(
        df1, 
        Supplier = br_coalesce(
          df1$Supplier, 
          df2$New_Supplier[
            match(
              fuzzy_match(
                df1$Supplier, 
                df2$Supplier
              ),
              df2$Supplier
            )
          ]
        )
      )
      

      Data:

      df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"), 
                        Value = c(100,200,300,400,200, 100,200,40,150,70))
      
      
      df2 <- data.frame(Supplier =c("CC","EE","GG"), 
                        New_Supplier = c("Red","Blue","Green"))
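
For comparison, the same result can probably be reached with a much shorter base R helper. This is only a sketch, not the method used above: `replace_by_pattern` is a hypothetical name, it assumes the lookup values are literal substrings (hence `fixed = TRUE`; drop that argument to treat them as regular expressions), and it assumes that the first matching row of `df2` should win when several match. The loop runs over the rows of `df2`, not over the 30k rows of `df1`, so each step is still vectorized.

```r
df1 <- data.frame(Supplier = c("AAA","CCC","CCE","DDD","EEE","EED","GGG","HHH","III","JJJ"),
                  Value = c(100,200,300,400,200,100,200,40,150,70))
df2 <- data.frame(Supplier = c("CC","EE","GG"),
                  New_Supplier = c("Red","Blue","Green"))

# Hypothetical helper: replace each element of x that contains one of
# `patterns` with the corresponding entry of `replacements`.
replace_by_pattern <- function(x, patterns, replacements) {
  x <- as.character(x)
  patterns <- as.character(patterns)
  replacements <- as.character(replacements)
  out <- x
  done <- rep(FALSE, length(x))   # elements that were already replaced
  for (i in seq_along(patterns)) {
    # Vectorized over all of x; only elements not yet replaced qualify.
    hit <- !done & grepl(patterns[i], x, fixed = TRUE)
    out[hit] <- replacements[i]
    done <- done | hit
  }
  out
}

clean_df <- transform(
  df1,
  Supplier = replace_by_pattern(Supplier, df2$Supplier, df2$New_Supplier)
)
print(clean_df)
```

Because matching is always done against the original values in `x`, a replacement such as "Green" can never be matched again by a later pattern.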
      

      【Comments】:
