【Question title】: R: merge two dataframes based on substring
【Posted】: 2021-02-01 11:43:07
【Question】:

I have two data frames. df1 looks like this:

           Day     Element    Incident
1   2020-04-06     3101       Check incident by SOILING
2   2020-04-02     3102       Check alarm 5662
3   2020-05-21     3101       Check energy loss by METEO ERROR
4   2020-04-02     3202       Check ACDC grid

The other one, df2, looks like this:

         Day     Element  Incident       Energy_loss
1 2020-04-06     3101     SOILING        0.05
2 2020-04-14     3101     SOILING        0.01
3 2020-05-21     3101     METEO ERROR    0.11
4 2020-06-15     3102     METEO ERROR    0.03

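I want to merge them on the Day, Element and Incident columns, so I need to find the rows where the Incident column of df1 contains the value of the Incident column of df2. Rows of df1 that have no match in df2 can be left as NaN in the Energy loss column.

I have tried a regular merge, but because one of the merge conditions is a substring match rather than an exact match, it does not work properly.

The output I expect is: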

           Day     Element    Incident                          Energy loss
1   2020-04-06     3101       Check incident by SOILING                0.05
2   2020-04-02     3102       Check alarm 5662                          Nan
3   2020-05-21     3101       Check energy loss by METEO ERROR         0.11
4   2020-04-02     3202       Check ACDC grid                           Nan

【Comments】:

    Tags: r merge substring


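    【Solution 1】:

    We can use regex_left_join from the fuzzyjoin package. It treats the values in the join columns of df2 as regular expressions and matches them against the corresponding columns of df1, so a df2 Incident such as SOILING matches any df1 Incident that contains it.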

    library(dplyr)
    library(fuzzyjoin)
    regex_left_join(df1, df2, by = c('Day', 'Element', 'Incident')) %>%
        # keep df1's versions of the key columns (suffix .x) plus Energy_loss from df2
        select(Day = Day.x, Element = Element.x, Incident = Incident.x, Energy_loss)
    

    -Output

    #       Day Element                         Incident Energy_loss
    #1 2020-04-06    3101        Check incident by SOILING        0.05
    #2 2020-04-02    3102                 Check alarm 5662          NA
    #3 2020-05-21    3101 Check energy loss by METEO ERROR        0.11
    #4 2020-04-02    3202                  Check ACDC grid          NA
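
Because regex_left_join applies regex matching to every column listed in `by`, one possible variation is to compare Day and Element exactly and only treat Incident as a literal substring. The sketch below is not part of the original answer; it assumes fuzzyjoin's `fuzzy_left_join` with a list of per-column match functions and stringr's `fixed()` for literal matching, and the result name `res` is chosen here for illustration.

```r
library(dplyr)
library(fuzzyjoin)
library(stringr)

# Sample data, same values as in the question
df1 <- data.frame(
  Day = c("2020-04-06", "2020-04-02", "2020-05-21", "2020-04-02"),
  Element = c(3101L, 3102L, 3101L, 3202L),
  Incident = c("Check incident by SOILING", "Check alarm 5662",
               "Check energy loss by METEO ERROR", "Check ACDC grid"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Day = c("2020-04-06", "2020-04-14", "2020-05-21", "2020-06-15"),
  Element = c(3101L, 3101L, 3101L, 3102L),
  Incident = c("SOILING", "SOILING", "METEO ERROR", "METEO ERROR"),
  Energy_loss = c(0.05, 0.01, 0.11, 0.03),
  stringsAsFactors = FALSE
)

res <- fuzzy_left_join(
  df1, df2,
  by = c("Day", "Element", "Incident"),
  match_fun = list(
    `==`,                                   # Day: exact equality
    `==`,                                   # Element: exact equality
    function(x, y) str_detect(x, fixed(y))  # Incident: df2 value is a literal substring of df1 value
  )
) %>%
  # keep df1's versions of the key columns (suffix .x) plus Energy_loss from df2
  select(Day = Day.x, Element = Element.x, Incident = Incident.x, Energy_loss)

print(res)
```

Using `fixed()` means characters such as `.` or `(` in df2's Incident values would be matched literally instead of being interpreted as regex syntax.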
    

    Data

    df1 <- structure(list(Day = c("2020-04-06", "2020-04-02", "2020-05-21", 
    "2020-04-02"), Element = c(3101L, 3102L, 3101L, 3202L), 
    Incident = c("Check incident by SOILING", 
    "Check alarm 5662", "Check energy loss by METEO ERROR", "Check ACDC grid"
    )), class = "data.frame", row.names = c("1", "2", "3", "4"))
    
    df2 <- structure(list(Day = c("2020-04-06", "2020-04-14", "2020-05-21", 
    "2020-06-15"), Element = c(3101L, 3101L, 3101L, 3102L), Incident = c("SOILING", 
    "SOILING", "METEO ERROR", "METEO ERROR"), Energy_loss = c(0.05, 
    0.01, 0.11, 0.03)), class = "data.frame", row.names = c("1", 
    "2", "3", "4"))
    

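If adding fuzzyjoin is not an option, the same result can be approximated in base R. The following is a minimal sketch, not taken from the original answer: `lookup_loss` is a hypothetical helper name, and it assumes that when several df2 rows match a df1 row, taking the first one is acceptable.

```r
# Sample data, same values as in the question
df1 <- data.frame(
  Day = c("2020-04-06", "2020-04-02", "2020-05-21", "2020-04-02"),
  Element = c(3101L, 3102L, 3101L, 3202L),
  Incident = c("Check incident by SOILING", "Check alarm 5662",
               "Check energy loss by METEO ERROR", "Check ACDC grid"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  Day = c("2020-04-06", "2020-04-14", "2020-05-21", "2020-06-15"),
  Element = c(3101L, 3101L, 3101L, 3102L),
  Incident = c("SOILING", "SOILING", "METEO ERROR", "METEO ERROR"),
  Energy_loss = c(0.05, 0.01, 0.11, 0.03),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find the Energy_loss for one row of df1
lookup_loss <- function(day, element, incident) {
  # candidates: df2 rows with exactly the same Day and Element
  cand <- df2[df2$Day == day & df2$Element == element, , drop = FALSE]
  # literal substring test: is df2's Incident contained in df1's Incident?
  hit <- vapply(cand$Incident, grepl, logical(1), x = incident, fixed = TRUE)
  # assumption: take the first match; NA when nothing matches
  if (any(hit)) cand$Energy_loss[hit][1] else NA_real_
}

df1$Energy_loss <- mapply(lookup_loss, df1$Day, df1$Element, df1$Incident,
                          USE.NAMES = FALSE)
print(df1$Energy_loss)
# 0.05   NA 0.11   NA
```

This loops over the rows of df1, so it is meant for small tables like the one in the question rather than large data sets.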
    【Discussion】:
