【Question title】: R: Match data tables using string matching
【Posted】: 2017-05-02 15:06:05
【Question description】:

I have two data tables:

library(data.table)

dt1 <- data.table(V1=c("Apple Pear Orange, AAA111", "Grapes Banana Pear .BBB222", "Orange Kiwi Melon ,CCC333.", "Apple DDD444, Pear Orange", "Kiwi Melon Orange, CCC333", "Apple Pear Orange, AAA111", "Tomato Cucumber-EEE222", "Seagull Pigeon ZZZ111" ), stringsAsFactors = FALSE)

dt2 <- data.table(Code=c("AAA111", "AAA222", "AAA333", "AAA444", "AAA555", "AAA666", "BBB111", "BBB222", "BBB333", "BBB444", "BBB555", "BBB666", "CCC111", "CCC222", "CCC333", "CCC444", "CCC555", "CCC666", "DDD111", "DDD222", "DDD333", "DDD444", "DDD555", "DDD666", "EEE111", "EEE222", "EEE333", "EEE444", "EEE555", "EEE666"), stringsAsFactors = FALSE)
dt2$Ref <- seq_len(nrow(dt2))  # running reference number, 1..30

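Each row of dt1 contains an unformatted string with a "code" embedded somewhere in it. dt2 contains the list of codes that can be matched. What I am after is a way to identify the "code" part of the string in each row of dt1 and then match it to the corresponding code in dt2. If dt2 has no matching code, NA should be returned.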

This is the kind of output I am after:

dt3 <- data.table(V1=c("Apple Pear Orange, AAA111", "Grapes Banana Pear .BBB222", "Orange Kiwi Melon ,CCC333.", "Apple DDD444, Pear Orange", "Kiwi Melon Orange, CCC333", "Apple Pear Orange, AAA111", "Tomato Cucumber-EEE222", "Seagull Pigeon ZZZ111"), Code=c("AAA111", "BBB222", "CCC333", "DDD444", "CCC333", "AAA111", "EEE222", NA), Ref=c(1L, 8L, 15L, 22L, 15L, 1L, 26L, NA), stringsAsFactors = FALSE)

I have tried to find a solution using regular expressions, grep and so on, but have not found anything that works.

【Question comments】:

    Tags: r data.table string-matching


    【Solution 1】:

    You can use regex_left_join from my fuzzyjoin package:

    library(fuzzyjoin)
    regex_left_join(dt1, dt2, by = c(V1 = "Code"))
    #>                            V1   Code Ref
    #> 1:  Apple Pear Orange, AAA111 AAA111   1
    #> 2: Grapes Banana Pear .BBB222 BBB222   8
    #> 3: Orange Kiwi Melon ,CCC333. CCC333  15
    #> 4:  Apple DDD444, Pear Orange DDD444  22
    #> 5:  Kiwi Melon Orange, CCC333 CCC333  15
    #> 6:  Apple Pear Orange, AAA111 AAA111   1
    #> 7:     Tomato Cucumber-EEE222 EEE222  26
    #> 8:      Seagull Pigeon ZZZ111     NA  NA
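
For readers who cannot install fuzzyjoin, the same idea can be roughly sketched in base R: test every code in dt2 against each string in dt1 and keep the first one that occurs. This is only an illustration under assumptions, not how fuzzyjoin is implemented. The helper name `first_code_hit` is hypothetical, the data is a shortened copy of the question's tables, and the codes are treated as literal strings rather than regular expressions.

```r
# Shortened toy data mirroring the question (plain data frames;
# data.table columns can be used the same way here).
dt1 <- data.frame(V1 = c("Apple Pear Orange, AAA111",
                         "Apple DDD444, Pear Orange",
                         "Seagull Pigeon ZZZ111"),
                  stringsAsFactors = FALSE)
dt2 <- data.frame(Code = c("AAA111", "BBB222", "DDD444"),
                  Ref  = c(1L, 8L, 22L),
                  stringsAsFactors = FALSE)

# Hypothetical helper: position of the first entry of `codes` that occurs
# inside `text`, or NA when none does. fixed = TRUE makes grepl() treat
# each code as a literal string instead of a regular expression.
first_code_hit <- function(text, codes) {
  hits <- which(vapply(codes, grepl, logical(1), x = text, fixed = TRUE))
  if (length(hits) > 0) hits[[1]] else NA_integer_
}

# One index into dt2 per row of dt1 (NA where no code was found).
idx <- vapply(dt1$V1, first_code_hit, integer(1),
              codes = dt2$Code, USE.NAMES = FALSE)

# Indexing with NA yields NA, so unmatched rows get NA in Code and Ref.
dt3 <- data.frame(V1 = dt1$V1,
                  Code = dt2$Code[idx],
                  Ref = dt2$Ref[idx],
                  stringsAsFactors = FALSE)
print(dt3)
```

One difference worth keeping in mind: this sketch keeps at most one code per row, whereas a join returns one output row per matching pair, so a string containing two known codes would appear twice in the join result.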
    

    【Comments】:

    • Thanks. This does exactly what I wanted.