[Question title]: Merging two data frames based on the maximum number of words in common in R
[Posted]: 2021-07-30 13:31:46
[Question description]:

I have two data.frames, one containing partial names and the other containing full names, as shown below:

partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF",
"wizz air", "WeMove.eu", "ILU"))
full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe",
"World Wide Fundation (WWF)", "(ILU)", "Ilusion"))

In an ideal world, I would like to end up with a table like this (my real partial df has 12 794 rows):

print(partial)
partial    full
Apple      Apple Inc
Apple      Apple Inc
WWF        World Wide Fundation (WWF)
wizz air   wizzair
WeMove.eu  We Move Europe
... 12 794 total rows

For every row without a match, I would like the value to be NA.

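I have tried many things: fuzzyjoin, regex, regex_left_join, and even the sqldf package. I get some results, but I think they would be better if regex_left_join understood that I am looking for whole words. I know that boundary(type = c("word")) exists in stringr, but I don't know how to apply it here.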

So far, I have only prepared the partial df by removing non-alphanumeric characters and converting the names to lowercase.

library(stringr)
# Replace each run of non-word characters with a space, trim, then lowercase
partial$regex <- str_squish(str_replace_all(partial$partial.name, regex("\\W+"), " "))
partial$regex <- tolower(partial$regex)
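
For word-based matching, both columns would presumably need the same cleaning, not only the partial names. A minimal sketch in base R, assuming a hypothetical helper `normalize_name` (not part of the original post) that roughly mirrors the stringr steps above:

```r
# Hypothetical helper: replace runs of non-word characters with a space,
# trim the ends, and lowercase (base R approximation of the stringr steps).
normalize_name <- function(x) {
  tolower(trimws(gsub("\\W+", " ", x)))
}

partial <- data.frame(partial.name = c("Apple", "Apple", "WWF",
                                       "wizz air", "WeMove.eu", "ILU"))
full <- data.frame(full.name = c("Apple Inc", "wizzair", "We Move Europe",
                                 "World Wide Fundation (WWF)", "(ILU)", "Ilusion"))

# Apply the same cleaning to both sides so the words are comparable
partial$regex <- normalize_name(partial$partial.name)
full$regex <- normalize_name(full$full.name)
```

After this step "(ILU)" becomes "ilu" and "WeMove.eu" becomes "wemove eu", while "wizz air" and "wizzair" still differ, so cleaning alone cannot pair them.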

How can I match partial$partial.name to full$full.name based on the maximum number of words in common?

[Question comments]:

    Tags: r stringr sqldf stringdist fuzzyjoin


    [Solution 1]:

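    Partial string matching takes a long time to get right. I believe the Jaro-Winkler distance is a good candidate, but you will need to spend time tuning the parameters. Here is an example to get you started.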

    library(stringdist)
    
    partial <- data.frame("partial.name" = c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU", "None"), stringsAsFactors = FALSE)
    full <- data.frame("full.name" = c("Apple Inc", "wizzair", "We Move Europe", "World Wide Foundation (WWF)", "(ILU)", "Ilusion"), stringsAsFactors = FALSE)
    
    # Return the full name closest to `partial`, or NA if nothing is close enough.
    # With method = 'jw', p = 0 gives the plain Jaro distance; a value of p in
    # (0, 0.25] adds the Winkler prefix weighting.
    mydist <- function(partial, list_of_fulls, method = 'jw', p = 0, threshold = 0.4) {
        # stringdist() is vectorised: one distance per candidate full name
        distances <- stringdist(a = list_of_fulls, b = partial, method = method, p = p)
        # If the distance is too great assume NA
        if (min(distances) > threshold) {
            NA_character_
        } else {
            list_of_fulls[which.min(distances)]
        }
    }
    
    partial$match <- vapply(partial$partial.name, mydist, character(1), list_of_fulls = full$full.name, method = 'jw', USE.NAMES = FALSE)
    
    partial
    #  partial.name                       match
    #1        Apple                   Apple Inc
    #2        Apple                   Apple Inc
    #3          WWF World Wide Foundation (WWF)
    #4     wizz air                     wizzair
    #5    WeMove.eu              We Move Europe
    #6          ILU                       (ILU)
    #7         None                        <NA>
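
The question asks for matching on the number of whole words in common, which the distance-based code above does not do directly. A minimal base R sketch of that literal reading follows; the helpers `normalize_name`, `count_shared_words`, and `match_by_words` are hypothetical names introduced here for illustration, and ties are resolved by taking the first best candidate:

```r
# Hypothetical helper: lowercase and reduce punctuation to single spaces
normalize_name <- function(x) tolower(trimws(gsub("\\W+", " ", x)))

# Number of distinct words that two already-normalised strings share
count_shared_words <- function(a, b) {
  words_a <- unique(strsplit(a, " ", fixed = TRUE)[[1]])
  words_b <- unique(strsplit(b, " ", fixed = TRUE)[[1]])
  length(intersect(words_a, words_b))
}

# For each partial name, return the full name sharing the most words, or NA
# when no word is shared. Assumes full_names is non-empty and has no NAs.
match_by_words <- function(partial_names, full_names) {
  full_norm <- normalize_name(full_names)
  vapply(normalize_name(partial_names), function(p) {
    shared <- vapply(full_norm, count_shared_words, integer(1), b = p)
    if (max(shared) == 0L) NA_character_ else full_names[which.max(shared)]
  }, character(1), USE.NAMES = FALSE)
}

partial_names <- c("Apple", "Apple", "WWF", "wizz air", "WeMove.eu", "ILU", "None")
full_names <- c("Apple Inc", "wizzair", "We Move Europe",
                "World Wide Foundation (WWF)", "(ILU)", "Ilusion")

word_match <- match_by_words(partial_names, full_names)
```

On this sample data, exact word overlap pairs "Apple", "WWF", and "ILU" with their full names but returns NA for "wizz air" and "WeMove.eu", because those differ from their targets in spacing rather than in whole words. That gap is where a string distance such as the one above can help, possibly as a fallback for rows that word overlap leaves unmatched.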
    

    [Discussion]:
