【Question title】: How can I get the precise common "max.distance" value for fuzzy string matching using agrep?
【Posted】: 2018-09-11 10:25:58
【Question description】:

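I am trying to use agrep to find the best precision for fuzzy string matching between two strings of names.

However, I need to pick a single precision "max.distance" that is applied uniformly to all the strings I am trying to match, because the number of strings is large. It is not feasible to choose the best "max.distance" value for each string individually.

For example, suppose I use a "max.distance" of 0.2, 0.1 and 0.05 for each of "BANK OF AMERICA CORP" and "1st Capital Bank".

First, here are the results for "BANK OF AMERICA CORP" with a "max.distance" of 0.2, 0.1 and 0.05: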

    > agrep("BANK OF AMERICA CORP",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.2)
     [1] "BANK OF AMERICA/PRIVATE BANK WEST"   "BANK OF AMERICA SECURITIES"         
     [3] "BANK OF AMERICA SEC LLC"             "BANK OF AMERICA SECURITIES LLC"     
     [5] "BANK OF AMERICA NT & SA"             "BANK OF AMERICA CORP"               
     [7] "ALLIANZ OF AMERICA CORP"             "Bank of America Securities/Vice Pre"
     [9] "Bank of America Securities/Investme" "Bank of America/President"          
    [11] "Bank of America Securities LLC/Prin" "Bank of America Securities LLC/Mana"
    [13] "Bank of America Securities LLC/Inve" "Bank of America Securities/Principa"
    [15] "Bank of America Securities LLC/Bank" "Bank of America Sec/Investment Bank"
    [17] "Bank Of America Securities/Managing" "Bank of America/Chairman--Midwest A"
    [19] "Bank of America Securities LLC/Vice" "Bank of America Corporation/Sales C"
    [21] "Bank of America Securities/Broker"   "Bank of America Corporation/Banker" 
    [23] "Bank of America Corporation/Senior"  "Bank of America Securities/Equity R"
    [25] "Bank of America Corporation/Vice Ch" "BANK OF AMERICA CORPORATION"        
    [27] "BANK OF AMERICA HEADQUARTERS"        "BANK OF AMERICA ADMINISTRATION"     
    [29] "BANK OF AMERICA N A"                 "Bank of America/Commercial Banking" 
    [31] "Bank of America Sec./Investment Ban"
    > 
    > agrep("BANK OF AMERICA CORP",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.1)
    [1] "BANK OF AMERICA CORP"                "ALLIANZ OF AMERICA CORP"            
    [3] "Bank of America Corporation/Sales C" "Bank of America Corporation/Banker" 
    [5] "Bank of America Corporation/Senior"  "Bank of America Corporation/Vice Ch"
    [7] "BANK OF AMERICA CORPORATION"        
    > 
    > agrep("BANK OF AMERICA CORP",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.05)
    [1] "BANK OF AMERICA CORP"                "Bank of America Corporation/Sales C"
    [3] "Bank of America Corporation/Banker"  "Bank of America Corporation/Senior" 
    [5] "Bank of America Corporation/Vice Ch" "BANK OF AMERICA CORPORATION"        

Next, here are the results for "1st Capital Bank" with a "max.distance" of 0.2, 0.1 and 0.05:

    > agrep("1st Capital Bank",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.2)
      [1] "HURST CAPITAL PARTNERS"             
      [2] "SOY CAPITAL BANK"                   
      [3] "FIRST CAPITOL BANK OF VICTOR"       
      [4] "OSTERWEIS CAPITAL MANAGEMENT"       
      [5] "1ST NATIONAL BANK"                  
      [6] "FIRST CAPITAL BANK"                 
      [7] "SEATTLE 1ST NAT'L BANK"             
      [8] "FIELD POINT CAPITAL MANAGEMENT"     
      [9] "SUMMERSET CAPITAL MANAGEMENT"       
     [10] "AMERIQUEST CAPITAL ASSOC"           
     [11] "BB&T CAPITAL MARKETS"               
     [12] "HUGHES CAPITAL MANAGEMENT"          
     [13] "WELLS CAPITAL MANAGEMENT"           
     [14] "SUPERIOR ST CAPITAL ADVISORS"       
     [15] "ORMES CAPITAL MARKETS INC"          
     [16] "1ST NAT'L BANK OF IL"               
     [17] "ADVENT CAPITAL MANAGEMENT"          
     [18] "1ST CAPITOL BANK"                   
     [19] "BIONDI REISS CAPITAL MANAGEMENT"    
     [20] "CCYBYS CAPITAL MARKETS"             
     [21] "SEACOAST CAPITAL PARTNERS"          
     [22] "DOUGLAS CAPITAL MANAGEMENT"         
     [23] "HIGHFIELDS CAPITAL MANAGEMENT"      
     [24] "PRECEPT CAPITAL MANAGEMENT LP"      
     [25] "AUGUST CAPITAL MANAGEMENT"          
     [26] "SAKSA CAPITAL MANAGEMENT"           
     [27] "IMS CAPITAL MANAGEMENT"             
     [28] "TRENT CAPITAL MANAGEMENT"           
     [29] "Ormes Capital Management"           
     [30] "GARNET CAPITAL MANAGEMENT LLC"      
     [31] "INTERFASE CAPITAL MANAGERS"         
     [32] "RJS CAPITAL MANAGEMENT INC"         
     [33] "1ST NATIONAL BANK OF DE KALB"       
     [34] "1ST NAT'L BANK OF PHILLIPS CO"      
     [35] "1ST NAT'L BANK OF OKLAHOMA"         
     [36] "PROGRESS CAPITAL MANAGEMENT INC"    
     [37] "CAPITAL BANK & TRUST"               
     [38] "1ST NATL BANK"                      
     [39] "ASB Capital Management/Real Estate" 
     [40] "Sears Capital Management"           
     [41] "Osterweis Capital Management/Invest"
     [42] "Cerberus Capital Management LP/Asse"
     [43] "LVS Capital Management/President"   
     [44] "1st Central Bank/Banker"            
     [45] "Summit Capital Management"          
     [46] "Orwes Capital Markets/Stockbroker"  
     [47] "Ormes Capital Management/Investment"
     [48] "Nevis Capital Management/Investment"
     [49] "Duncan Hurst Capital Management"    
     [50] "Progress Capital Management/Preside"
     [51] "Cerberus Capital Management LP"     
     [52] "Wit Capital/Banker"                 
     [53] "Ormes Capital Markets Inc."         
     [54] "Ormes Capital Markets/President & C"
     [55] "Berents & Hess Capital Management"  
     [56] "Progress Capital Management/Venture"
     [57] "First Capital Bank of KY"           
     [58] "Foothill Capital/Banker"            
     [59] "Pequot Capital Management/Equity Re"
     [60] "First Dominion Capital/Banking"     
     [61] "Greenwhich Capital/Banker"          
     [62] "Veritas Capital Management/Banker"  
     [63] "Veritas Capital Management/Investme"
     [64] "Lesese Capital Management/Investmen"
     [65] "Douglas Capital Management/Investme"
     [66] "FIRST NATINAL BANK OF AMARILLO"     
     [67] "NEVIS CAPITAL MANAGEMENT"           
     [68] "VERITAS CAPITAL MANAGEMENT"         
     [69] "SIEBERT CAPITAL MARKETS"            
     [70] "HOURGLASS CAPITAL MANAGEMENT"       
     [71] "1ST NATIONAL BANK DALHART"          
     [72] "TEXAS CAPITAL BANK"                 
     [73] "NICHOLAS CAPITAL MANAGEMENT"        
     [74] "CERBUS CAPITAL MANAGEMENT"          
     [75] "CROESUS CAPITAL MANAGEMENT"         
     [76] "EAST WEST CAPITAL ASSOCIATES INC"   
     [77] "PRENDERGAST CAPITAL MANAGEMENT"     
     [78] "NANTUCKET CAPITAL MANAGEMENT"       
     [79] "1ST NATIONAL BANK TEMPLE"           
     [80] "ENTRUST CAPITAL INC"                
     [81] "1ST NATIONAL BANK OF IL"            
     [82] "SIMMS CAPITAL MANAGEMENT"           
     [83] "FIRST CAPITAL ADVISORS"             
     [84] "FIRST CAPITAL MANAGEMENT LTD"       
     [85] "1ST NATIONAL BANK & TRUST"          
     [86] "PENTECOST CAPITAL MANAGEMENT INC"   
     [87] "EAST-WEST CAPITAL ASSOCIATES"       
     [88] "1ST NAT'L BANK OF JOLIET"           
     [89] "FIRST CAPITOL BANK OF VICTO"        
     [90] "FIRST CAPITAL FINANCIAL"            
     [91] "PACIFIC COAST CAPITAL PARTNERS"     
     [92] "FIRST CAPITOL BANK"                 
     [93] "FIRST CAPITAL ENGINEERING"          
     [94] "MIDWEST CAPITOL MANAGEMENT"         
     [95] "PEQUOT CAPITAL MANAGEMENT"          
     [96] "AGGOTT CAPITAL MANAGEMENT"          
     [97] "SIMMS CAPITAL MANAGEMENT INC"       
     [98] "PHILLIPS CAPITAL MANAGEMENT LLC"    
     [99] "1ST NATIONAL BANK OF COLD SP"       
    [100] "SOY CAPITOL BANK"                   
    > 
    > agrep("1st Capital Bank",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.1)
    [1] "FIRST CAPITOL BANK OF VICTOR" "FIRST CAPITAL BANK"          
    [3] "1ST CAPITOL BANK"             "First Capital Bank of KY"    
    [5] "TEXAS CAPITAL BANK"           "FIRST CAPITOL BANK OF VICTO" 
    [7] "FIRST CAPITOL BANK"          
    > 
    > agrep("1st Capital Bank",C1999_0[,2],ignore.case = TRUE, value = TRUE,fixed = TRUE,max.distance =0.05)
    [1] "FIRST CAPITAL BANK"       "1ST CAPITOL BANK"        
    [3] "First Capital Bank of KY"

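As you can see, it is hard to find one common "max.distance" value that works for every string, such as "BANK OF AMERICA CORP" and "1st Capital Bank". Besides these two I have many more company names, which is why I am struggling to find a common precision value and command for fuzzy string matching.

The original data file for C1999_0 is too large to attach, so I hope the output shown above is enough to reproduce the problem.

I know there are several sub-options to deal with, such as costs, substitutions, insertions and so on, but they do not make much difference compared with just changing the "max.distance" value itself.

I would greatly appreciate any help with this!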

【Comments on the question】:

    Tags: r string-matching agrep


    【Answer 1】:

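    One issue with agrep is that it behaves like grep, as documented in help("agrep"):

    "Since someone who read the description carelessly even filed a bug report on it, do note that this matches substrings of each element of x (just as grep does) and not whole elements. See also adist in package utils, which optionally returns the offsets of the matched substrings."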
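To make the substring behaviour concrete, here is a small base-R sketch. The second candidate string is made up for illustration. agrepl() reports a match for both candidates because the pattern only has to fit some substring, whereas utils::adist() on the whole strings tells them apart.

```r
pattern <- "1st Capital Bank"
# Illustrative candidates; the second one only adds a suffix to the first.
candidates <- c("1ST CAPITOL BANK", "1ST CAPITOL BANK OF KY")

# agrepl() matches the pattern against substrings of each candidate,
# so the trailing " OF KY" does not count against the match.
substring_hit <- agrepl(pattern, candidates, max.distance = 1L, ignore.case = TRUE)

# adist() compares whole strings, so the extra characters count as edits.
whole_dist <- drop(utils::adist(tolower(pattern), tolower(candidates)))

print(substring_hit)
print(whole_dist)
```

With a whole-string distance the longer candidate is clearly further away, which is the property the approach below relies on.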

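    This seems to be the problem in your latter example, since you have many names that contain "Capital", "Bank", or both. What I would do is compute the Levenshtein distance on the whole strings (agrep uses this distance, or a generalized version of it, but only on substrings) and keep the candidates with the shortest distance. For example,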

    C1999 <- c(
      "HURST CAPITAL PARTNERS", "SOY CAPITAL BANK", "FIRST CAPITOL BANK OF VICTOR",
      "OSTERWEIS CAPITAL MANAGEMENT", "1ST NATIONAL BANK", "FIRST CAPITAL BANK",
      "SEATTLE 1ST NAT'L BANK", "FIELD POINT CAPITAL MANAGEMENT", "SUMMERSET CAPITAL MANAGEMENT",
      "AMERIQUEST CAPITAL ASSOC", "BB&T CAPITAL MARKETS", "HUGHES CAPITAL MANAGEMENT",
      "WELLS CAPITAL MANAGEMENT", "SUPERIOR ST CAPITAL ADVISORS", "ORMES CAPITAL MARKETS INC",
      "1ST NAT'L BANK OF IL", "ADVENT CAPITAL MANAGEMENT", "1ST CAPITOL BANK",
      "BIONDI REISS CAPITAL MANAGEMENT", "CCYBYS CAPITAL MARKETS", "SEACOAST CAPITAL PARTNERS",
      "DOUGLAS CAPITAL MANAGEMENT", "HIGHFIELDS CAPITAL MANAGEMENT", "PRECEPT CAPITAL MANAGEMENT LP",
      "AUGUST CAPITAL MANAGEMENT", "SAKSA CAPITAL MANAGEMENT", "IMS CAPITAL MANAGEMENT",
      "TRENT CAPITAL MANAGEMENT", "Ormes Capital Management", "GARNET CAPITAL MANAGEMENT LLC",
      "INTERFASE CAPITAL MANAGERS", "RJS CAPITAL MANAGEMENT INC", "1ST NATIONAL BANK OF DE KALB",
      "1ST NAT'L BANK OF PHILLIPS CO", "1ST NAT'L BANK OF OKLAHOMA", "PROGRESS CAPITAL MANAGEMENT INC",
      "CAPITAL BANK & TRUST", "1ST NATL BANK", "ASB Capital Management/Real Estate",
      "Sears Capital Management", "Osterweis Capital Management/Invest", "Cerberus Capital Management LP/Asse",
      "LVS Capital Management/President", "1st Central Bank/Banker", "Summit Capital Management",
      "Orwes Capital Markets/Stockbroker", "Ormes Capital Management/Investment", "Nevis Capital Management/Investment",
      "Duncan Hurst Capital Management", "Progress Capital Management/Preside", "Cerberus Capital Management LP",
      "Wit Capital/Banker", "Ormes Capital Markets Inc.", "Ormes Capital Markets/President & C",
      "Berents & Hess Capital Management", "Progress Capital Management/Venture", "First Capital Bank of KY",
      "Foothill Capital/Banker", "Pequot Capital Management/Equity Re", "First Dominion Capital/Banking",
      "Greenwhich Capital/Banker", "Veritas Capital Management/Banker", "Veritas Capital Management/Investme",
      "Lesese Capital Management/Investmen", "Douglas Capital Management/Investme", "FIRST NATINAL BANK OF AMARILLO",
      "NEVIS CAPITAL MANAGEMENT", "VERITAS CAPITAL MANAGEMENT", "SIEBERT CAPITAL MARKETS",
      "HOURGLASS CAPITAL MANAGEMENT", "1ST NATIONAL BANK DALHART", "TEXAS CAPITAL BANK",
      "NICHOLAS CAPITAL MANAGEMENT", "CERBUS CAPITAL MANAGEMENT", "CROESUS CAPITAL MANAGEMENT",
      "EAST WEST CAPITAL ASSOCIATES INC", "PRENDERGAST CAPITAL MANAGEMENT", "NANTUCKET CAPITAL MANAGEMENT",
      "1ST NATIONAL BANK TEMPLE", "ENTRUST CAPITAL INC", "1ST NATIONAL BANK OF IL",
      "SIMMS CAPITAL MANAGEMENT", "FIRST CAPITAL ADVISORS", "FIRST CAPITAL MANAGEMENT LTD",
      "1ST NATIONAL BANK & TRUST", "PENTECOST CAPITAL MANAGEMENT INC", "EAST-WEST CAPITAL ASSOCIATES",
      "1ST NAT'L BANK OF JOLIET", "FIRST CAPITOL BANK OF VICTO", "FIRST CAPITAL FINANCIAL",
      "PACIFIC COAST CAPITAL PARTNERS", "FIRST CAPITOL BANK", "FIRST CAPITAL ENGINEERING",
      "MIDWEST CAPITOL MANAGEMENT", "PEQUOT CAPITAL MANAGEMENT", "AGGOTT CAPITAL MANAGEMENT",
      "SIMMS CAPITAL MANAGEMENT INC", "PHILLIPS CAPITAL MANAGEMENT LLC", "1ST NATIONAL BANK OF COLD SP",
      "SOY CAPITOL BANK")
    
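    # Return the elements of y whose Levenshtein distance to x is within
    # tol of the smallest distance found (requires the stringdist package).
    func <- function(x, y, tol = 0L){
      dista <- stringdist::stringdist(x, y, method = "lv")
      min_dista <- min(dista)
      y[dista <= min_dista + tol]
    }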
    func("1st Capital Bank", C1999)
    #R [1] "Wit Capital/Banker"
    func("1st Capital Bank", C1999, 4L)
    #R [1] "Wit Capital/Banker"       "First Capital Bank of KY"
    func("1st Capital Bank", C1999, 10L)
    #R  [1] "SOY CAPITAL BANK"           "1ST NATIONAL BANK"         
    #R  [3] "FIRST CAPITAL BANK"         "1ST CAPITOL BANK"          
    #R  [5] "Ormes Capital Management"   "1ST NATL BANK"             
    #R  [7] "Sears Capital Management"   "1st Central Bank/Banker"   
    #R  [9] "Summit Capital Management"  "Wit Capital/Banker"        
    #R [11] "Ormes Capital Markets Inc." "First Capital Bank of KY"  
    #R [13] "Foothill Capital/Banker"    "Greenwhich Capital/Banker" 
    #R [15] "TEXAS CAPITAL BANK"         "FIRST CAPITOL BANK"        
    #R [17] "SOY CAPITOL BANK" 
    
    # ignoring case: lower-case both sides first so that case differences cost nothing
    func <- function(x, y, tol = 0L){
      dista <- stringdist::stringdist(tolower(x), tolower(y), method = "lv")
      min_dista <- min(dista)
      y[dista <= min_dista + tol]
    }
    func("1st Capital Bank", C1999, 0L)
    #R [1] "1ST CAPITOL BANK"
    

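    The tol argument of func controls whether to also include entries whose Levenshtein distance is up to tol above the minimum. I realize that I have not answered exactly what you asked (how to get the precise common "max.distance" value for fuzzy string matching using agrep), but I think my answer may be what you are looking for.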
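For readers without stringdist installed, the same whole-string idea can be sketched in base R with utils::adist(). The helper name closest_names() and the four candidate names are only illustrative. Returning the distance next to each name makes the effect of tol easier to inspect.

```r
# Hypothetical helper: keep the candidates whose whole-string Levenshtein
# distance to x is within tol of the smallest distance found.
closest_names <- function(x, candidates, tol = 0L) {
  # Compare lower-cased copies so that case differences cost nothing.
  d <- drop(utils::adist(tolower(x), tolower(candidates)))
  keep <- which(d <= min(d) + tol)
  out <- data.frame(name = candidates[keep], distance = d[keep],
                    stringsAsFactors = FALSE)
  # Closest candidates first.
  out[order(out$distance), , drop = FALSE]
}

candidates <- c("FIRST CAPITAL BANK", "1ST CAPITOL BANK",
                "TEXAS CAPITAL BANK", "WELLS CAPITAL MANAGEMENT")

nearest <- closest_names("1st Capital Bank", candidates)             # tol = 0
relaxed <- closest_names("1st Capital Bank", candidates, tol = 3L)   # up to 3 edits further
print(nearest)
print(relaxed)
```

Because tol is measured relative to the best match for each query, it may travel across different names somewhat better than a single absolute threshold, though that is something to check on your own data.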

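    I use stringdist::stringdist rather than adist because the former seems to be faster. It can still be a bit slow; I wish there were an R package in which one could set a maximum distance, but I have not come across one. That could make the computation of the (then upper-bounded) Levenshtein distance faster.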

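    【Comments】:

    • Thank you very much for your help. However, when I have a large amount of data to match, I get so many "NA"s that I cannot even see the entries that did match. Is there a way to sort out and show only the ones that matched? Or is this huge number of "NA"s some kind of error that arises because I use the data straight from the "read.table" command?
    • So do you have NAs in the data you pass to the function, or is it stringdist::stringdist that returns NAs?
    • I also find that when I change the length of the data, the results are not cumulative. For example, searching the original data C1999[1:10] gives results that differ from, and do not accumulate into, those from searching C1999[1:100]. I think this is a problem, and it worries me.
    • It is stringdist::stringdist that returns the NAs. My data has no NAs at all.
    • The point is that, as I wrote, agrep looks at substrings. For example, agrep("1st Capital Bank", "1ST CAPITOL BANK", max.distance = 1L, ignore.case = TRUE) and agrep("1st Capital Bank", "1ST CAPITOL BANK of some country I have never heard of", max.distance = 1L, ignore.case = TRUE) both give you a match. That is why you get such long output, much of which seems irrelevant. The solution I show looks at the whole string.
    【Answer 2】: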

    This seems like an intractable problem, because no single max.distance will work well for all input strings.
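One thing worth checking first is how many edits a fractional max.distance actually allows. According to ?agrep, a fraction is multiplied by the pattern length (times the maximal transformation cost) and rounded up. The sketch below assumes the default costs of 1 and simply reproduces that arithmetic for the two patterns from the question.

```r
patterns <- c("BANK OF AMERICA CORP", "1st Capital Bank")
fractions <- c(0.2, 0.1, 0.05)

# Fraction of the pattern length, rounded up to a whole number of edits
# (assuming every insertion, deletion and substitution costs 1).
allowed_edits <- sapply(patterns, function(p) ceiling(fractions * nchar(p)))
rownames(allowed_edits) <- fractions
print(allowed_edits)
```

Under that assumption both patterns get the same budget of 4, 2 and 1 edits, yet the result lists in the question differ enormously in length. That suggests the difficulty lies less in the threshold arithmetic and more in how common the words in a name are.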

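    It might be worth trying something like tf-idf to gauge how unusual a string is and scaling your max.distance accordingly. That way, "Ziggurat Mutual" could be matched more permissively than "First Bank National".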
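A very rough base-R sketch of that idea follows. Everything in it is an assumption made for illustration: the five sample names, the use of mean inverse document frequency as the "unusualness" score, and the linear mapping onto a budget of 1 to 4 edits. It is not a tuned or recommended scheme.

```r
# Illustrative sample; a real run would use the full list of names.
names_vec <- c("FIRST CAPITAL BANK", "1ST NATIONAL BANK", "TEXAS CAPITAL BANK",
               "SOY CAPITAL BANK", "ZIGGURAT MUTUAL")

# Split each name into upper-case word tokens.
tokens <- strsplit(toupper(names_vec), "[^A-Z0-9]+")

# Document frequency: in how many names does each token appear?
doc_freq <- table(unlist(lapply(tokens, unique)))

# Inverse document frequency; rarer tokens get larger values.
idf <- setNames(as.numeric(log(length(names_vec) / doc_freq)), names(doc_freq))

# Unusualness of a name = mean idf of its tokens.
unusualness <- vapply(tokens, function(tk) mean(idf[tk]), numeric(1))

# Map unusualness linearly onto an assumed range of 1 to 4 allowed edits.
scaled <- (unusualness - min(unusualness)) / (max(unusualness) - min(unusualness))
max_edits <- 1L + as.integer(round(3 * scaled))
names(max_edits) <- names_vec
print(max_edits)
```

In this toy sample the name made of rare words receives the largest budget, while names built from "CAPITAL" and "BANK" receive the smallest. The resulting integer could then be passed as max.distance for the corresponding pattern.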

    You might also consider the fuzzyjoin package, which offers some quick ways to try different options. For example, you could try:

    df <- c(
      "HURST CAPITAL PARTNERS", "SOY CAPITAL BANK", "FIRST CAPITOL BANK OF VICTOR",
      "OSTERWEIS CAPITAL MANAGEMENT", "1ST NATIONAL BANK", "FIRST CAPITAL BANK",
      "SEATTLE 1ST NAT'L BANK", "FIELD POINT CAPITAL MANAGEMENT", "SUMMERSET CAPITAL MANAGEMENT",
      "AMERIQUEST CAPITAL ASSOC", "BB&T CAPITAL MARKETS", "HUGHES CAPITAL MANAGEMENT",
      "WELLS CAPITAL MANAGEMENT", "SUPERIOR ST CAPITAL ADVISORS", "ORMES CAPITAL MARKETS INC",
      "1ST NAT'L BANK OF IL", "ADVENT CAPITAL MANAGEMENT", "1ST CAPITOL BANK",
      "BIONDI REISS CAPITAL MANAGEMENT", "CCYBYS CAPITAL MARKETS", "SEACOAST CAPITAL PARTNERS",
      "DOUGLAS CAPITAL MANAGEMENT", "HIGHFIELDS CAPITAL MANAGEMENT", "PRECEPT CAPITAL MANAGEMENT LP",
      "AUGUST CAPITAL MANAGEMENT", "SAKSA CAPITAL MANAGEMENT", "IMS CAPITAL MANAGEMENT",
      "TRENT CAPITAL MANAGEMENT", "Ormes Capital Management", "GARNET CAPITAL MANAGEMENT LLC",
      "INTERFASE CAPITAL MANAGERS", "RJS CAPITAL MANAGEMENT INC", "1ST NATIONAL BANK OF DE KALB",
      "1ST NAT'L BANK OF PHILLIPS CO", "1ST NAT'L BANK OF OKLAHOMA", "PROGRESS CAPITAL MANAGEMENT INC",
      "CAPITAL BANK & TRUST", "1ST NATL BANK", "ASB Capital Management/Real Estate",
      "Sears Capital Management", "Osterweis Capital Management/Invest", "Cerberus Capital Management LP/Asse",
      "LVS Capital Management/President", "1st Central Bank/Banker", "Summit Capital Management",
      "Orwes Capital Markets/Stockbroker", "Ormes Capital Management/Investment", "Nevis Capital Management/Investment",
      "Duncan Hurst Capital Management", "Progress Capital Management/Preside", "Cerberus Capital Management LP",
      "Wit Capital/Banker", "Ormes Capital Markets Inc.", "Ormes Capital Markets/President & C",
      "Berents & Hess Capital Management", "Progress Capital Management/Venture", "First Capital Bank of KY",
      "Foothill Capital/Banker", "Pequot Capital Management/Equity Re", "First Dominion Capital/Banking",
      "Greenwhich Capital/Banker", "Veritas Capital Management/Banker", "Veritas Capital Management/Investme",
      "Lesese Capital Management/Investmen", "Douglas Capital Management/Investme", "FIRST NATINAL BANK OF AMARILLO",
      "NEVIS CAPITAL MANAGEMENT", "VERITAS CAPITAL MANAGEMENT", "SIEBERT CAPITAL MARKETS",
      "HOURGLASS CAPITAL MANAGEMENT", "1ST NATIONAL BANK DALHART", "TEXAS CAPITAL BANK",
      "NICHOLAS CAPITAL MANAGEMENT", "CERBUS CAPITAL MANAGEMENT", "CROESUS CAPITAL MANAGEMENT",
      "EAST WEST CAPITAL ASSOCIATES INC", "PRENDERGAST CAPITAL MANAGEMENT", "NANTUCKET CAPITAL MANAGEMENT",
      "1ST NATIONAL BANK TEMPLE", "ENTRUST CAPITAL INC", "1ST NATIONAL BANK OF IL",
      "SIMMS CAPITAL MANAGEMENT", "FIRST CAPITAL ADVISORS", "FIRST CAPITAL MANAGEMENT LTD",
      "1ST NATIONAL BANK & TRUST", "PENTECOST CAPITAL MANAGEMENT INC", "EAST-WEST CAPITAL ASSOCIATES",
      "1ST NAT'L BANK OF JOLIET", "FIRST CAPITOL BANK OF VICTO", "FIRST CAPITAL FINANCIAL",
      "PACIFIC COAST CAPITAL PARTNERS", "FIRST CAPITOL BANK", "FIRST CAPITAL ENGINEERING",
      "MIDWEST CAPITOL MANAGEMENT", "PEQUOT CAPITAL MANAGEMENT", "AGGOTT CAPITAL MANAGEMENT",
      "SIMMS CAPITAL MANAGEMENT INC", "PHILLIPS CAPITAL MANAGEMENT LLC", "1ST NATIONAL BANK OF COLD SP",
      "SOY CAPITOL BANK")
    
    library(dplyr); library(fuzzyjoin)
    # as_data_frame() is deprecated; build the one-column tibble explicitly
    df <- tibble(value = df)
    
    df %>%
      # Allowable methods include osa, lv, dl, hamming, lcs, qgram, 
      #    cosine, jaccard, jw, soundex
      fuzzyjoin::stringdist_inner_join(df, method = "lv", distance_col = "distance", max_dist = 4) %>%
      filter(distance > 0)
    
    Joining by: "value"
    # A tibble: 70 x 3
       value.x                      value.y                     distance
       <chr>                        <chr>                          <dbl>
     1 SOY CAPITAL BANK             1ST CAPITOL BANK                   4
     2 SOY CAPITAL BANK             SOY CAPITOL BANK                   1
     3 FIRST CAPITOL BANK OF VICTOR FIRST CAPITOL BANK OF VICTO        1
     4 1ST NATIONAL BANK            1ST NATL BANK                      4
     5 FIRST CAPITAL BANK           1ST CAPITOL BANK                   4
     6 FIRST CAPITAL BANK           FIRST CAPITOL BANK                 1
     7 HUGHES CAPITAL MANAGEMENT    DOUGLAS CAPITAL MANAGEMENT         4
     8 HUGHES CAPITAL MANAGEMENT    AUGUST CAPITAL MANAGEMENT          4
     9 WELLS CAPITAL MANAGEMENT     IMS CAPITAL MANAGEMENT             4
    10 WELLS CAPITAL MANAGEMENT     NEVIS CAPITAL MANAGEMENT           3
    

    ...to experiment with potential near-matches within your sample list.
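If fuzzyjoin is not available, a similar pairwise table can be sketched in base R with utils::adist(). This is only an analogue of the self-join above, not what fuzzyjoin does internally, and it uses a five-name subset of the list for brevity.

```r
names_vec <- c("SOY CAPITAL BANK", "SOY CAPITOL BANK",
               "FIRST CAPITAL BANK", "FIRST CAPITOL BANK", "1ST CAPITOL BANK")

# Full matrix of whole-string Levenshtein distances between all names.
d <- utils::adist(names_vec)

# Keep pairs that are close (at most 4 edits) but not identical.
idx <- which(d > 0 & d <= 4, arr.ind = TRUE)
pairs <- data.frame(value.x = names_vec[idx[, "row"]],
                    value.y = names_vec[idx[, "col"]],
                    distance = d[idx],
                    stringsAsFactors = FALSE)
pairs <- pairs[order(pairs$value.x, pairs$distance), ]
print(pairs)
```

As in the fuzzyjoin output, every pair appears twice, once in each direction. The full distance matrix grows quadratically with the number of names, so this is only practical for modest list sizes.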

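    【Comments】:

    • Thank you. If you don't mind, could you put your idea into an executable form? Unfortunately, it does not seem reproducible as posted.
    • Edited my answer to include the library calls and to convert the strings into a data frame before the fuzzy join.
    • Thanks. So, if I understand correctly, your approach takes the whole string data (e.g. "df") and searches within that same data to compute distances against itself? If so, is value.x the original data and value.y the suggested matching strings it found? Also, does it ignore upper and lower case, which I need?
    • Yes. value.y holds the other possible matches in the list that are not identical to value.x (because the last line filters those out) but fall within the criteria of the fuzzy join. If you want to ignore case, you could compare upper-case versions of the names, e.g. df <- data.frame(name = "Hurst Capital Partners"); df$upper <- toupper(df$name)
    • In this case it is saying there is a distance of 4 to SOY CAPITAL BANK and a distance of 4 to FIRST CAPITAL BANK. Each row is a pairwise relationship between two names.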