【Question title】: Fuzzy matching (not row-to-row) in R
【Posted】: 2018-06-07 20:47:02
【Question】:

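I need to do fuzzy matching along the following lines: table A contains strings with addresses (already pre-formatted, e.g. with spaces removed), and I have to verify that they are correct. Table B contains all possible addresses in the same format as table A. So I don't want to match row 1 of table A against row 1 of table B and so on; instead I want to compare every row of table A against the whole of table B and find the closest match for each row.

From what I have checked, adist and agrep work row by row by default, and when I try to use them I immediately get an out-of-memory message. Is this even feasible in R with only 8 GB of RAM?

I found sample code for a similar problem and based my solution on it, but performance is still a problem. It works on a sample of 600 rows in table A and 2000 rows in table B, but the full datasets have 600000 and 900000 rows respectively.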

adresy_odl <- adist(TableA$Adres, TableB$Adres, partial=FALSE, ignore.case = TRUE)
min_odl<-apply(adresy_odl, 1, min)

match.s1.s2<-NULL  
for(i in 1:nrow(adresy_odl))
{
  s2.i<-match(min_odl[i],adresy_odl[i,])
  s1.i<-i
  match.s1.s2<-rbind(data.frame(s2.i=s2.i,s1.i=s1.i,s2name=TableB[s2.i,]$Adres, s1name=TableA[s1.i,]$Adres, adist=min_odl[i]),match.s1.s2)
}

The memory error already occurs on the first line (the adist call):

Error: cannot allocate vector of size 1897.0 Gb
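
A rough back-of-the-envelope check shows why a full distance matrix cannot fit in 8 GB. The sketch below uses the rounded row counts from the question and assumes either 4 or 8 bytes per matrix cell; the exact figure in the error message depends on the true row counts and on how the matrix is stored, so only the order of magnitude is meaningful here.

```r
# Rounded row counts taken from the question.
n_a <- 600000
n_b <- 900000

# One distance per (row of A, row of B) pair.
cells <- n_a * n_b

# Assumption: each cell takes 4 bytes (integer) or 8 bytes (double).
bytes_per_cell <- c(integer = 4, double = 8)
gib <- cells * bytes_per_cell / 2^30

print(cells)
print(round(gib))
```

Either way the full matrix is in the terabyte range, so any workable approach has to avoid materialising it in one piece.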

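Below is a sample of the data I use (CSV). tableA and tableB look exactly the same; the only difference is that tableB contains all possible combinations of Zipcode, Street and City, whereas in tableA most rows have a wrong zip code or a misspelled street.

Table A: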

"","Zipcode","Street","City","Adres"
"33854","80-221","Traugutta","Gdańsk","80-221TrauguttaGdańsk"
"157093","80-276","KsBernardaSychty","Gdańsk","80-276KsBernardaSychtyGdańsk"
"200115","80-339","Grunwaldzka","Gdańsk","80-339GrunwaldzkaGdańsk"
"344514","80-318","Wąsowicza","Gdańsk","80-318WąsowiczaGdańsk"
"355415","80-625","Stryjewskiego","Gdańsk","80-625StryjewskiegoGdańsk"
"356414","80-452","Kilińskiego","Gdańsk","80-452KilińskiegoGdańsk"

Table B:

"","Zipcode","Street","City","Adres"
"47204","80-180","11Listopada","Gdańsk","80-18011ListopadaGdańsk"
"47205","80-041","3BrygadySzczerbca","Gdańsk","80-0413BrygadySzczerbcaGdańsk"
"47206","80-802","3Maja","Gdańsk","80-8023MajaGdańsk"
"47207","80-299","Achillesa","Gdańsk","80-299AchillesaGdańsk"
"47208","80-316","AdamaAsnyka","Gdańsk","80-316AdamaAsnykaGdańsk"
"47209","80-405","AdamaMickiewicza","Gdańsk","80-405AdamaMickiewiczaGdańsk"
"47210","80-425","AdamaMickiewicza","Gdańsk","80-425AdamaMickiewiczaGdańsk"
"47211","80-456","AdolfaDygasińskiego","Gdańsk","80-456AdolfaDygasińskiegoGdańsk"

The first few rows of the result of my code:

"","s2.i","s1.i","s2name","s1name","adist"
"1",1333,614,"80-152PowstańcówWarszawskichGdańsk","80-158PowstańcówWarszawskichGdańsk",1
"2",257,613,"80-180CzerskaGdańsk","80-180ZEUSAGdańsk",3
"3",1916,612,"80-119WojskiegoGdańsk","80-355BeniowskiegoGdańsk",8
"4",1916,611,"80-119WojskiegoGdańsk","80-180PorębskiegoGdańsk",6
"5",181,610,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
"6",181,609,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
"7",21,608,"80-401alGenJózefaHalleraGdańsk","80-401GenJózefaHalleraGdańsk",2
"8",1431,607,"80-264RomanaDmowskiegoGdańsk","80-264DmowskiegoGdańsk",6
"9",1610,606,"80-239StefanaCzarnieckiegoGdańsk","80-239StefanaCzarnieckiegoGdańsk",0

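【Comments】:

  • Yes, you can use the apply function to vectorise either of the functions you mention over the whole data frame. Can you share any code with us here?
  • There is no way to know unless you post some data, we suggest some code, and you then try it on the 8 GB machine.
  • Added my code to the opening post. The memory error already occurs on the first line (computing adist).
  • Added a sample (tables exported from R to CSV; sorry that it is not as readable as a regular table, but I think it is simple enough).
  • Or another idea: instead of trying to compute the whole huge matrix at once, compute it row by row, immediately keep only the minimum and discard the rest, which means we end up with a 1x600k result instead of a 900kx600k matrix. That sounds like the opposite of vectorisation, and from my limited experience with R it would probably hurt performance.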
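
The row-by-row idea in the last comment can be sketched in chunks rather than single rows, which keeps some of the vectorisation. This is a minimal sketch, not the asker's or answerer's code: `nearest_match` and `chunk_size` are hypothetical names, and it assumes both inputs are character vectors without NA values. Only a `chunk_size` x `length(dict)` matrix exists at any one time.

```r
# Hypothetical helper: for each string in `query`, find the closest string in
# `dict` by edit distance, processing `query` in chunks to bound memory use.
nearest_match <- function(query, dict, chunk_size = 100L) {
  # Split the query indices into consecutive chunks.
  chunks <- split(seq_along(query), ceiling(seq_along(query) / chunk_size))
  pieces <- lapply(chunks, function(idx) {
    # Distance matrix for this chunk only: length(idx) rows, length(dict) columns.
    d <- adist(query[idx], dict, ignore.case = TRUE)
    # Column of the first minimum in each row (same tie rule as match()).
    best <- apply(d, 1L, which.min)
    data.frame(
      s1.i = idx,
      s2.i = best,
      s1name = query[idx],
      s2name = dict[best],
      adist = d[cbind(seq_along(idx), best)],
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, pieces)
}

# Usage on made-up ASCII addresses.
res <- nearest_match(
  c("80-221TraugutaGdansk", "80-339GrunwaldzkaGdansk"),
  c("80-221TrauguttaGdansk", "80-339GrunwaldzkaGdansk"),
  chunk_size = 1L
)
print(res)
```

This bounds memory but not running time: the number of pairwise distance computations is unchanged, so the full 600000 x 900000 problem may still be slow.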

Tags: r string-matching fuzzy-search


【Solution 1】:

I would try the excellent fuzzyjoin package by StackOverflow's @drob.

library(dplyr)

dict_df <- tibble::tribble(
     ~ID,~Zipcode,~Street,~City,~Adres,
"33854","80-221","Traugutta","Gdańsk","80-221TrauguttaGdańsk",
"157093","80-276","KsBernardaSychty","Gdańsk","80-276KsBernardaSychtyGdańsk",
"200115","80-339","Grunwaldzka","Gdańsk","80-339GrunwaldzkaGdańsk",
"344514","80-318","Wąsowicza","Gdańsk","80-318WąsowiczaGdańsk",
"355415","80-625","Stryjewskiego","Gdańsk","80-625StryjewskiegoGdańsk",
"356414","80-452","Kilińskiego","Gdańsk","80-452KilińskiegoGdańsk") %>% 
  select(ID, Adres)

noise_df <- tibble::tribble(
  ~Zipcode,~Street,~City,~Adres,
  "80-221","Trauguta","Gdansk","80-221TraugutaGdansk",
  "80-211","Traugguta","Gdansk","80-211TrauggutaGdansk",
  "80-276","KsBernardaSychty","Gdańsk","80-276KsBernardaSychtyGdańsk",
  "80-267","KsBernardaSyschty","Gdańsk","80-276KsBernardaSyschtyGdańsk",
  "80-339","Grunwaldzka","Gdańsk","80-339GrunwaldzkaGdańsk",
  "80-399","Grunwaldzka","dansk","80-399Grunwaldzkadańsk",
  "80-318","Wasowicza","Gdańsk","80-318WasowiczaGdańsk",
  "80-625","Stryjewskiego","Gdańsk","80-625StryjewskiegoGdańsk",
  "80-625","Stryewskogo","Gdansk","80-625StryewskogoGdansk",
  "80-452","Kilinskiego","Gdańsk","80-452KilinskiegoGdańsk")

library(fuzzyjoin)

noise_df %>% 
  # using jaccard with max_dist=0.5. Try other distance methods with different max_dist to save memory use
  stringdist_left_join(dict_df, by="Adres", distance_col="dist", method="jaccard", max_dist=0.5) %>%
  select(Adres.x, ID, Adres.y, dist) %>% 
  group_by(Adres.x) %>% 
  # select best fit record
  top_n(-1, dist)

The result table consists of the original address (Adres.x), the best match from the dictionary (ID, Adres.y), and the string distance.

# A tibble: 10 x 4
# Groups:   Adres.x [10]
                         Adres.x     ID                      Adres.y       dist
                           <chr>  <chr>                        <chr>      <dbl>
 1          80-221TraugutaGdansk  33854        80-221TrauguttaGdańsk 0.11764706
 2         80-211TrauggutaGdansk  33854        80-221TrauguttaGdańsk 0.11764706
 3  80-276KsBernardaSychtyGdańsk 157093 80-276KsBernardaSychtyGdańsk 0.00000000
 4 80-276KsBernardaSyschtyGdańsk 157093 80-276KsBernardaSychtyGdańsk 0.00000000
 5       80-339GrunwaldzkaGdańsk 200115      80-339GrunwaldzkaGdańsk 0.00000000
 6        80-399Grunwaldzkadańsk 200115      80-339GrunwaldzkaGdańsk 0.00000000
 7         80-318WasowiczaGdańsk 344514        80-318WąsowiczaGdańsk 0.05555556
 8     80-625StryjewskiegoGdańsk 355415    80-625StryjewskiegoGdańsk 0.00000000
 9       80-625StryewskogoGdansk 355415    80-625StryjewskiegoGdańsk 0.17391304
10       80-452KilinskiegoGdańsk 356414      80-452KilińskiegoGdańsk 0.05263158

I find that fuzzy matching works best when everything is converted to lowercase ASCII (iconv(), tolower()).
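
A minimal sketch of that normalisation step, assuming the input is UTF-8. `normalize_address` is a hypothetical helper name. The `//TRANSLIT` suffix asks iconv to transliterate characters such as ń or ą, but the result is platform-dependent (some systems produce "?" or extra characters), so it is worth checking the output on a few real addresses first.

```r
# Hypothetical helper: transliterate to ASCII, then lower-case.
normalize_address <- function(x) {
  # Assumption: `x` is UTF-8; transliteration results vary by platform.
  ascii <- iconv(x, from = "UTF-8", to = "ASCII//TRANSLIT")
  tolower(ascii)
}

# Pure-ASCII input is only lower-cased.
normalized <- normalize_address("80-221TrauguttaGDANSK")
print(normalized)
```

Applying the same function to both tables before joining keeps the two Adres columns comparable.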

UPDATE: this might have a smaller memory footprint:

library(purrr)
library(dplyr)
noise_df %>% split(.$Adres) %>% 
  # using jaccard with max_dist=0.5. Try other distance methods with different max_dist to save memory use
  map_df(~stringdist_left_join(.x, dict_df, by="Adres", distance_col="dist", method="jaccard", max_dist=0.5, ignore_case = TRUE) %>%
          select(Adres.x, ID, Adres.y, dist) %>% 
          group_by(Adres.x) %>% 
          # select best fit record
          top_n(-1, dist))

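UPDATE2: When using the "lv" distance algorithm you get too many missing values and NAs. In some cases, when no match is found, stringdist_join drops the distance column you asked it to create. That is why the rest of the pipe fails, first at select and later at top_n. To see what is going on, take a small sample of your data, change map_df to map, and explore the resulting list.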

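【Comments】:

  • Thanks, this function looks really interesting. I am having trouble understanding why it returns more rows than I want. My sample has 614 rows, and my original code returns exactly that many, but this version (adapted for my use) returns 1196 rows: test <- tableA %>% stringdist_left_join(tableB1, by="Adres", distance_col="dist", method="lv", max_dist=99, ignore_case = TRUE) %>% select(Adres.x, Adres.y, dist) %>% group_by(Adres.x) %>% top_n(-1, dist) Does it matter that I removed ID (my original tables do not have it)?
  • Are there duplicates in Komb_ulic1$Adres after converting to lowercase?
  • No, and so far I have not run into any case problems, as long as I tell the function to ignore case. For example, it correctly returns distance 0 where the only difference is upper/lower case. I have now also added an ID to the original addresses, since it may be useful later.
  • It seems that some addresses appear more than once in the original table B, so fixing this should be easy.
  • After removing those duplicates I found that it returns extra rows when two matches are found at the same minimum distance (only a problem with larger max_dist values). Also, although it is better optimised than the previous approach, there is still a memory problem: previously it needed to allocate 1897.0 Gb, this one "only" 326.2 Gb, regardless of the max_dist setting. Any idea how to optimise it further?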
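
On the extra rows caused by ties: top_n keeps every row that shares the minimum distance. One way to keep exactly one candidate per address is dplyr's slice_min with with_ties = FALSE (available from dplyr 1.0.0). The sketch below runs on a small made-up join result, not on the thread's data, and it simply keeps the first of the tied rows, which may or may not be the right choice for real addresses.

```r
library(dplyr)

# Made-up join output: the first address has two candidates at the same distance.
joined <- data.frame(
  Adres.x = c("80-210SniadeckichGdansk", "80-210SniadeckichGdansk",
              "80-264DmowskiegoGdansk"),
  Adres.y = c("80-204BraciSniadeckichGdansk", "80-211JanaSniadeckichGdansk",
              "80-264RomanaDmowskiegoGdansk"),
  dist = c(7, 7, 6),
  stringsAsFactors = FALSE
)

# Keep a single best candidate per original address, breaking ties by row order.
best <- joined %>%
  group_by(Adres.x) %>%
  slice_min(dist, n = 1, with_ties = FALSE) %>%
  ungroup()

print(best)
```

This only addresses the row count; it does not reduce the memory needed by the join itself.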