【Posted】: 2018-06-07 20:47:02
【Question】:
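I need to do fuzzy matching in the following way: table A contains strings with addresses (which I have already pre-formatted, e.g. by removing spaces), and I have to verify that they are correct. Table B contains all possible addresses (in the same format as table A). So I don't want to match row 1 of table A against row 1 of table B and so on; instead I want to compare each row of table A against the whole of table B and find the closest match for each row.
From what I have checked, adist and agrep work row by row by default, and when I try to use them I immediately get an out-of-memory message. Is this even possible in R with only 8 GB of RAM?
I found sample code for a similar problem and based my solution on it, but performance is still an issue. It works on a sample of 600 rows from table A and 2000 rows from table B, but the full datasets have 600000 and 900000 rows respectively.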
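# Full distance matrix: one row per address in TableA, one column per address in TableB
adresy_odl <- adist(TableA$Adres, TableB$Adres, partial = FALSE, ignore.case = TRUE)
# Smallest distance in each row
min_odl <- apply(adresy_odl, 1, min)
match.s1.s2 <- NULL
for (i in seq_len(nrow(adresy_odl))) {
  # Index of the first TableB address at the minimum distance
  s2.i <- match(min_odl[i], adresy_odl[i, ])
  s1.i <- i
  # Prepend the new row, so the result ends up in reverse order of TableA
  match.s1.s2 <- rbind(data.frame(s2.i = s2.i, s1.i = s1.i, s2name = TableB[s2.i, ]$Adres, s1name = TableA[s1.i, ]$Adres, adist = min_odl[i]), match.s1.s2)
}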
The memory error already occurs on the first line (the adist call):
Error: cannot allocate vector of size 1897.0 Gb
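As a rough check on why this allocation fails: a dense distance matrix needs one cell for every pair of rows. The sketch below uses the approximate row counts from the question and assumes either 4 or 8 bytes per cell. Both the counts and the cell size are assumptions, so it shows the order of magnitude only and is not expected to reproduce the exact 1897.0 Gb in the error.

```r
# Rough size estimate for a dense pairwise distance matrix.
# Row counts are the approximate figures quoted in the question.
n_a <- 600000   # rows in table A (approximate)
n_b <- 900000   # rows in table B (approximate)

cells <- n_a * n_b                       # stored as double, so no integer overflow
to_gib <- function(bytes) bytes / 1024^3 # bytes -> GiB

size_4 <- to_gib(cells * 4)  # assumption: 4 bytes per cell
size_8 <- to_gib(cells * 8)  # assumption: 8 bytes per cell
print(c(four_bytes = size_4, eight_bytes = size_8))
```

Under either assumption the matrix is in the terabyte range, far beyond 8 GB of RAM, so the full matrix cannot be held in memory at once.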
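Below is a sample of the data I use (CSV). tableA and tableB look exactly the same; the only difference is that tableB contains all possible combinations of Zipcode, Street and City, whereas in tableA most entries have a wrong zip code or a misspelled street name.
Table A: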
"","Zipcode","Street","City","Adres"
"33854","80-221","Traugutta","Gdańsk","80-221TrauguttaGdańsk"
"157093","80-276","KsBernardaSychty","Gdańsk","80-276KsBernardaSychtyGdańsk"
"200115","80-339","Grunwaldzka","Gdańsk","80-339GrunwaldzkaGdańsk"
"344514","80-318","Wąsowicza","Gdańsk","80-318WąsowiczaGdańsk"
"355415","80-625","Stryjewskiego","Gdańsk","80-625StryjewskiegoGdańsk"
"356414","80-452","Kilińskiego","Gdańsk","80-452KilińskiegoGdańsk"
Table B:
"","Zipcode","Street","City","Adres"
"47204","80-180","11Listopada","Gdańsk","80-18011ListopadaGdańsk"
"47205","80-041","3BrygadySzczerbca","Gdańsk","80-0413BrygadySzczerbcaGdańsk"
"47206","80-802","3Maja","Gdańsk","80-8023MajaGdańsk"
"47207","80-299","Achillesa","Gdańsk","80-299AchillesaGdańsk"
"47208","80-316","AdamaAsnyka","Gdańsk","80-316AdamaAsnykaGdańsk"
"47209","80-405","AdamaMickiewicza","Gdańsk","80-405AdamaMickiewiczaGdańsk"
"47210","80-425","AdamaMickiewicza","Gdańsk","80-425AdamaMickiewiczaGdańsk"
"47211","80-456","AdolfaDygasińskiego","Gdańsk","80-456AdolfaDygasińskiegoGdańsk"
The first few rows of the output of my code:
"","s2.i","s1.i","s2name","s1name","adist"
"1",1333,614,"80-152PowstańcówWarszawskichGdańsk","80-158PowstańcówWarszawskichGdańsk",1
"2",257,613,"80-180CzerskaGdańsk","80-180ZEUSAGdańsk",3
"3",1916,612,"80-119WojskiegoGdańsk","80-355BeniowskiegoGdańsk",8
"4",1916,611,"80-119WojskiegoGdańsk","80-180PorębskiegoGdańsk",6
"5",181,610,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
"6",181,609,"80-204BraciŚniadeckichGdańsk","80-210ŚniadeckichGdańsk",7
"7",21,608,"80-401alGenJózefaHalleraGdańsk","80-401GenJózefaHalleraGdańsk",2
"8",1431,607,"80-264RomanaDmowskiegoGdańsk","80-264DmowskiegoGdańsk",6
"9",1610,606,"80-239StefanaCzarnieckiegoGdańsk","80-239StefanaCzarnieckiegoGdańsk",0
【Comments】:
-
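Yes, you can use the apply function to vectorize either of the functions you mentioned across the whole data frame. Can you share any code with us here?
-
Unless you post some data, we suggest some code, and you then try it on an 8 GB machine, there is no way to know.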
-
Added my code to the original post. The memory error already occurs on the first line (computing adist).
-
Added a sample (tables exported from R to CSV; sorry that it is not as readable as a regular table, but I think it is simple enough).
-
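Or another idea: instead of trying to compute the whole huge matrix at once, compute it row by row and immediately keep only the minimum and discard the rest, which means we end up with a 1x600k result instead of a 900k x 600k matrix. This sounds like the opposite of vectorization and, from my not very extensive experience with R, it may hurt the performance of the code.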
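A minimal sketch of the row-by-row idea, processed in small chunks rather than single rows so that adist still works on a block at a time. The helper name nearest_match and the chunk_size parameter are hypothetical, not part of any package; the sketch also assumes there are no NA addresses. It only bounds memory use: every address in A is still compared against every address in B, so the run time on the full data may remain very long.

```r
# Hypothetical helper: for each string in `a`, find the closest string in `b`.
# Only a chunk_size x length(b) distance matrix exists at any one time.
nearest_match <- function(a, b, chunk_size = 100L) {
  n <- length(a)
  best_idx  <- integer(n)   # index into `b` of the closest match
  best_dist <- numeric(n)   # edit distance to that match

  # Split the row indices of `a` into consecutive chunks
  chunks <- split(seq_len(n), ceiling(seq_len(n) / chunk_size))
  for (rows in chunks) {
    d <- adist(a[rows], b, partial = FALSE, ignore.case = TRUE)
    idx <- apply(d, 1, which.min)          # first minimum in each row
    best_idx[rows]  <- idx
    best_dist[rows] <- d[cbind(seq_along(rows), idx)]
    # `d` is overwritten on the next iteration, so the full matrix is never kept
  }

  # Build the result once instead of growing it with rbind in a loop
  data.frame(s1.i = seq_len(n), s2.i = best_idx,
             s1name = a, s2name = b[best_idx], adist = best_dist,
             stringsAsFactors = FALSE)
}

# Usage on a few addresses taken from the sample output in the question
a <- c("80-158PowstańcówWarszawskichGdańsk", "80-401GenJózefaHalleraGdańsk")
b <- c("80-180CzerskaGdańsk", "80-152PowstańcówWarszawskichGdańsk",
       "80-401alGenJózefaHalleraGdańsk")
res <- nearest_match(a, b, chunk_size = 1L)
print(res)
```

With chunk_size = 100 and about 900000 candidate addresses, each block is around 0.7 GB if a cell takes 8 bytes (an assumption), so the chunk size would need tuning to the available RAM.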
Tags: r string-matching fuzzy-search