R fuzzy string match to return specific column based on matched string

【Question title】: R fuzzy string match to return specific column based on matched string
【Posted】: 2017-08-02 15:09:48
【Question】:

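I have two large datasets, one with around a million records and the other with around 70K. Both contain addresses. I want to check whether any address in the smaller dataset is present in the larger one. As you can imagine, an address can be written in different ways, with different case, spelling, and so on. On top of that, an address can be duplicated if it is written only down to the building level, so different flats share the same address. I did some research and found the package stringdist, which can be used for this.

I did some work and managed to get the closest match based on distance. However, I am not able to return the corresponding columns for the row whose address matched.

Below is some dummy sample data, along with the code I wrote to illustrate the situation: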

library(stringdist)
Address1 <- c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr")
Year1 <- c(2001:2007)

Address2 <- c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR")
Year2 <- c(2001:2010)

df1 <- data.table(Address1,Year1)
df2 <- data.table(Address2,Year2)
df2[,unique_id := sprintf("%06d", 1:nrow(df2))]

fn_match = function(str, strVec, n){
  strVec[amatch(str, strVec, method = "dl", maxDist=n,useBytes = T)]
}

df1[!is.na(Address1)
    , address_match := 
      fn_match(Address1, df2$Address2,3)
    ]

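This returns the closest string match within a distance of 3. However, I would also like to have the "Year" and "unique_id" columns from df2 in df1. That would tell me which row of df2 the string was matched with. So in the end, for each row in df1, I want to know the closest match from df2 within the specified distance, together with the "Year" and "unique_id" of the matched row in df2.

I guess this has something to do with merge (left join), but I am not sure how to merge so that duplicates are kept and the result has the same number of rows as df1 (the small dataset).

Any kind of solution would help!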

【Comments】:

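Not at my computer right now, but see ?which.min to wrap stringdist() from your previous question. Also consider how you want to handle ties.
@C8H10N4O2, thank you for the suggestion. Yes, which.min helps to find the minimum, but in this case I want a few corresponding columns from the row of the matched string. Since there are duplicate addresses in the large dataset, I would like to have unique_id to tell the matched rows apart; I can then merge in other required columns from the large dataset based on unique_id.
@C8H10N4O2, I really hope you can suggest a solution for this. Even returning the row number of the matched string in the large dataset would help, since I could then merge the required columns based on the row number.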
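
The which.min idea from the comments can be sketched in base R. This is only an illustration under stated assumptions: it swaps in adist() from the utils package (plain Levenshtein distance) in place of stringdist(), the toy vectors x and table are hypothetical, and max_dist is an arbitrary threshold. It shows that which.min() resolves ties by taking the first minimum, and that the row position is what you keep for later lookups.

```r
# Toy data (hypothetical), loosely modelled on the addresses in the question
x     <- c("786, GALI NO 5, XYZ", "45-B, GALI NO5, XYZ")
table <- c("abc, pqr, xyz", "786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ",
           "786, GALI NO 4 XYZ")

# Distance matrix: one row per element of x, one column per element of table.
# adist() is base R (utils) and is used here instead of stringdist().
d <- adist(x, table)

# which.min() returns the FIRST minimum, so ties go to the earliest row
pos  <- as.integer(apply(d, 1, which.min))
dist <- d[cbind(seq_along(x), pos)]

# Treat anything further away than the chosen threshold as "no match"
max_dist <- 3
pos[dist > max_dist] <- NA

matches <- data.frame(x = x, table_row = pos, match = table[pos], dist = dist)
print(matches)
```

Keeping table_row rather than the matched string is what makes it possible to pull any other column from the same row afterwards.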

Tags: r merge data.table string-matching stringdist

【Solution 1】:

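You're 90% of the way there...

You say you want to

"know which row of data in df2 the string was matched with"

You just need to understand the code you already have. See ?amatch:

"amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned."

In other words, amatch gives you the index of the row in df2 (your table) that is the closest match for each address in df1 (your x). You are discarding this index too early by using it immediately to return the matched address.

Instead, retrieve either the index itself for a lookup, or the unique_id (if you are truly confident that it is a unique id) for a left join.

An illustration of both approaches: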

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist = n, useBytes = TRUE) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) & nchar(Address1) > 0, # I would exclude not only NA_character_ but also the empty string, perhaps strings of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2, 3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) & nchar(Address1) > 0,
    df2_pos := match_pos(Address1, df2$Address2, 3) ]
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')) := df2[df2_pos, .(Address2, Year2, unique_id)] ][]
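
One thing to keep in mind with the key-based option: merge() sorts the result by the key column by default, so the rows may not come back in the original order of df1. If the order matters, a lookup with match() is one alternative. The following is a minimal base R sketch with hypothetical toy data frames (small, large); it is not part of the answer's code, only an illustration of a lookup that keeps the row count and order of the smaller table.

```r
# Hypothetical toy data; the column names mirror the example above
small <- data.frame(Address1  = c("a", "b", "c", "d"),
                    unique_id = c("000003", NA, "000001", "000003"),
                    stringsAsFactors = FALSE)
large <- data.frame(unique_id = c("000001", "000002", "000003"),
                    Year2     = c(2001L, 2002L, 2003L),
                    stringsAsFactors = FALSE)

# For each row of `small`, find the row of `large` that holds the same key.
# Unmatched keys (here the NA) give NA, so no row of `small` is dropped.
pos <- match(small$unique_id, large$unique_id)

# Pull the wanted column by position; order and row count of `small` are kept
small$Year2 <- large$Year2[pos]
print(small)
```

Two rows of small pointing at the same unique_id is fine here; each simply receives a copy of the same Year2.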

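【Comments】:

Thank you very much for the solution and the explanation. Really helpful!! Thanks again.
@user1412 You're welcome. Also, if you need to check uniqueness, see ?duplicated, e.g. !anyDuplicated(...)
Thanks for your help! I was also exploring stringdistmatrix to build a matrix and then take the minimum distance. I have implemented it and the code is working; I created a function for it. But now I need to do the matching based on the area of the respective district, so I want to add one more feature to the existing function. I managed to write a function, but I am finding it difficult... still a lot to learn... I have posted this as a question: stackoverflow.com/questions/42793833/… Please help!!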
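
The uniqueness check mentioned in the comments can be sketched as follows. The vectors are a shortened, hypothetical version of the example data. anyDuplicated() returns 0 when there are no duplicates (otherwise the index of the first one), so negating it gives a simple TRUE/FALSE test.

```r
# Shortened, hypothetical version of the example data
Address2  <- c("abc, pqr, xyz", "786, GALI NO 4 XYZ", "abc, pqr, xyz")
unique_id <- sprintf("%06d", seq_along(Address2))

# anyDuplicated() is 0 when every value is distinct, so !0 is TRUE
ids_ok  <- !anyDuplicated(unique_id)
addr_ok <- !anyDuplicated(Address2)

# duplicated() flags the second and later occurrences of a value
dup_rows <- which(duplicated(Address2))

# Guard a join on the key: stop early if the key is not unique
stopifnot(ids_ok)
```

Here the addresses repeat but the ids do not, which is the situation in which joining on unique_id rather than on the address is safe.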

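【Solution 2】:

Here is a solution using the fuzzyjoin package. It uses dplyr-like syntax and offers stringdist as one of the possible types of fuzzy matching.

You can use the stringdist method = "dl" (or another method that may work better).

To meet your requirement to "make sure I have the same number of rows as in df1", I used a large max_dist and then used dplyr::group_by and dplyr::top_n to keep only the best match with the minimum distance. This was suggested by dgrtwo, the developer of fuzzyjoin. (Hopefully it will become part of the package itself in the future.)

(I also had to assume that, in case of a tie in distance, the row with the largest Year2 should be taken.)

Code: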

library(data.table, quietly = TRUE)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) 
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)]

library(fuzzyjoin, quietly = TRUE); library(dplyr, quietly = TRUE)
stringdist_join(df1, df2, 
                by = c("Address1" = "Address2"), 
                mode = "left", 
                method = "dl", 
                max_dist = 99, 
                distance_col = "dist") %>%
  group_by(Address1, Year1) %>%
  top_n(1, -dist) %>%
  top_n(1, Year2)

Result:

# A tibble: 7 x 6
# Groups:   Address1, Year1 [7]
                                Address1 Year1                             Address2 Year2 unique_id  dist
                                   <chr> <int>                                <chr> <int>     <chr> <dbl>
1                    786, GALI NO 5, XYZ  2001                   786, GALI NO 4 XYZ  2007    000007     2
2       rambo, 45, strret 4, atlast, pqr  2002 del, 546, strret2, towards east, pqr  2009    000009    17
3 23/4, 23RD FLOOR, STREET 2, ABC-E, PQR  2003                  23/4, STREET 2, PQR  2010    000010    19
4                    45-B, GALI NO5, XYZ  2004                  45B, GALI NO 5, XYZ  2008    000008     2
5                 HECTIC, 99 STREET, PQR  2005                  23/4, STREET 2, PQR  2010    000010    11
6                    786, GALI NO 5, XYZ  2006                   786, GALI NO 4 XYZ  2007    000007     2
7       rambo, 45, strret 4, atlast, pqr  2007 del, 546, strret2, towards east, pqr  2009    000009    17
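
The tie-breaking rule used above (smallest distance first, then the largest Year2) can also be sketched without fuzzyjoin. This is a base R illustration only: it uses adist() from utils (plain Levenshtein) rather than the "dl" method, and the shortened data and the names best_row and result are hypothetical.

```r
# Shortened, hypothetical data; the column names follow the example above
Address1 <- c("786, GALI NO 5, XYZ", "45-B, GALI NO5, XYZ")
df2 <- data.frame(
  Address2 = c("786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ",
               "786, GALI NO 4 XYZ", "45B, GALI NO 5, XYZ"),
  Year2    = c(2002L, 2003L, 2007L, 2008L),
  stringsAsFactors = FALSE
)

# One row per Address1, one column per row of df2
d <- adist(Address1, df2$Address2)

best_row <- vapply(seq_along(Address1), function(i) {
  tied <- which(d[i, ] == min(d[i, ]))   # every df2 row at the minimum distance
  tied[which.max(df2$Year2[tied])]       # among those, the one with the largest Year2
}, integer(1))

# Exactly one df2 row per Address1, so the row count of the left side is kept
result <- data.frame(Address1,
                     df2[best_row, ],
                     dist = d[cbind(seq_along(Address1), best_row)],
                     row.names = NULL)
print(result)
```

Because every left-hand row picks exactly one right-hand row, the result always has as many rows as Address1, which is the requirement the answer addresses with group_by and top_n.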
