How do you perform fuzzy string matching in R: answers

【Question title】: How do you perform fuzzy string matching in R
【Posted】: 2015-01-25 14:05:05
【Question description】:

I have two data frames with multiple columns. Below is a shortened version of each, containing the columns relevant to the question.

str(DF1)

'data.frame':   10 obs. of  6 variables:
 $ itemid      : int  1 1 1 1 1 1 1 1 1 1
 $ userid      : int  650 635 1 514 250 210 5 72 77 252
 $ rating      : int  3 4 5 5 4 5 4 4 5 5
 $ time        : Date, format: "1998-03-31" "1997-11-07" "1997-09-22" ...
 $ title       : chr  "Toy Story " "Toy Story " "Toy Story " "Toy Story " ...
 $ release_date: chr  "1995" "1995" "1995" "1995" ...

DF1

 itemid userid rating       time      title release_date
1       1    650      3 1998-03-31 Toy Story          1995
2       1    635      4 1997-11-07 Toy Story          1995
3       1      1      5 1997-09-22 Toy Story          1995
4       1    514      5 1997-09-26 Toy Story          1995
5       1    250      4 1997-12-27 Toy Story          1995
6       1    210      5 1998-02-17 Toy Story          1995
7       1      5      4 1997-09-30 Toy Story          1995
8       1     72      4 1997-11-20 Toy Story          1995
9       1     77      5 1998-01-13 Toy Story          1995
10      1    252      5 1998-04-01 Toy Story          1995

str(DF2)

'data.frame':   10 obs. of  6 variables:
 $ itemid      : int  2844 4936 4936 4972 5078 6684 6689 7264 7264 7880
 $ userid      : int  4477 8871 11628 16885 11628 4222 4222 2092 5943 11628
 $ rating      : int  6 8 5 8 4 6 6 8 6 7
 $ time        : Date, format: "2013-03-09" "2013-05-05" "2013-07-06" ...
 $ title       : chr  "Fantômas - À l'ombre de la guillotine " "The Bank " "The Bank " "The Birth of a Nation " ...
 $ release_date: chr  "1913" "1915" "1915" "1915" ...

DF2

 itemid userid rating       time                                    title release_date
1    2844   4477      6 2013-03-09   Fantômas - À l'ombre de la guillotine          1913
2    4936   8871      8 2013-05-05                                The Bank          1915
3    4936  11628      5 2013-07-06                                The Bank          1915
4    4972  16885      8 2013-08-19                   The Birth of a Nation          1915
5    5078  11628      4 2013-08-23                               The Cheat          1915
6    6684   4222      6 2013-08-24                             The Fireman          1916
7    6689   4222      6 2013-08-24                         The Floorwalker          1916
8    7264   2092      8 2013-03-17                                The Rink          1916
9    7264   5943      6 2013-05-12                                The Rink          1916
10   7880  11628      7 2013-07-19                             Easy Street          1917

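I want to match the titles across the two datasets using fuzzy string matching with the Levenshtein distance metric, and I also want to confirm that matched titles agree on "release_date". Is there a better way to do this without using a loop? I tried a for loop with "agrep", but ran out of memory. The output should be a data frame containing only the matched movies.

The original data frames have more than 100K rows.

Thanks.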

【Question comments】:

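Can you show your code? There are also other packages in R that use Levenshtein distance; have you tried any of them?
I looked at compare.linkage but could not understand the output it produces, and I tried 'agrep' but only with a single string value. As I understand it, compare.linkage looks like the better option because it compares two datasets across multiple columns, but I have no idea how to interpret the output. This is my first time doing this, so I need help and guidance from experts like you so that I can learn and do it on my own in the future. I also tried compare.linkage, but got this error: "Error: cannot allocate vector of size 43.7 Gb"
Take a look at help(agrep)
"pattern - a non-empty character string or a character string containing a regular expression (for fixed = FALSE) to be matched. Coerced by as.character to a string if possible". As I understand it, I would need a for loop, which can only match one title against the other dataset at a time. Given the size of the datasets, I would like to avoid a loop.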

Tags: r string matching

【Solution 1】:

Try the agrep function:

title <- c("The Bank", "The Cheat", "The Rink", "The Ring", "Toy Story", "Toy Story 2")
for(i in seq_along(title)){
    x <- agrep(title[i], title[-i], value = TRUE)   
    cat("Title :", title[i], " matched to ", x, "\n")
}
Title : The Bank  matched to   
Title : The Cheat  matched to   
Title : The Rink  matched to  The Ring 
Title : The Ring  matched to  The Rink 
Title : Toy Story  matched to  Toy Story 2 
Title : Toy Story 2  matched to  Toy Story
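
The loop above compares titles within a single vector and does not check "release_date". As a complementary sketch (not the answerer's method), one way to match across two data frames while keeping memory in check is to deduplicate the (title, release_date) pairs, split them by release year, and compute Levenshtein distances with base R's adist only within each year. The helper name match_titles_by_year, the toy data frames df1 and df2, and the max_dist threshold are assumptions made for illustration.

```r
# Toy stand-ins for the two data frames (hypothetical values, not the asker's data)
df1 <- data.frame(
  title        = c("Toy Story ", "The Rink ", "Easy Street "),
  release_date = c("1995", "1916", "1917"),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  title        = c("The Ring ", "The Fireman ", "Easy Streat ", "Toy Story "),
  release_date = c("1916", "1916", "1917", "1996"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: fuzzy-match titles, but only within the same release year.
match_titles_by_year <- function(a, b, max_dist = 1) {
  # Unique, whitespace-trimmed (title, year) pairs shrink the problem a lot
  # when the same movie appears once per rating.
  a <- unique(data.frame(title = trimws(a$title), release_date = a$release_date,
                         stringsAsFactors = FALSE))
  b <- unique(data.frame(title = trimws(b$title), release_date = b$release_date,
                         stringsAsFactors = FALSE))
  years <- intersect(a$release_date, b$release_date)
  pieces <- lapply(years, function(y) {
    ta <- a$title[a$release_date == y]
    tb <- b$title[b$release_date == y]
    # Distance matrix for this year only: length(ta) x length(tb)
    d <- adist(ta, tb, ignore.case = TRUE)
    hit <- which(d <= max_dist, arr.ind = TRUE)
    data.frame(release_date = rep(y, nrow(hit)),
               title1 = ta[hit[, 1]],
               title2 = tb[hit[, 2]],
               dist = d[hit],
               stringsAsFactors = FALSE)
  })
  do.call(rbind, pieces)
}

matches <- match_titles_by_year(df1, df2, max_dist = 1)
print(matches)
```

On this toy input the sketch pairs "The Rink" with "The Ring" and "Easy Street" with "Easy Streat", and it skips "Toy Story" because the release years differ. The result can then be joined back to the full rating rows with merge(), keeping in mind that the titles in it have been trimmed. A single year with very many distinct titles could still produce a large distance matrix, so the threshold and the blocking column may need tuning on real data.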

【Comments】: