【Question title】: Matching string with LEVENSHTEIN algorithm
【Posted】: 2019-01-15 03:56:35
【Question description】:

create table tbl1
(
    name varchar(50)
);

insert into tbl1 values ('Mircrosoft SQL Server'),
                        ('Office Microsoft');

create table tbl2
(
    name varchar(50)
);

insert into tbl2 values ('SQL Server Microsoft'),
                        ('Microsoft Office');

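I want to get the percentage of matching strings between the name columns of the two tables.

I tried the LEVENSHTEIN algorithm. However, in the given data the values are the same between the two tables, just with the words in a different order, so I want to see the output as a 100% match.

Tried: LEVENSHTEIN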

SELECT  [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) MatchedPercentage,a.name as tbl1_name,b.name as tbl2_name
FROM tbl1 a
CROSS JOIN tbl2 b 
WHERE [dbo].[GetPercentageOfTwoStringMatching](a.name , b.name) >= 0;

Result:

MatchedPercentage   tbl1_name               tbl2_name
-----------------------------------------------------------------
5                   Mircrosoft SQL Server   SQL Server Microsoft
10                  Office Microsoft        SQL Server Microsoft
15                  Mircrosoft SQL Server   Microsoft Office
13                  Office Microsoft        Microsoft Office
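
Low scores for reordered words are expected from a character-level edit distance, because moving a whole word costs many single-character edits. The following is a minimal Python sketch of the standard dynamic-programming Levenshtein distance. The function names and the scoring formula (1 minus distance divided by the longer length) are assumptions for illustration only; the body of GetPercentageOfTwoStringMatching is not shown in the question, so this is not that function.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity_percent(a: str, b: str) -> float:
    """Hypothetical scoring: 100 * (1 - distance / length of the longer string)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

Calling similarity_percent("Office Microsoft", "Microsoft Office") returns a value below 100 even though both strings contain the same words, which is why a word-level comparison fits this requirement better.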

【Discussion】:

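The first thing you need to do is define the algorithm you want. By "the data is the same between tables but in a different order", do you mean characters or words?
@Ben, words.
@Squirrel That is a very short-sighted and lazy suggestion. GetPercentageOfTwoStringMatching is hugely over-engineered (not to mention ambiguous to any future maintainer) just for finding matching words in two strings.
@iamdave. Yes. That was a comment, not an answer.
@Squirrel It is still bad advice, so it should not have been given in the first place. Especially if you know it is bad advice, because the OP may not have the experience to know better.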

Tags: sql-server sql-server-2008-r2

【Solution 1】:

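As mentioned in the comments, this can be achieved with a string-splitting table-valued function. Personally, I use one based on a very performant, set-based tally table approach put together by Jeff Moden, which is included at the end of my answer.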

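Using this function, you can compare the individual words, as delimited by a space character, and count the number of matches against the total number of words in the two values.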
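
To illustrate the word-level idea outside the database, here is a minimal Python sketch. The function name and the scoring formula (twice the shared words divided by the total words in both values) are assumptions chosen for illustration; the T-SQL in this answer only reports full matches rather than a percentage.

```python
from collections import Counter


def word_match_percent(a: str, b: str) -> float:
    """Percentage of words shared by a and b, ignoring word order.

    Hypothetical scoring: 2 * shared words / total words in both values.
    """
    words_a = Counter(a.split())  # split() with no argument also drops extra spaces
    words_b = Counter(b.split())
    total = sum(words_a.values()) + sum(words_b.values())
    if total == 0:
        return 100.0
    shared = sum((words_a & words_b).values())  # multiset intersection
    return 100.0 * 2 * shared / total
```

Counting with a multiset means a repeated word is only matched as many times as it appears in both values. Note that this comparison is case-sensitive, whereas in SQL Server the result of an equality test depends on the collation in use.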

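Do note, however, that this solution falls over with any values that have leading spaces. If this is going to be a problem, clean your data before running this script, or adjust the script to handle them: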
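
As a rough illustration of why a leading space is a problem, the Python sketch below splits on a single space, which is an assumption standing in for the split function used in this answer. The leading space produces an extra empty item, so the word count no longer equals the number of real words.

```python
value = " Microsoft Office"

# Splitting on a single space keeps an empty item for the leading space.
raw_items = value.split(" ")

# Trimming first (LTRIM/RTRIM in T-SQL) removes the spurious empty item.
clean_items = value.strip().split(" ")
```

Here raw_items has three entries, the first of them empty, while clean_items has the expected two.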

declare @t1 table(v nvarchar(50));
declare @t2 table(v nvarchar(50));

insert into @t1 values('Microsoft SQL Server'),('Office Microsoft'),('Other values');    -- Add in some extra values, with the same number of words and some with the same number of characters
insert into @t2 values('SQL Server Microsoft'),('Microsoft Office'),('that matched'),('that didn''t'),('Other valuee');

with c as
(
    select t1.v as v1
            ,t2.v as v2
            ,len(t1.v) - len(replace(t1.v,' ','')) + 1 as NumWords  -- String Length - String Length without spaces = Number of words - 1
    from @t1 as t1
        cross join @t2 as t2    -- Cross join the two tables to get all comparisons
    where len(replace(t1.v,' ','')) = len(replace(t2.v,' ','')) -- Where the length without spaces is the same. Can't have the same words in a different order if the number of non space characters in the whole string is different
)
select c.v1
        ,c.v2
        ,c.NumWords
        ,sum(case when s1.item = s2.item then 1 else 0 end) as MatchedWords
from c
    cross apply dbo.fn_StringSplit4k(c.v1,' ',null) as s1
    cross apply dbo.fn_StringSplit4k(c.v2,' ',null) as s2
group by c.v1
        ,c.v2
        ,c.NumWords
having c.NumWords = sum(case when s1.item = s2.item then 1 else 0 end);

Output

+----------------------+----------------------+----------+--------------+
|          v1          |          v2          | NumWords | MatchedWords |
+----------------------+----------------------+----------+--------------+
| Microsoft SQL Server | SQL Server Microsoft |        3 |            3 |
| Office Microsoft     | Microsoft Office     |        2 |            2 |
+----------------------+----------------------+----------+--------------+

Function

create function dbo.fn_StringSplit4k
(
     @str nvarchar(4000) = ' '              -- String to split.
    ,@delimiter as nvarchar(1) = ','        -- Delimiting value to split on.
    ,@num as int = null                     -- Which value to return.
)
returns table
as
return
                    -- Start tally table with 10 rows.
    with n(n)   as (select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1 union all select 1)

                    -- Select the same number of rows as characters in @str as incremental row numbers.
                    -- Cross joins increase exponentially to a max possible 10,000 rows to cover largest @str length.
        ,t(t)   as (select top (select len(isnull(@str,'')) a) row_number() over (order by (select null)) from n n1,n n2,n n3,n n4)

                    -- Return the position of every value that follows the specified delimiter.
        ,s(s)   as (select 1 union all select t+1 from t where substring(isnull(@str,''),t,1) = @delimiter)

                    -- Return the start and length of every value, to use in the SUBSTRING function.
                    -- ISNULL/NULLIF combo handles the last value where there is no delimiter at the end of the string.
        ,l(s,l) as (select s,isnull(nullif(charindex(@delimiter,isnull(@str,''),s),0)-s,4000) from s)

    select rn
          ,item
    from(select row_number() over(order by s) as rn
                ,substring(@str,s,l) as item
        from l
        ) a
    where rn = @num
        or @num is null;
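
For readers who find the CTE chain hard to follow, the same start-position and length logic can be sketched in Python. This is only an illustration of the idea behind the s and l CTEs, using zero-based positions and a hypothetical function name; it assumes a single-character delimiter, as the T-SQL function does.

```python
def split_like_tally(text: str, delimiter: str = ","):
    """Return (rn, item) pairs using start positions and lengths, like the CTEs s and l."""
    # Start positions: 0, plus the index just after every delimiter (CTE s, zero-based here).
    starts = [0] + [i + 1 for i, ch in enumerate(text) if ch == delimiter]
    items = []
    for rn, start in enumerate(starts, start=1):
        end = text.find(delimiter, start)  # like CHARINDEX from the start position
        if end == -1:                      # last item: no trailing delimiter
            end = len(text)
        items.append((rn, text[start:end]))
    return items
```

Each item is located by where it starts and where the next delimiter is, so no loop over the string is needed in the set-based T-SQL version.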

【Discussion】:

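I have the same scenario for large tables, and by large I mean table1 has 10 million records and table2 has 20 million records. What would be the most optimal solution? I tried the query above, but after more than half an hour it was still running.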
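@MAK Optimal depends entirely on your hardware and setup, so I'm afraid you will need to test against smaller sample datasets and work it out from there. What you are trying to do is complex and will therefore be resource intensive at the best of times. If this is a one-off exercise, you can prepare the data by doing some intermediate steps, saving the data in a table, and then indexing it for the next step. If this is a business-as-usual function, I would suggest you try to fix the problem at source.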
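
One possible reading of the "intermediate steps" advice, offered as an assumption rather than something spelled out in this thread: precompute a canonical key per row (the words in sorted order), store and index it, and then match on key equality instead of comparing every pair. A minimal Python sketch of that idea, with hypothetical function names:

```python
from collections import defaultdict


def word_key(value: str) -> tuple:
    """Canonical key: the words of a value in sorted order."""
    return tuple(sorted(value.split()))


def match_by_key(left, right):
    """Pair values from two lists that contain the same words in any order."""
    index = defaultdict(list)
    for value in right:  # one pass to build the lookup
        index[word_key(value)].append(value)
    return [(l, r) for l in left for r in index.get(word_key(l), [])]
```

This only finds full (100%) word matches, not partial percentages, but it replaces the pairwise comparison with a lookup, which is the kind of work an indexed key column would do in the database.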