How to compare two strings based on percent match in SQL

【Question title】: How to compare two strings based on percent match in SQL
【Posted】: 2020-11-20 20:45:29
【Question description】:

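I want to post a solution to an interesting problem I ran into in T-SQL.

Problem: compare two string fields and return a percent match. The words in the two strings may also appear in a different order.

For example: "Joni Bravo" and "Bravo Joni". These two strings should return a 100% match, which means word position is irrelevant. Note also that this code is meant for strings that use spaces as separators. If the first string contains no space, the match is set to 100% without any actual check. That case was left unimplemented because the strings this function compares always contain two or more words. The code was written on MS SQL Server 2017.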

【Question discussion】:

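This isn't really a good use case for SQL (although it can be done). Have you looked into full-text search?
I did try full-text and catalog-type searches, but in the end they did not meet the needs of what I was trying to do.
@GeorgiAngelov Very useful! But have you done any performance benchmarking, for example time against the number of records and the number of words or length of the strings?
@GeorgiAngelov I'm almost certain you can improve performance by using a Tally table instead of the while loops.
@DhruvJoshi Indeed, Tally tables are generally faster. I just haven't gotten to them, and it would be harder for me :) Anyone who can and wants to should adapt this code for better performance :)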
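
As a rough illustration of the Tally-table suggestion above, here is a minimal sketch that splits a space-separated string into words without a WHILE loop. It is not part of the posted answer. The variable `@name`, the CTE name `tally`, and the use of `sys.all_objects` as a row source are assumptions made for the example.

    DECLARE @name varchar(255) = 'Joni Bravo Smith';

    WITH tally AS (
        -- Numbers 1..255, enough to cover every position of a varchar(255) input.
        -- sys.all_objects is used only as a convenient source of rows (assumption).
        SELECT TOP (255) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS n
        FROM sys.all_objects
    )
    SELECT t.n AS start_pos,
           -- Take everything from position n up to the next space (or end of string).
           SUBSTRING(@name, t.n, CHARINDEX(' ', @name + ' ', t.n) - t.n) AS word
    FROM tally AS t
    WHERE t.n <= LEN(@name)
      AND SUBSTRING(' ' + @name, t.n, 1) = ' '   -- the character before position n is a space
      AND SUBSTRING(@name, t.n, 1) <> ' '        -- position n itself is not a space
    ORDER BY t.n;

Each returned row is one word, so the two word lists could be joined and scored with set-based queries instead of row-by-row loops. Whether that is actually faster for a given workload would need to be measured.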

Tags: sql sql-server

【Solution 1】:

So here is the solution. I hope it helps someone :) GL

    /****** Object:  UserDefinedFunction [dbo].[STRCOMP]    Script Date: 29/03/2018 15:31:45 ******/
    SET ANSI_NULLS ON
    GO
    
    SET QUOTED_IDENTIFIER ON
    GO
    
    CREATE FUNCTION [dbo].[STRCOMP] (
        -- Add the parameters for the function here
        @name_1 varchar(255),@name_2 varchar(255)
    )
    RETURNS float
    AS
    BEGIN
        

        -- Declare the working variables and the two word tables.
        declare @p int = 0;               -- start position of the current word
        declare @c int = 0;               -- length passed to SUBSTRING for the current word
        declare @br int = 0;              -- counts iterations with no space left; a split loop ends at 2
        declare @p_temp int = 0;
        declare @emergency_stop int = 0;  -- iteration counter that caps each split loop
        declare @fixer int = 0;
        declare @table1_temp table (
            row_id int identity(1,1),
            str1 varchar (255));
        declare @table2_temp table (
            row_Id int identity(1,1),
            str2 varchar (255));
        declare @n int = 1;
        declare @count int = 1;
        declare @result int = 0;
        declare @total_result float = 0;
        declare @result_temp int = 0;
        declare @variable float = 0.0;

        -- Clean the two strings: remove '!' and digits, and collapse double spaces.
        set @name_1 = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@name_1,'!',''),'  ',' '),'1',''),'2',''),'3',''),'4',''),'5',''),'0',''),'6',''),'7',''),'8',''),'9','');
        set @name_2 = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(@name_2,'!',''),'  ',' '),'1',''),'2',''),'3',''),'4',''),'5',''),'0',''),'6',''),'7',''),'8',''),'9','');

        -- Check whether the first string contains more than one word.
        -- If it does not, return 100% without comparing anything.
        set @c = charindex(' ',substring(@name_1,@p,len(@name_1)));

        IF(@c = 0)
        BEGIN
            RETURN 100.00
        END;

        -- Main logic. The comparison is based on sound indexing (SOUNDEX).
        -- The loops below split each string into whole words; DIFFERENCE then
        -- compares the SOUNDEX codes of the words against each other to produce
        -- a raw match score between @name_1 and @name_2.

        -- Split @name_1 into words. Stop after the last word, or after 20
        -- iterations as a safety cap.
        WHILE (@br != 2 and @emergency_stop < 20)
        BEGIN
            insert into @table1_temp(str1)
            select substring (@name_1,@p,@c);
            set @p = len(substring (@name_1,@p,@c))+2;
            set @p = @p + @p_temp - @fixer;
            set @p_temp = @p;
            set @c = CASE WHEN charindex(' ',substring(@name_1,@p,len(@name_1))) = 0 THEN len(@name_1) ELSE charindex(' ',substring(@name_1,@p,len(@name_1))) END;
            set @fixer = 1;
            set @br = CASE WHEN charindex(' ',substring(@name_1,@p,len(@name_1))) = 0 THEN @br + 1 ELSE 0 END;
            set @emergency_stop = @emergency_stop +1;
        END;

        -- Reset the state and split @name_2 the same way.
        set @p = 0;
        set @br = 0;
        set @emergency_stop = 0;
        set @fixer = 0;
        set @p_temp = 0;
        set @c = charindex(' ',substring(@name_2,@p,len(@name_2)));

        WHILE (@br != 2 and @emergency_stop < 20)
        BEGIN
            insert into @table2_temp(str2)
            select substring (@name_2,@p,@c);
            set @p = len(substring (@name_2,@p,@c))+2;
            set @p = @p + @p_temp - @fixer;
            set @p_temp = @p;
            set @c = CASE WHEN charindex(' ',substring(@name_2,@p,len(@name_2))) = 0 THEN len(@name_2) ELSE charindex(' ',substring(@name_2,@p,len(@name_2))) END;
            set @fixer = 1;
            set @br = CASE WHEN charindex(' ',substring(@name_2,@p,len(@name_2))) = 0 THEN @br + 1 ELSE 0 END;
            set @emergency_stop = @emergency_stop +1;
        END;

        -- For every word of @name_1, keep the best DIFFERENCE score against
        -- any word of @name_2 and add it to the running total.
        WHILE((select str1 from @table1_temp where row_id = @n) is not null)
        BEGIN
            set @count = 1;
            set @result = 0;
            WHILE((select str2 from @table2_temp where row_id = @count) is not null)
            BEGIN
                set @result_temp = DIFFERENCE((select str1 from @table1_temp where row_id = @n),(select str2 from @table2_temp where row_id = @count));
                IF(@result_temp > @result)
                BEGIN
                    set @result = @result_temp;
                END;

                set @count = @count + 1;
            END;

            set @total_result = @total_result + @result;
            set @n = @n + 1;
        END;

        -- Gather the results and turn them into a percent match: divide the
        -- total by the larger of the two word counts.
        set @variable = (select @total_result / (select max(row_count) from (
            select max(row_id) as row_count from @table1_temp
            union
            select max(row_id) as row_count from @table2_temp) a));
        -- DIFFERENCE scores range from 0 to 4, so divide by 4 to scale to 0..100.
        RETURN @variable/4 * 100;

    END
    GO

PS: I decided to write it as a user-defined function only because my project needed it that way.
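
For readers who want to try it, a short usage sketch follows. It assumes the function has been created as `dbo.STRCOMP` in the current database. The table variable `@people` and its sample rows are hypothetical and exist only for this example.

    -- Word order should not matter: the question expects 100 for this pair.
    SELECT dbo.STRCOMP('Joni Bravo', 'Bravo Joni') AS reordered_match;

    -- No space in the first argument: the function returns 100 without comparing.
    SELECT dbo.STRCOMP('Joni', 'Completely Different') AS single_word_shortcut;

    -- Applied per row to a hypothetical table variable (illustration only).
    DECLARE @people table (name_a varchar(255), name_b varchar(255));
    INSERT INTO @people (name_a, name_b)
    VALUES ('Joni Bravo', 'Jony Bravo'),
           ('Joni Bravo', 'Anna Smith');

    SELECT name_a, name_b, dbo.STRCOMP(name_a, name_b) AS match_pct
    FROM @people;

Because the first argument decides whether the single-word shortcut fires, swapping the two arguments can change the result when only one of them contains a space.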

【Discussion】:

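I don't understand this output at all. It never seems to go below 50%? So something like "batman" and "xxx zzz" returns 50%. How does that make sense? They have no characters in common at all. It seems to be comparing characters rather than whole words. Is that intentional? I can't make sense of the percentage results. Values like 'asdf cxxxccc' and 'fdsa asdf' return 75%. If it works for your case, great, but it doesn't make sense to me.
Hi @SeanLange. To understand how this comparison works, you should read about the built-in function DIFFERENCE. This code is based on it. Here is a link to the available documentation: docs.microsoft.com/en-us/sql/t-sql/functions/… Hope it helps.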
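
To see where the percentages come from, it may help to look at the raw values directly. The query below is a small sketch, not part of the original answer. It uses the words from the comment above and shows each word's SOUNDEX code next to the DIFFERENCE score, which is an integer from 0 to 4. The derived table alias `v` and its column names are made up for the example.

    -- Inspect the phonetic codes and similarity scores that drive the percentage.
    SELECT v.a,
           v.b,
           SOUNDEX(v.a)         AS soundex_a,
           SOUNDEX(v.b)         AS soundex_b,
           DIFFERENCE(v.a, v.b) AS difference_score   -- 0 = weakest, 4 = strongest
    FROM (VALUES ('asdf',    'fdsa'),
                 ('asdf',    'asdf'),
                 ('cxxxccc', 'fdsa'),
                 ('cxxxccc', 'asdf')) AS v(a, b);

The function keeps the best score per word of the first string, averages over the larger word count, and divides by 4. The score measures how similar the phonetic codes are, not how many characters the words share, which is why words with no letters in common can still contribute to the result.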