Extract phone number from noised string: answers

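【Question title】: Extract phone number from noised string
【Posted】: 2023-03-16 23:25:02
【Question】:

I have a column in a table that contains random data along with phone numbers in different formats. The column may contain:

Name
Phone
Email
HTML tags
Addresses (with numbers)

Examples: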

1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 
2) John Smith
3) xxx@yyy.com
4) John Smith 8 999 888 77 77

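The way phone numbers are written also varies. They may look like 8 927 410 00 22, 8(927)410-00-22, +7(927)410-00-22, +7 (927) 410-00-22, (927)410 00 22, 927 410 00 22, 9(2741) 0 0 0-22 and so on.

The general rule here is that a phone number contains 10-11 digits.

My best guess is to use regex: first remove email addresses from the string (because they can contain phone numbers, e.g. 79990001122@gmail.com), then use a regular expression that extracts the numbers, knowing that each one is 10 or 11 digits in a row separated by characters such as , ( ) + - (I don't think anyone would use . as a phone number separator, so we don't want to pick up IP addresses like 77.106.46.202 in the first example).

So the question is how to get the phone numbers out of these values.

The final values I want to get from the four examples above are: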

1) 79005346546 79005346546 79005346546 
2) 
3) 
4) 89998887777

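The server is Microsoft SQL Server 2014 - 12.0.2000.8 (X64) Standard Edition (64-bit)

【Comments】:

A phone number is a continuous stream of integers, ignoring any `, (, or )` between them. One approach is to remove any of these characters that sit between integers, then extract any string that contains only digits and is at least 6 characters long.
Regex would be the ideal tool for this problem, but SQL Server does not support regex.
@TimBiegeleisen I searched and found that we can add regex support to SQL Server using github.com/DevNambi/sql-server-regex/blob/master/examples/…
@KumarHarsh "[CLR/Regex] would also be efficient." Not as efficient as NGrams8K, not even close. I updated the code below to include a 1,000,000-row performance test. The NGrams solution is about 24 times faster.
@AlanBurstein, thank you very much. I read and tested your script.

Tags: sql sql-server regex database data-cleaning

【Solution 1】:

Update (20200226)

There have been a few comments suggesting that a CLR/regex solution might be faster than the ngrams8k solution I posted. I have been hearing this for six years, but every time, without exception, the test harness tells a different story. In an earlier comment I posted instructions for getting Microsoft's MDQ family of CLR regex functions up and running in a few minutes. They were developed, tested and tuned by Microsoft and ship with Master Data Services/Data Quality Services. I have used them for years and they are good.

RegexReplace/RegexSplit vs. PatExtract8k/DigitsOnlyEE: 1,000,000 rows

Obviously you don't want functions in a WHERE clause, but because my regex is rusty AF, I needed one. To level the playing field, I did the same thing with DigitsOnlyEE in the WHERE clause of the N-Gram solution.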

SET NOCOUNT ON;
DBCC FREEPROCCACHE    WITH NO_INFOMSGS;
DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS;
SET STATISTICS TIME ON;

DECLARE
  @newData BIT            = 0,
  @string  VARCHAR(8000)  = '1) Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 ',
  @pattern VARCHAR(50)    = '[^0-9()+.-]',
  @srchLen INT            = 11;

IF @newData = 1
BEGIN
  IF OBJECT_ID('tempdb..#strings','U') IS NOT NULL DROP TABLE #strings;

  SELECT 
    StringId = IDENTITY(INT,1,1),
    String   = REPLICATE(@string,ABS(CHECKSUM(NEWID())%3)+1)
  INTO   #strings
  FROM   dbo.rangeAB(1,1000000,1,1) AS r;
END

PRINT CHAR(10)+'Regex/CLR version Serial'+CHAR(10)+REPLICATE('-',90);
SELECT regex.NewString
FROM   #strings AS s
CROSS APPLY
(
  SELECT STRING_AGG(clr.RegexReplace(f.Token,'[^0-9]','',0),' ')
  FROM   clr.RegexSplit(s.string,@pattern,N'[0-9()+.-]',0) AS f
  WHERE  f.IsValid = 1
  AND    LEN(clr.RegexReplace(f.Token,'[^0-9]','',0)) = @srchLen
) AS regex(NewString);

PRINT CHAR(10)+'NGrams version Serial'+CHAR(10)+REPLICATE('-',90);
SELECT ngramsStuff.NewString
FROM   #strings AS s
CROSS APPLY
(
  SELECT      STRING_AGG(ee.digitsOnly,' ')
  FROM        samd.patExtract8K(@string,@pattern) AS pe
  CROSS APPLY samd.digitsOnlyEE(pe.item)          AS ee
  WHERE       LEN(ee.digitsOnly) = @srchLen
) AS ngramsStuff(NewString)
OPTION (MAXDOP 1);

SET STATISTICS TIME OFF;
GO

Test results

Regex/CLR version Serial
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 19918 ms,  elapsed time = 12355 ms.

NGrams version Serial
------------------------------------------------------------------------------------------
 SQL Server Execution Times: CPU time = 844 ms,  elapsed time = 971 ms.

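NGrams8k is very fast and does not require you to compile a new assembly, learn a new programming language, enable CLR, and so on. There are no garbage-collection issues. Even the CLR N-Grams function that ships with MDS/DQS cannot touch NGrams8k's performance (see the comments under my article).

End of update

First grab a copy of ngrams8k and use it to build PatExtract8k (DDL at the bottom of this post). Next, a quick warm-up: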

DECLARE
  @string  VARCHAR(8000)  = 'Call me later at 222-3333 or tomorrow at 312.555.2222, 
                             (313)555-6789, or at 1+800-555-4444 before noon. Thanks!',
  @pattern VARCHAR(50)    = '%[^0-9()+.-]%';


SELECT pe.itemNumber, pe.itemIndex, pe.itemLength, pe.item
FROM   samd.patExtract8K(@string,@pattern) AS pe
WHERE  pe.itemLength > 1;

Returns:

ItemNumber  ItemIndex   ItemLength  Item
----------- ----------- ----------- ----------------
1           18          8           222-3333
2           42          12          312.555.2222
3           91          13          (313)555-6789
4           112         14          1+800-555-4444
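
The benchmark in the update strips the separators from each extracted item with samd.digitsOnlyEE, which is not reproduced in this post. As a rough illustration of that step only, here is a minimal sketch that swaps in nested REPLACE calls. It assumes the only separators present are the five characters the warm-up pattern lets through, and the four items are hard-coded from the result set above so the query runs on its own:

```sql
-- Stand-in for the digits-only step; NOT samd.digitsOnlyEE.
-- Handles only the separators ( ) + . - and leaves anything else in place.
SELECT v.item,
       digitsOnly = REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
                      v.item, '(', ''), ')', ''), '+', ''), '.', ''), '-', '')
FROM (VALUES ('222-3333'),
             ('312.555.2222'),
             ('(313)555-6789'),
             ('1+800-555-4444')) AS v(item);
```

Filtering on LEN(digitsOnly) afterwards would then keep only the items with the wanted digit count.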

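Note that patExtract8K returns each matched item together with its ordinal number, its position in the string and its length. The first three attributes (ItemNumber, ItemIndex, ItemLength) can be leveraged for further processing, which brings us to your post. Note my comments: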

-- First for some easily consumable sample data. 
DECLARE @things TABLE (StringId INT IDENTITY, String VARCHAR(8000));
INSERT @things (String)
VALUES
('Call back from +79005346546, Conversation started<br>Phone: +79005346546<br>Called twice Came from google.com<br>IP: 77.106.46.202 the web page address is xxx.com utm_medium: cpc<br>utm_campaign: 32587871<br>utm_content: 5283041 79005346546 '),
('John Smith'),
('xxx@yyy.com'),
('John Smith 8 999 888 77 77');

DECLARE @SrchLen INT = 11;

SELECT
  StringId   = t.StringId, 
  ItemIndex  = pe.itemIndex,
  ItemLength = @SrchLen,
  Item       = i2.Item
FROM        @things AS t
CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]')                        AS pe
CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),@SrchLen)))        AS ns(NewString)
CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ',''))))      AS i2(Item)
WHERE       pe.itemLength >= @SrchLen;

Returns:

StringId    ItemIndex            ItemLength  Item
----------- -------------------- ----------- -----------
1           17                   11          79005346546
1           62                   11          79005346546
1           221                  11          79005346546
4           11                   11          89998887777

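Next we can handle the outer rows (strings with no phone number) and concatenate the matches for each string into a single column like this: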

WITH t AS
(
  SELECT      i2.Item, t.StringId
  FROM        @things AS t
  CROSS APPLY samd.patExtract8K(t.String,'[^0-9 ]')                        AS pe
  CROSS APPLY (VALUES(PATINDEX('%'+REPLICATE('[0-9]',@SrchLen), pe.item))) AS i(Idx)
  CROSS APPLY (VALUES(SUBSTRING(pe.Item,NULLIF(i.Idx,0),@SrchLen)))        AS ns(NewString)
  CROSS APPLY (VALUES(ISNULL(ns.NewString, REPLACE(pe.item,' ',''))))      AS i2(Item)
  WHERE       pe.itemLength >= @SrchLen
)
SELECT 
  StringId  = t2.StringId,
  NewString = ISNULL((
    SELECT t.item+' '
    FROM   t
    WHERE  t.StringId = t2.StringId
    FOR XML PATH('')),'')
FROM      @things AS t2
LEFT JOIN t       AS t1 ON t2.StringId = t1.StringId
GROUP BY  t2.StringId;

Returns:

StringId  NewString
--------- --------------------------------------
1         79005346546 79005346546 79005346546 
2         
3         
4         89998887777

I wish I had more time to go into more detail, but this took a bit longer than planned. Feel free to ask any questions.

patExtract8K:

CREATE FUNCTION samd.patExtract8K
(
  @string  VARCHAR(8000),
  @pattern VARCHAR(50)
)
/*****************************************************************************************
[Description]:
 This can be considered a T-SQL inline table valued function (iTVF) equivalent of 
 Microsoft's mdq.RegexExtract except that:

 1. It includes each matching substring's position in the string

 2. It accepts varchar(8000) instead of nvarchar(4000) for the input string, varchar(50)
    instead of nvarchar(4000) for the pattern

 3. The mask parameter is not required and therefore does not exist.

 4. You have to specify what text we're searching for as an exclusion; e.g. for numeric 
    characters you should search for '[^0-9]' instead of '[0-9]'. 

 5. There is no parameter for naming a "capture group". Using the variable below, both 
    the following queries will return the same result:

     DECLARE @string nvarchar(4000) = N'123 Main Street';

   SELECT item FROM samd.patExtract8K(@string, '[^0-9]');
   SELECT clr.RegexExtract(@string, N'(?<number>(\d+))(?<street>(.*))', N'number', 1);

 Alternatively, you can think of patExtract8K as Chris Morris' PatternSplitCM (found here:
 http://www.sqlservercentral.com/articles/String+Manipulation/94365/) but only returns the
 rows where [matched]=0. The key benefit is that it performs substantially better 
 because you are only returning the number of rows required instead of returning twice as
 many rows then filtering out half of them.  Furthermore, because we're 

 The following two sets of queries return the same result:

 DECLARE @string varchar(100) = 'xx123xx555xx999';
 BEGIN
 -- QUERY #1
 -- patExtract8K
   SELECT ps.itemNumber, ps.item 
   FROM samd.patExtract8K(@string, '[^0-9]') ps;

   -- patternSplitCM   
   SELECT itemNumber = row_number() over (order by ps.itemNumber), ps.item 
   FROM dbo.patternSplitCM(@string, '[^0-9]') ps
   WHERE [matched] = 0;

 -- QUERY #2
   SELECT ps.itemNumber, ps.item 
   FROM samd.patExtract8K(@string, '[0-9]') ps;

   SELECT itemNumber = row_number() over (order by itemNumber), item 
   FROM dbo.patternSplitCM(@string, '[0-9]')
   WHERE [matched] = 0;
 END;

[Compatibility]:
 SQL Server 2008+

[Syntax]:
--===== Autonomous
 SELECT pe.ItemNumber, pe.ItemIndex, pe.ItemLength, pe.Item
 FROM samd.patExtract8K(@string,@pattern) pe;

--===== Against a table using APPLY
 SELECT t.someString, pe.ItemIndex, pe.ItemLength, pe.Item
 FROM samd.SomeTable t
 CROSS APPLY samd.patExtract8K(t.someString, @pattern) pe;

[Parameters]:
 @string        = varchar(8000); the input string
 @pattern       = varchar(50); pattern to search for

[Returns]:
 itemNumber = bigint; the instance or ordinal position of the matched substring
 itemIndex  = bigint; the location of the matched substring inside the input string
 itemLength = int; the length of the matched substring
 item       = varchar(8000); the returned text

[Developer Notes]:
 1. Requires NGrams8k

 2. patExtract8K does not return any rows on NULL or empty strings. Consider using 
    OUTER APPLY or append the function with the code below to force the function to return 
    a row on empty or NULL inputs:

    UNION ALL SELECT 1, 0, NULL, @string WHERE nullif(@string,'') IS NULL;

 3. patExtract8K is not case sensitive; use a case sensitive collation for 
    case-sensitive comparisons

 4. patExtract8K is deterministic. For more about deterministic functions see:
    https://msdn.microsoft.com/en-us/library/ms178091.aspx

 5. patExtract8K performs substantially better with a parallel execution plan, often
    2-3 times faster. For queries that leverage patextract8K that are not getting a 
    parallel execution plan you should consider performance testing using Traceflag 8649 
    in Development environments and Adam Machanic's make_parallel in production. 

[Examples]:
--===== (1) Basic extract all groups of numbers:
  WITH temp(id, txt) as
 (
   SELECT * FROM (values
   (1, 'hello 123 fff 1234567 and today;""o999999999 tester 44444444444444 done'),
   (2, 'syat 123 ff tyui( 1234567 and today 999999999 tester 777777 done'),
   (3, '&**OOOOO=+ + + // ==?76543// and today !!222222\\\tester{}))22222444 done'))t(x,xx)
 )
 SELECT
   [temp.id] = t.id,
   pe.itemNumber,
   pe.itemIndex,
   pe.itemLength,
   pe.item
 FROM        temp AS t
 CROSS APPLY samd.patExtract8K(t.txt, '[^0-9]') AS pe;
-----------------------------------------------------------------------------------------
Revision History:
 Rev 00 - 20170801 - Initial Development - Alan Burstein
 Rev 01 - 20180619 - Complete re-write   - Alan Burstein
*****************************************************************************************/
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT itemNumber = ROW_NUMBER() OVER (ORDER BY f.position),
       itemIndex  = f.position,
       itemLength = itemLen.l,
       item       = SUBSTRING(f.token, 1, itemLen.l)
FROM
(
 SELECT ng.position, SUBSTRING(@string,ng.position,DATALENGTH(@string))
 FROM   samd.NGrams8k(@string, 1) AS ng
 WHERE  PATINDEX(@pattern, ng.token) <  --<< this token does NOT match the pattern
        ABS(SIGN(ng.position-1)-1) +    --<< are you the first row?  OR
        PATINDEX(@pattern,SUBSTRING(@string,ng.position-1,1)) --<< always 0 for 1st row
) AS f(position, token)
CROSS APPLY (VALUES(ISNULL(NULLIF(PATINDEX('%'+@pattern+'%',f.token),0),
  DATALENGTH(@string)+2-f.position)-1)) AS itemLen(l);
GO
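
patExtract8K calls samd.NGrams8k(@string, 1) and reads its position and token columns, but that function is not included in this post. For experimenting without the original, the following is a hypothetical minimal stand-in and not the published NGrams8k: it assumes the samd schema already exists, builds its numbers from an inline tally, and returns one row per @N-character token. It is schema-bound because patExtract8K is.

```sql
-- Hypothetical stand-in for samd.NGrams8k; NOT the published function.
-- Assumes the samd schema already exists.
CREATE FUNCTION samd.NGrams8k
(
  @string VARCHAR(8000), -- input string
  @N      INT            -- token length (patExtract8K passes 1)
)
RETURNS TABLE WITH SCHEMABINDING AS RETURN
WITH
L1(N) AS -- 10 rows
(
  SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS t(N)
),
L4(N) AS -- 10 * 10 * 10 * 10 = 10,000 rows, enough for VARCHAR(8000)
(
  SELECT 1 FROM L1 AS a CROSS JOIN L1 AS b CROSS JOIN L1 AS c CROSS JOIN L1 AS d
),
iTally(N) AS -- one number per token; no rows for NULL, empty or too-short input
(
  SELECT TOP (CASE WHEN @N BETWEEN 1 AND DATALENGTH(@string)
                   THEN DATALENGTH(@string) - @N + 1 ELSE 0 END)
         ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
  FROM   L4
)
SELECT position = t.N,                        -- 1-based start of the token
       token    = SUBSTRING(@string, t.N, @N) -- @N characters starting at position
FROM   iTally AS t;
GO
```

A quick check such as SELECT position, token FROM samd.NGrams8k('abc', 1) ORDER BY position; should list each character with its position. Create this function before patExtract8K, since the schema-bound patExtract8K needs it to exist. The timings quoted in the update were measured with the original function, so they say nothing about this stand-in.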

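【Discussion】:

【Solution 2】:

The following is not a direct answer to the question, but shows how this could be done in PostgreSQL, which has mature regex replace functionality. Presumably the solution could be adapted to SQL Server using some kind of CLR integration library, but I have no experience with that...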

SQL

SELECT REGEXP_REPLACE(
         REGEXP_REPLACE(
           REGEXP_REPLACE(phoneNumber, '((([0-9])[ ()+-]*){10,11})([^0-9]|$)', '`\1¬','g'),
           '(^|¬)[^`¬]*(`|$)', ',', 'g'),
         '(^,|,$|[^0-9,])', '', 'g')
FROM tbl;

Online demo

dbfiddle.uk demo: https://dbfiddle.uk/?rdbms=postgres_12&fiddle=b12d9f9779b686fd0c4aa84956595f70

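Explanation

The innermost REGEXP_REPLACE locates groups of 10 or 11 digits, where each digit may be followed by any number of spaces, parentheses, plus or minus characters. The group must be followed by a non-digit character or the end of the line. For each group found, a ` is placed before the digit group and a ¬ after it. You may need to change these markers to rarer characters - they must not appear anywhere else in the text.
The middle REGEXP_REPLACE replaces every block of text that is not between a pair of marker characters with a single comma.
The outermost REGEXP_REPLACE removes any comma at the start or end of the string, and removes anything that is not a digit or a comma.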
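
To see what each pass contributes, the three calls can be split into separate steps. This is a sketch under assumptions: it targets PostgreSQL, uses the same patterns as the query above, and replaces tbl with a hypothetical one-row inline table built from the question's fourth example.

```sql
-- Hypothetical one-row stand-in for tbl; swap in the real table to test other inputs.
WITH tbl(phoneNumber) AS (
  VALUES ('John Smith 8 999 888 77 77')
),
step1 AS (
  -- Pass 1: wrap each 10-11 digit group in the ` and ¬ markers.
  SELECT REGEXP_REPLACE(phoneNumber,
                        '((([0-9])[ ()+-]*){10,11})([^0-9]|$)',
                        '`\1¬', 'g') AS marked
  FROM tbl
),
step2 AS (
  -- Pass 2: collapse everything outside a marker pair into a single comma.
  SELECT marked,
         REGEXP_REPLACE(marked, '(^|¬)[^`¬]*(`|$)', ',', 'g') AS separated
  FROM step1
)
-- Pass 3: drop leading/trailing commas and every character that is not a digit or comma.
SELECT marked,
       separated,
       REGEXP_REPLACE(separated, '(^,|,$|[^0-9,])', '', 'g') AS phones
FROM step2;
```

Comparing the marked, separated and phones columns side by side makes it easier to spot which pass misbehaves if the marker characters happen to occur in the real data.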

【Discussion】: