Efficient string matching with SQL and Python

【Question title】: Efficient string Match with SQL and Python
【Posted】: 2020-05-31 03:34:00
【Question】:

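I'd like to know the best way to do string matching with Python and a PSQL database. My database contains pub names and zip codes. I want to check whether any entries refer to the same pub but are spelled differently.

Conceptually, I was thinking of looping through all the names and, for every other row in the same zip code, computing a string similarity metric with strsim. If the metric is above a threshold, I insert the pair into another SQL table that stores the match candidates.

I think I'm being inefficient. In "pseudocode", given pub_table, candidates_table and the JaroWinkler function, this is what I mean: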

from similarity.jarowinkler import JaroWinkler

jarowinkler = JaroWinkler()

# conn (an open psycopg2 connection) and threshold (the similarity
# cut-off) are assumed to be defined elsewhere
cur = conn.cursor()
cur.execute("SELECT name, zip FROM pub_table")
rows = cur.fetchall()
for r in rows:
    # one extra query per row: this is the part that feels inefficient
    cur.execute("SELECT name FROM pub_table WHERE zip = %s", (r[1],))
    search = cur.fetchall()

    for pub in search:
        if jarowinkler.similarity(r[0], pub[0]) > threshold:
            insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
                         "VALUES (%s, %s, %s)")
            cur.execute(insertion, (r[0], pub[0], r[1]))

conn.commit()
cur.close()
conn.close()

Apologies if this is unclear (I'm new here). Any guidance on string matching with PSQL and Python would be highly appreciated. Thank you.

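【Comments on the question】:

Where is the code for distance_metric?
Please consider it a given function (for completeness, I'm editing the question to use Jaro-Winkler). My struggle is with the pairing process, which I think is inefficient. Thanks, Tim.
In case you are not limited to Jaro-Winkler distance - PostgreSQL has built-in support for Levenshtein distance in its fuzzystrmatch module.
Thanks, Eugene! I plan to try it with several string distance metrics. Any comments or guidance you might have on best practices for string matching in SQL would be very useful.

Tags: python psycopg2 psql string-matching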

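【Solution 1】:

Both SELECT queries run against the same pub_table table. The inner loop repeats the second, zip-matching query once for every row of pub_table. You can get the zip equality comparison directly in a single query by doing an INNER JOIN of pub_table with itself.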

SELECT p1.name, p2.name, p1.zip
FROM   pub_table p1,
       pub_table p2
WHERE  p1.zip = p2.zip
AND    p1.name != p2.name  -- this line assumes your original pub_table
                           -- has unique names for each "same pub in same zip"
                           -- and prevents the entries from matching with themselves.

This reduces your code to just the outer query plus the inner check and insert, with no second query:

cur.execute("<my query as above>")
rows = cur.fetchall()
for r in rows:
    # r[0] and r[1] are the names. r[2] is the zip code
    if jarowinkler.similarity(r[0], r[1]) > threshold:
        insertion = ("INSERT INTO candidates_table (name1, name2, zip) "
                     "VALUES (%s, %s, %s)")
        # since r is already a tuple with the columns in the right order,
        # `(r[0], r[1], r[2])` can be replaced with just `r`, i.e. instead of
        #     cur.execute(insertion, (r[0], r[1], r[2]))
        # you can write:
        cur.execute(insertion, r)

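One more change: the insertion string is always the same, so you can move it to before the for loop and keep only the parameterized cur.execute(insertion, r) inside the loop. Otherwise you are just redefining the same static string over and over.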
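Putting the pieces of this answer together, here is a minimal, self-contained sketch. To keep it runnable without a PostgreSQL server, it makes several assumptions that are not in the original: it uses an in-memory sqlite3 database (so the placeholders are `?` rather than psycopg2's `%s`), it swaps in `difflib.SequenceMatcher` as a stand-in for `jarowinkler.similarity`, the 0.8 threshold and sample pub names are invented, and the join uses `p1.name < p2.name` instead of `!=` so that each pair is compared once rather than in both orders.

```python
import sqlite3
from difflib import SequenceMatcher

THRESHOLD = 0.8  # hypothetical cut-off; tune it for your data

# Self-join: one query yields every pair of distinct names sharing a zip.
PAIRS_QUERY = """
    SELECT p1.name, p2.name, p1.zip
    FROM   pub_table p1
    JOIN   pub_table p2 ON p1.zip = p2.zip
    WHERE  p1.name < p2.name  -- each unordered pair once, no self-matches
"""

# Defined once, outside any loop, as suggested above.
INSERTION = "INSERT INTO candidates_table (name1, name2, zip) VALUES (?, ?, ?)"


def similarity(a, b):
    """Stand-in for jarowinkler.similarity; returns a ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def find_candidates(conn, threshold=THRESHOLD):
    """Store and return the name pairs whose similarity exceeds threshold."""
    cur = conn.cursor()
    cur.execute(PAIRS_QUERY)
    matches = [r for r in cur.fetchall() if similarity(r[0], r[1]) > threshold]
    cur.executemany(INSERTION, matches)  # rows are already (name1, name2, zip)
    conn.commit()
    return matches


# Demo with made-up data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pub_table (name TEXT, zip TEXT)")
conn.execute("CREATE TABLE candidates_table (name1 TEXT, name2 TEXT, zip TEXT)")
conn.executemany(
    "INSERT INTO pub_table VALUES (?, ?)",
    [
        ("The Red Lion", "12345"),
        ("The Red Lyon", "12345"),
        ("Crown and Anchor", "12345"),
        ("The Red Lion", "99999"),
    ],
)
candidates = find_candidates(conn)
print(candidates)
```

With psycopg2 the structure should carry over unchanged apart from the connection setup and the `%s` placeholders; the similarity function can be replaced by whichever metric you are testing.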

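【Comments】:

Thanks! This is certainly useful for improving efficiency. Do you know where I could find material on this topic using SQL and Python? Thanks again.
It depends on the DB and API you are using. The optimization I described above applies to any DB + programming language combination. Consider learning more about DBs and SQL first; how to use SQL results smartly in your code comes after that. If you look at the actual optimization I gave, what I optimized was mainly your SQL, not your Python :-) The Python changes are a consequence of the SQL change. You could even swap out the fuzzy-matching algorithm entirely and it would not change the SQL part. (Unless you can do the fuzzy matching directly in SQL.)
If this helped, please read: What should I do when someone answers my question?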
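On the closing remark about doing the fuzzy matching directly in SQL: the sketch below illustrates the idea under stated assumptions. It again uses sqlite3 so that it runs anywhere, and registers a Python function as a SQL function named `similarity` (a hypothetical name), with `difflib.SequenceMatcher` standing in for the real metric. On PostgreSQL, the `levenshtein()` function from the fuzzystrmatch module mentioned in the question comments could play that role in the WHERE clause instead, once the extension is enabled; note that it returns an edit distance, so the comparison would be "below a maximum distance" rather than "above a similarity". The threshold and data are invented.

```python
import sqlite3
from difflib import SequenceMatcher


def similarity(a, b):
    """Stand-in similarity metric; returns a ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


db = sqlite3.connect(":memory:")
# Expose the Python function to SQL under the (hypothetical) name "similarity".
db.create_function("similarity", 2, similarity)

db.execute("CREATE TABLE pub_table (name TEXT, zip TEXT)")
db.execute("CREATE TABLE candidates_table (name1 TEXT, name2 TEXT, zip TEXT)")
db.executemany(
    "INSERT INTO pub_table VALUES (?, ?)",
    [
        ("The Red Lion", "12345"),
        ("The Red Lyon", "12345"),
        ("Crown and Anchor", "12345"),
    ],
)

# Pairing, filtering and inserting all happen in one statement,
# so no rows are shipped to Python and back.
db.execute(
    """
    INSERT INTO candidates_table (name1, name2, zip)
    SELECT p1.name, p2.name, p1.zip
    FROM   pub_table p1
    JOIN   pub_table p2 ON p1.zip = p2.zip
    WHERE  p1.name < p2.name
    AND    similarity(p1.name, p2.name) > ?
    """,
    (0.8,),  # hypothetical threshold
)
db.commit()

in_sql_candidates = db.execute(
    "SELECT name1, name2, zip FROM candidates_table"
).fetchall()
print(in_sql_candidates)
```

Whether this beats the Python loop depends on the database and the metric; the point is only that the pairing query stays the same while the filter moves into the WHERE clause.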