Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)

【Question title】: Sqlite with real "Full Text Search" and spelling mistakes (FTS+spellfix together)
【Posted】: 2019-03-19 01:42:59
【Question】:

Let's say we have 1 million rows like this:

import sqlite3
db = sqlite3.connect(':memory:')
c = db.cursor()
c.execute('CREATE TABLE mytable (id integer, description text)')
c.execute("INSERT INTO mytable VALUES (1, 'Riemann')")
c.execute("INSERT INTO mytable VALUES (2, 'All the Carmichael numbers')")

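Background:

I know how to do this with Sqlite:

Find a row with a single-word query, allowing up to a few spelling mistakes, using the spellfix module and Levenshtein distance (I have posted a detailed answer here about how to compile it, how to use it, ...):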

db.enable_load_extension(True)
db.load_extension('./spellfix')
c.execute("SELECT * FROM mytable WHERE editdist3(description, 'Riehmand') < 300"); print(c.fetchall())

#Query: 'Riehmand'
#Answer: [(1, 'Riemann')]

With 1M rows, this would be very slow! As detailed here, postgresql might be able to optimize this using trigrams. A fast solution available in Sqlite is to use a VIRTUAL TABLE USING spellfix:

c.execute('CREATE VIRTUAL TABLE mytable3 USING spellfix1')
c.execute("INSERT INTO mytable3(word) VALUES ('Riemann')")
c.execute("SELECT * FROM mytable3 WHERE word MATCH 'Riehmand'"); print(c.fetchall())

#Query: 'Riehmand'
#Answer: [('Riemann', 1, 76, 0, 107, 7)], working!

Find an expression with a query matching one or multiple words, using FTS ("Full Text Search"):

c.execute('CREATE VIRTUAL TABLE mytable2 USING fts4(id integer, description text)')
c.execute("INSERT INTO mytable2 VALUES (2, 'All the Carmichael numbers')")
c.execute("SELECT * FROM mytable2 WHERE description MATCH 'NUMBERS carmichael'"); print(c.fetchall())

#Query: 'NUMBERS carmichael'
#Answer: [(2, 'All the Carmichael numbers')]

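It is case-insensitive, and you can even use a query with the two words in the wrong order, etc.: FTS is quite powerful indeed. The drawback is that every query keyword must be spelled correctly, i.e. FTS alone does not allow spelling mistakes.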
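The limitation can be reproduced with FTS4 alone. The following is a minimal sketch, assuming the SQLite library bundled with Python was compiled with FTS4 support; the table name `docs` and the helper `fts_match` are hypothetical names used only for illustration.

```python
import sqlite3

# Assumption: the bundled SQLite library was compiled with FTS4 support.
db = sqlite3.connect(":memory:")
db.execute("CREATE VIRTUAL TABLE docs USING fts4(description)")
db.execute("INSERT INTO docs VALUES (?)", ("All the Carmichael numbers",))

def fts_match(query):
    """Return the descriptions matching an FTS4 query (hypothetical helper)."""
    sql = "SELECT description FROM docs WHERE description MATCH ?"
    return [row[0] for row in db.execute(sql, (query,))]

print(fts_match("NUMBERS carmichael"))   # correct spelling, any case or order
print(fts_match("NUMMBER carmickaeel"))  # misspelled terms find nothing
```

The first call finds the row; the second returns an empty list, because neither misspelled token exists in the index.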

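Question:

How to do a Full Text Search (FTS) with Sqlite and also allow spelling mistakes? i.e. "FTS + spellfix" together

Example:

row in the DB: "All the Carmichael numbers"
query: "NUMMBER carmickaeel" should match it!

How to do this with Sqlite?

It is probably possible with Sqlite, since this page states:

Or, it [spellfix] could be used with FTS4 to do full-text search using potentially misspelled words.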

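【Question comments】:

Why not use MATCH with a spellfix virtual table instead of editdist? It will be a lot faster. (In my experience, it is nearly instant on tables with a few hundred thousand rows.)
It doesn't really answer your actual question (I think the answer is that there is no practical way if the auxiliary table described in the documentation doesn't do what you want). It is just odd to see an example of using spellfix for approximate matching without actually using spellfix. The spellfix documentation you linked to has examples of the normal usage.
@Shawn I edited the question (see second part of first bullet point) to show an example of what you probably mean (using VIRTUAL TABLE USING spellfix1). The page about spellfix1 states "Or, it could be used with FTS4 to do full-text search using potentially misspelled words." but I couldn't find a way to make it work. Could you post an example?
I think the intent is that you add the FTS corpus to the spellfix table, and for each word you want to look up in the FTS corpus, you first match it against the spellfix table and use the first result in the FTS query. That doesn't seem very practical, though, especially when you are searching for more than one word.
Maybe, @Shawn, but I don't know how to do that... I will start a bounty as soon as I can, because this situation really matters in applications. Being able to find the row "All the Carmichael numbers" with the query "NUMMBER carmickaeel" (i.e. with both 1) not all of the words and 2) spelling mistakes) is the holy grail of text search in databases. So a ready-to-use Sqlite code example for this would be very interesting, and would give search a Google-like user experience.

Tags: python sqlite full-text-search levenshtein-distance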

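【Solution 1】:

The spellfix1 documentation actually tells you how to do this. From the Overview section:

If you intend to use this virtual table in cooperation with an FTS4 table (for spelling correction of search terms) then you might extract the vocabulary using an fts4aux table: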
INSERT INTO demo(word) SELECT term FROM search_aux WHERE col='*';

The SELECT term from search_aux WHERE col='*' statement extracts all the indexed tokens.

Connecting this with your examples, where mytable2 is your fts4 virtual table, you can create an fts4aux table and insert those tokens into your mytable3 spellfix1 table with:

CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2);
INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*';

You probably want to further qualify that query to skip any terms already inserted into spellfix1, otherwise you end up with duplicate entries:

INSERT INTO mytable3(word)
    SELECT term FROM mytable2_terms
    WHERE col='*' AND 
        term not in (SELECT word from mytable3_vocab);

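Now you can use mytable3 to map misspelled words to corrected tokens, then use those corrected tokens in a MATCH query against mytable2.

Depending on your needs, this may mean you have to do your own token handling and query building; there is no exposed fts4 query syntax parser. So your two-token search string needs to be split, each token run through the spellfix1 table to map it to an existing token, and those tokens then fed to the fts4 query.

Leaving the query syntax aside, doing the splitting in Python is easy enough: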

def spellcheck_terms(conn, terms):
    cursor = conn.cursor()
    base_spellfix = """
        SELECT :term{0} as term, word FROM mytable3
        WHERE word MATCH :term{0} and top=1
    """
    terms = terms.split()
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    query = " UNION ".join([
        base_spellfix.format(i + 1) for i in range(len(params))])
    cursor.execute(query, params)
    correction_map = dict(cursor)
    return " ".join([correction_map.get(t, t) for t in terms])

def spellchecked_search(conn, terms):
    corrected_terms = spellcheck_terms(conn, terms)
    cursor = conn.cursor()
    fts_query = 'SELECT * FROM mytable2 WHERE mytable2 MATCH ?'
    cursor.execute(fts_query, (corrected_terms,))
    return cursor.fetchall()

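This then returns [('All the Carmichael numbers',)] for spellchecked_search(db, "NUMMBER carmickaeel").

Keeping the spellcheck handling in Python then allows you to support more complex FTS queries as needed; you may have to reimplement the expression parser to do so, but at least Python gives you the tools to do just that.

A complete example, packaging the above approach into a class, which simply extracts terms as sequences of alphanumeric characters (which, by my reading of the expression syntax specs, suffices):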

import re
import sqlite3
import sys

class FTS4SpellfixSearch(object):
    def __init__(self, conn, spellfix1_path):
        self.conn = conn
        self.conn.enable_load_extension(True)
        self.conn.load_extension(spellfix1_path)

    def create_schema(self):
        self.conn.executescript(
            """
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data
                USING fts4(description text);
            CREATE VIRTUAL TABLE IF NOT EXISTS fts4data_terms
                USING fts4aux(fts4data);
            CREATE VIRTUAL TABLE IF NOT EXISTS spellfix1data
                USING spellfix1;
            """
        )

    def index_text(self, *text):
        cursor = self.conn.cursor()
        with self.conn:
            params = ((t,) for t in text)
            cursor.executemany("INSERT INTO fts4data VALUES (?)", params)
            cursor.execute(
                """
                INSERT INTO spellfix1data(word)
                SELECT term FROM fts4data_terms
                WHERE col='*' AND
                    term not in (SELECT word from spellfix1data_vocab)
                """
            )

    # fts3 / 4 search expression tokenizer
    # no attempt is made to validate the expression, only
    # to identify valid search terms and extract them.
    # the fts3/4 tokenizer considers any alphanumeric ASCII character
    # and character in the range U+0080 and over to be terms.
    if sys.maxunicode == 0xFFFF:
        # UCS2 build, keep it simple, match any UTF-16 codepoint 0080 and over
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\uffff]+")
    else:
        # UCS4
        _fts4_expr_terms = re.compile(u"[a-zA-Z0-9\u0080-\U0010FFFF]+")

    def _terms_from_query(self, search_query):
        """Extract search terms from a fts3/4 query

        Returns a list of terms and a template such that
        template.format(*terms) reconstructs the original query.

        terms using partial* syntax are ignored, as you can't distinguish
        between a misspelled prefix search that happens to match existing
        tokens and a valid spelling that happens to have 'near' tokens in
        the spellfix1 database that would not otherwise be matched by fts4

        """
        template, terms, lastpos = [], [], 0
        for match in self._fts4_expr_terms.finditer(search_query):
            token, (start, end) = match.group(), match.span()
            # skip columnname: and partial* terms by checking next character
            ismeta = search_query[end:end + 1] in {":", "*"}
            # skip digits if preceded by "NEAR/"
            ismeta = ismeta or (
                token.isdigit() and template and template[-1] == "NEAR"
                and "/" in search_query[lastpos:start])
            if token not in {"AND", "OR", "NOT", "NEAR"} and not ismeta:
                # full search term, not a keyword, column name or partial*
                terms.append(token)
                token = "{}"
            template += search_query[lastpos:start], token
            lastpos = end
        template.append(search_query[lastpos:])
        return terms, "".join(template)

    def spellcheck_terms(self, search_query):
        cursor = self.conn.cursor()
        base_spellfix = """
            SELECT :term{0} as term, word FROM spellfix1data
            WHERE word MATCH :term{0} and top=1
        """
        terms, template = self._terms_from_query(search_query)
        params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
        query = " UNION ".join(
            [base_spellfix.format(i + 1) for i in range(len(params))]
        )
        cursor.execute(query, params)
        correction_map = dict(cursor)
        return template.format(*(correction_map.get(t, t) for t in terms))

    def search(self, search_query):
        corrected_query = self.spellcheck_terms(search_query)
        cursor = self.conn.cursor()
        fts_query = "SELECT * FROM fts4data WHERE fts4data MATCH ?"
        cursor.execute(fts_query, (corrected_query,))
        return {
            "terms": search_query,
            "corrected": corrected_query,
            "results": cursor.fetchall(),
        }

And an interactive demo using the class:

>>> db = sqlite3.connect(":memory:")
>>> fts = FTS4SpellfixSearch(db, './spellfix')
>>> fts.create_schema()
>>> fts.index_text("All the Carmichael numbers")  # your example
>>> from pprint import pprint
>>> pprint(fts.search('NUMMBER carmickaeel'))
{'corrected': 'numbers carmichael',
 'results': [('All the Carmichael numbers',)],
 'terms': 'NUMMBER carmickaeel'}
>>> fts.index_text(
...     "They are great",
...     "Here some other numbers",
... )
>>> pprint(fts.search('here some'))  # edgecase, multiple spellfix matches
{'corrected': 'here some',
 'results': [('Here some other numbers',)],
 'terms': 'here some'}
>>> pprint(fts.search('NUMMBER NOT carmickaeel'))  # using fts4 query syntax 
{'corrected': 'numbers NOT carmichael',
 'results': [('Here some other numbers',)],
 'terms': 'NUMMBER NOT carmickaeel'}
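The demo does not show what the term extraction does with column filters, NEAR/n or prefix searches. The following is a standalone sketch adapted from the _terms_from_query method above for Python 3 only; the function name `terms_from_query` and the sample query are illustrative assumptions, not part of the original answer.

```python
import re

# Same character classes as the class above: ASCII alphanumerics plus
# anything at U+0080 and over (Python 3 strings are full Unicode).
_TERM_RE = re.compile("[a-zA-Z0-9\u0080-\U0010FFFF]+")
_KEYWORDS = {"AND", "OR", "NOT", "NEAR"}

def terms_from_query(search_query):
    """Split an fts3/4 query into correctable terms and a format template."""
    template, terms, lastpos = [], [], 0
    for match in _TERM_RE.finditer(search_query):
        token, (start, end) = match.group(), match.span()
        # column filters (name:) and prefix searches (term*) are left alone
        ismeta = search_query[end:end + 1] in {":", "*"}
        # so is the distance in NEAR/<n>
        ismeta = ismeta or (
            token.isdigit() and bool(template) and template[-1] == "NEAR"
            and "/" in search_query[lastpos:start])
        if token not in _KEYWORDS and not ismeta:
            terms.append(token)
            token = "{}"
        template += [search_query[lastpos:start], token]
        lastpos = end
    template.append(search_query[lastpos:])
    return terms, "".join(template)

terms, template = terms_from_query("title:linnux NEAR/2 kernel OR driv*")
print(terms)     # ['linnux', 'kernel']
print(template)  # title:{} NEAR/2 {} OR driv*
```

Only the two plain terms are candidates for correction; the column name, the NEAR distance and the prefix term stay in the template verbatim, so template.format(*corrected_terms) rebuilds a valid query.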

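【Discussion】:

@Basj: a single call to execute() easily beats multiple calls to execute(), so yes, the UNION query is going to be faster. The SELECT :termx, word ... form leaves room for misses, i.e. spelling errors that cannot be corrected. Without it, nummbers auoaaixao carrmichal would return only numbers and carmichael, with no row for the term in between, and you could not tell for certain which input mapped to which output. With the original, uncorrected term included, you can easily map terms to their corrections without having to care about terms that cannot be corrected.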
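The mapping described in this comment can be sketched in plain Python. The rows below are hypothetical (term, word) pairs standing in for what the UNION query would return, and `apply_corrections` is an illustrative name; no row comes back for the uncorrectable term.

```python
def apply_corrections(terms, rows):
    """Replace each term by its correction, keeping terms without one."""
    correction_map = dict(rows)  # {original term: corrected word}
    return [correction_map.get(term, term) for term in terms]

terms = ["nummbers", "auoaaixao", "carrmichal"]
# Hypothetical result rows: nothing is returned for "auoaaixao".
rows = [("nummbers", "numbers"), ("carrmichal", "carmichael")]
print(apply_corrections(terms, rows))
# ['numbers', 'auoaaixao', 'carmichael']
```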
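@Basj: I tested this with 1000 randomly generated "words" against the database generated here. The function using the UNION-based query took 2.6 ms per test execution (100 repetitions), while the function that loops over the terms and executes a separate query for each took 2.9 ms (same number of repetitions). This is a limited test, and the limited spellfix1 table generated here produced only a single result for those 100 random terms, but it is a good indication of why you would want to use UNION here.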

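【Solution 2】:

The accepted answer is good (full credit to him); here is a slight variation that, although not as complete as the accepted one for complex cases, helps to grasp the idea: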

import sqlite3
db = sqlite3.connect(':memory:')
db.enable_load_extension(True)
db.load_extension('./spellfix')
c = db.cursor()
c.execute("CREATE VIRTUAL TABLE mytable2 USING fts4(description text)")
c.execute("CREATE VIRTUAL TABLE mytable2_terms USING fts4aux(mytable2)")
c.execute("CREATE VIRTUAL TABLE mytable3 USING spellfix1")
c.execute("INSERT INTO mytable2 VALUES ('All the Carmichael numbers')")   # populate the table
c.execute("INSERT INTO mytable2 VALUES ('They are great')")
c.execute("INSERT INTO mytable2 VALUES ('Here some other numbers')")
c.execute("INSERT INTO mytable3(word) SELECT term FROM mytable2_terms WHERE col='*'")

def search(query):
    # Correcting each query term with spellfix table
    correctedquery = []
    for t in query.split():
        spellfix_query = "SELECT word FROM mytable3 WHERE word MATCH ? and top=1"
        c.execute(spellfix_query, (t,))
        r = c.fetchone()
        correctedquery.append(r[0] if r is not None else t)  # correct the word if any match in the spellfix table; if no match, keep the word spelled as it is (then the search will give no result!)

    correctedquery = ' '.join(correctedquery)

    # Now do the FTS
    fts_query = 'SELECT * FROM mytable2 WHERE description MATCH ?'
    c.execute(fts_query, (correctedquery,))
    return {'result': c.fetchall(), 'correctedquery': correctedquery, 'query': query}

print(search('NUMBBERS carmickaeel'))
print(search('some HERE'))
print(search('some qsdhiuhsd'))

Here is the result:

{'query': 'NUMBBERS carmickaeel', 'correctedquery': u'numbers carmichael', 'result': [(u'All the Carmichael numbers',)]}
{'query': 'some HERE', 'correctedquery': u'some here', 'result': [(u'Here some other numbers',)]}
{'query': 'some qsdhiuhsd', 'correctedquery': u'some qsdhiuhsd', 'result': []}

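Remark: note that the "correcting each query term with the spellfix table" part is done with one SQL query per term. The performance of this versus a single UNION SQL query is studied here.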
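For comparison, the single-query variant could be built as follows. This is a sketch that adapts the accepted answer's UNION approach to the mytable3 table used here; `build_union_query` is a hypothetical helper that only constructs the SQL text and its named parameters, so it runs without the spellfix extension.

```python
def build_union_query(terms, table="mytable3"):
    """Build one UNION query (plus named parameters) covering all terms."""
    select = ("SELECT :term{0} AS term, word FROM {1} "
              "WHERE word MATCH :term{0} AND top=1")
    params = {"term{}".format(i): t for i, t in enumerate(terms, 1)}
    query = " UNION ".join(
        select.format(i, table) for i in range(1, len(terms) + 1))
    return query, params

query, params = build_union_query(["NUMBBERS", "carmickaeel"])
print(query)
print(params)
```

With the extension loaded, c.execute(query, params) followed by dict(c) would give the term-to-correction map in one round trip. An empty term list produces an empty query string, so the caller should skip the lookup in that case.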

【Discussion】: