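Mimic Elasticsearch MatchQuery

【Question title】: Mimic Elasticsearch MatchQuery
【Posted】: 2018-07-30 21:28:39
【Question description】:

I am currently writing a program that uses Elasticsearch as its backend database/search index. I would like to mimic the functionality of the /_search endpoint, which currently uses a match query: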

{
    "query": {
        "match" : {
            "message" : "Neural Disruptor"
        }
    }
}

Running some sample queries against a large World of Warcraft database produced the following results:

   Search Term          Search Result      
------------------ ----------------------- 
 Neural Disruptor   Neural Needler         
 Lovly bracelet     Ruby Bracelet          
 Lovely bracelet    Lovely Charm Bracelet

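After looking through the Elasticsearch documentation, I found that the match query is fairly complex. What is the simplest way to emulate the match query in Java using only Lucene? (It appears to do some fuzzy matching as well as looking for terms.)

Importing the Elasticsearch code for MatchQuery (I believe org.elasticsearch.index.search.MatchQuery) does not seem that easy. It is heavily embedded in Elasticsearch and does not look like something that can easily be pulled out.

I don't need a foolproof "must match exactly what Elasticsearch matches" solution. I just need something close, or something that can fuzzy match and find the best match.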

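【Question discussion】:

The only way to do this is to parse the input and create a query_string query, which is Lucene's. The documentation says so (the match query is a subset of query_string). It is not trivial, though. I once had to do something similar: I used antlr to generate an AST, parsed it, and created something else from it.
It is not that easy, otherwise I would have done it. I had to read a book to implement what I mentioned above (in order to use antlr4). In your case, you could use an analyzer to tokenize the input, check for the specified operator (or use the default), and try to add the required boolean operators. On the other hand, Elasticsearch is open source; isn't it possible to locate and isolate the implementation in the source code?
As of now, the answers added so far do give some direction, but to award the full bounty I would like to see an actual QueryParser that generates a query returning results similar to Elasticsearch's.

Tags: java elasticsearch lucene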

【Solution 1】:

Anything sent to the q= parameter of the _search endpoint is used as-is by the query_string query (not org.elasticsearch.index.search.MatchQuery), which understands the Lucene expression syntax.

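The query parser syntax is defined in the Lucene project using JavaCC; the grammar is part of the Lucene source if you want to take a look. The end product is a class called QueryParser (see below).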

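The class in the ES source code responsible for parsing the query string is QueryStringQueryParser, which delegates to Lucene's QueryParser class (generated by JavaCC).

So basically, if you come up with a query string equivalent to the one passed to _search?q=..., you can use that query string with QueryParser.parse("query-string-goes-here") and run the resulting Query using Lucene alone.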

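【Discussion】:

So I would need to rip apart QueryStringParser, fill in a default context, and then run it to produce the query string? That looks very complicated and involves multiple embedded queries (MultiMatchQuery, which uses even more queries). The main problem is that all of these require a ShardContext, which seems to be specific to Elasticsearch and is very complex.
I would start with Lucene's QueryParser and not worry too much about QueryStringQueryParser. It is just a wrapper around Lucene's QueryParser that handles the ES-specific parameters of the query_string query, which Lucene knows nothing about anyway.
Is there a way to debug the actual query generated by the Elasticsearch API? The explain command does not seem to do much, and although this answer helps, I feel I am still a long way from a solution.

【Solution 2】:

It has been a while since I worked with Lucene directly, but what you want should, initially, be fairly straightforward. The basic behavior of a Lucene query is very similar to the match query (query_string is exactly equivalent to Lucene, but match is very close). I put together a small example that uses only Lucene (7.2.1), if you want to try it out. The main code is as follows: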

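// Imports needed by the methods below (Lucene 7.x).
// The methods are assumed to sit together inside one class.
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.RAMDirectory;

public static void main(String[] args) throws Exception {
    // Create the in-memory Lucene index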
    RAMDirectory ramDir = new RAMDirectory();

    // Create the analyzer (has default stop words)
    Analyzer analyzer = new StandardAnalyzer();

    // Create a set of documents to work with
    createDocs(ramDir, analyzer);

    // Query the set of documents
    queryDocs(ramDir, analyzer);
}

private static void createDocs(RAMDirectory ramDir, Analyzer analyzer) 
        throws IOException {
    // Setup the configuration for the index
    IndexWriterConfig config = new IndexWriterConfig(analyzer);
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);

    // IndexWriter creates and maintains the index
    IndexWriter writer = new IndexWriter(ramDir, config);

    // Create the documents
    indexDoc(writer, "document-1", "hello planet mercury");
    indexDoc(writer, "document-2", "hi PLANET venus");
    indexDoc(writer, "document-3", "howdy Planet Earth");
    indexDoc(writer, "document-4", "hey planet MARS");
    indexDoc(writer, "document-5", "ayee Planet jupiter");

    // Close down the writer
    writer.close();
}

private static void indexDoc(IndexWriter writer, String name, String content) 
        throws IOException {
    Document document = new Document();
    document.add(new TextField("name", name, Field.Store.YES));
    document.add(new TextField("body", content, Field.Store.YES));

    writer.addDocument(document);
}

private static void queryDocs(RAMDirectory ramDir, Analyzer analyzer) 
        throws IOException, ParseException {
    // IndexReader maintains access to the index
    IndexReader reader = DirectoryReader.open(ramDir);

    // IndexSearcher handles searching of an IndexReader
    IndexSearcher searcher = new IndexSearcher(reader);

    // Setup a query
    QueryParser parser = new QueryParser("body", analyzer);
    Query query = parser.parse("hey earth");

    // Search the index
    TopDocs foundDocs = searcher.search(query, 10);
    System.out.println("Total Hits: " + foundDocs.totalHits);

    for (ScoreDoc scoreDoc : foundDocs.scoreDocs) {
        // Get the doc from the index by id
        Document document = searcher.doc(scoreDoc.doc);
        System.out.println("Name: " + document.get("name") 
                + " - Body: " + document.get("body") 
                + " - Score: " + scoreDoc.score);
    }

    // Close down the reader
    reader.close();
}

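The important parts for extending this will be the analyzer and an understanding of the Lucene query parser syntax.

Both indexing and querying use the Analyzer to tell them how to parse text, so that they treat text the same way. It sets up how to tokenize (what to split on, whether to lowercase, and so on). StandardAnalyzer splits on whitespace and a few other characters (I don't have the list handy) and also appears to apply lowercasing.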
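To make the analysis step concrete, here is a rough plain-Java stand-in for what the analyzer does to text. It is only an approximation: the class name `SimpleAnalyzerSketch` is made up for illustration, the real StandardAnalyzer follows Unicode word-break rules rather than a simple regular expression, and stop word removal is ignored here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Rough approximation of an analyzer: split on non-alphanumerics, then lowercase. */
final class SimpleAnalyzerSketch {

    static List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Split on any run of characters that are neither letters nor digits.
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("hi PLANET venus")); // [hi, planet, venus]
        System.out.println(analyze("document-1"));      // [document, 1]
    }
}
```

The second call shows why a value such as document-1 ends up as two separate terms once it has been analyzed.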

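The QueryParser will do some of the work for you. As you can see in my example above, I do two things: I tell the parser what the default field is, and then I pass it the string hey earth. The parser turns this into a query that looks like body:hey body:earth, which looks for documents that have hey or earth in body. Two documents will be found.

If we were to pass hey AND earth, the query would be parsed into something like +body:hey +body:earth, which requires a document to have both terms. Zero documents will be found.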
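The difference between the two forms can be illustrated without Lucene at all. The sketch below is a plain-Java simulation over the five sample bodies, not Lucene code: `BooleanOperatorSketch` and its methods are hypothetical names, analysis is reduced to lowercasing and whitespace splitting, and hits come back in insertion order rather than by score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class BooleanOperatorSketch {

    // The five sample bodies indexed in the example above, keyed by document name.
    static final Map<String, String> DOCS = new LinkedHashMap<>();
    static {
        DOCS.put("document-1", "hello planet mercury");
        DOCS.put("document-2", "hi PLANET venus");
        DOCS.put("document-3", "howdy Planet Earth");
        DOCS.put("document-4", "hey planet MARS");
        DOCS.put("document-5", "ayee Planet jupiter");
    }

    // Crude analysis: lowercase and split on whitespace.
    static List<String> terms(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    // requireAll = false behaves like "body:hey body:earth" (any term may match);
    // requireAll = true behaves like "+body:hey +body:earth" (every term must match).
    static List<String> search(String query, boolean requireAll) {
        List<String> queryTerms = terms(query);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> doc : DOCS.entrySet()) {
            List<String> docTerms = terms(doc.getValue());
            boolean matches = requireAll
                    ? docTerms.containsAll(queryTerms)
                    : queryTerms.stream().anyMatch(docTerms::contains);
            if (matches) {
                hits.add(doc.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(search("hey earth", false)); // [document-3, document-4]
        System.out.println(search("hey earth", true));  // []
    }
}
```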

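To apply the fuzzy option, add ~ to the term you want to be fuzzy. So if the query is hey~ earth, fuzziness is applied to hey and the query looks like body:hey~2 body:earth. Three documents will be found.

You can also write the query more directly and the parser will still handle things. So if you pass it hey name:\"document-1\" (whose tokens are split on -), it creates a query like body:hey name:"document 1". Two documents will be returned, because it looks for the phrase document 1 (since it still tokenizes on -). If I instead used hey name:document-1, it would write body:hey (name:document name:1), which returns all documents because they all have document as a term. There are some nuances to understand here.

I will try to go into a bit more detail on how they are similar, with reference to the match query. Elastic says the main difference is that "it doesn't support field name prefixes, wildcard characters, or other 'advanced' features." These will probably stand out more going in the other direction.

When working with an analyzed field, both the match query and the Lucene query take the query string and apply the analyzer to it (tokenizing it, lowercasing it, and so on). So both would turn HEY Earth into a query that looks for the terms hey or earth.

The match query can set the operator by supplying "operator" : "and". This changes our query to look for hey and earth. The analogue in Lucene is to do something like parser.setDefaultOperator(QueryParser.Operator.AND);

Next is fuzziness. Both work with the same settings. I believe Elastic's "fuzziness": "AUTO" is equivalent to Lucene's automatic behavior when ~ is applied to a query (though I think you have to add it to each term yourself, which is a bit cumbersome).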
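Adding the ~ to every term by hand can be automated. The following is a minimal sketch, assuming the thresholds that the Elasticsearch documentation gives for "fuzziness": "AUTO" (terms of 1 or 2 characters must match exactly, 3 to 5 characters allow one edit, longer terms allow two). The class and method names are hypothetical, and tokenization is a crude regular-expression split rather than a real analyzer. The resulting string is meant to be handed to QueryParser.parse.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class FuzzyQueryStringSketch {

    /** Edit distance per term, following the documented AUTO thresholds (assumption: defaults 3 and 6). */
    static int autoEdits(String term) {
        int length = term.codePointCount(0, term.length());
        if (length <= 2) {
            return 0;
        }
        return length <= 5 ? 1 : 2;
    }

    /** Turns free text into a query string with an explicit ~N suffix on each term. */
    static String toFuzzyQueryString(String userInput) {
        List<String> clauses = new ArrayList<>();
        // Keeping only letters and digits means no query syntax characters survive,
        // so nothing needs escaping afterwards.
        for (String piece : userInput.split("[^\\p{L}\\p{N}]+")) {
            if (piece.isEmpty()) {
                continue;
            }
            String term = piece.toLowerCase(Locale.ROOT);
            int edits = autoEdits(term);
            clauses.add(edits == 0 ? term : term + "~" + edits);
        }
        return String.join(" ", clauses);
    }

    public static void main(String[] args) {
        System.out.println(toFuzzyQueryString("Neural Disruptor")); // neural~2 disruptor~2
        System.out.println(toFuzzyQueryString("Lovly bracelet"));   // lovly~1 bracelet~2
        System.out.println(toFuzzyQueryString("hi earth"));         // hi earth~1
    }
}
```

With the parser from the example above, something like parser.parse(FuzzyQueryStringSketch.toFuzzyQueryString("Lovly bracelet")) would then produce a fuzzy OR query over the default field.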

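The zero terms query appears to be an Elastic construct. If you want the ALL setting, you would have to replicate the match all query yourself whenever the query parser removes all tokens from the query.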
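One way to replicate this is sketched below. It assumes a small illustrative stop word list (the real list depends on the analyzer in use) and made-up names throughout. When nothing survives analysis it falls back to *:*, which the classic Lucene query parser treats as a match-all query; otherwise an empty Optional signals that there is nothing to search for.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.Set;

final class ZeroTermsSketch {

    // Illustrative subset of English stop words, not the analyzer's actual list.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("a", "an", "and", "be", "is", "it", "not", "of", "or", "the", "to"));

    static Optional<String> toQueryString(String userInput, boolean matchAllOnZeroTerms) {
        List<String> terms = new ArrayList<>();
        for (String piece : userInput.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!piece.isEmpty() && !STOP_WORDS.contains(piece)) {
                terms.add(piece);
            }
        }
        if (terms.isEmpty()) {
            // Every token was removed: either match everything or report "no query".
            return matchAllOnZeroTerms ? Optional.of("*:*") : Optional.empty();
        }
        return Optional.of(String.join(" ", terms));
    }

    public static void main(String[] args) {
        System.out.println(toQueryString("To be or not to be", true));  // Optional[*:*]
        System.out.println(toQueryString("To be or not to be", false)); // Optional.empty
        System.out.println(toQueryString("the hey earth", true));       // Optional[hey earth]
    }
}
```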

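The cutoff frequency query appears to be related to CommonTermsQuery. I have not used it, so you may need to do some digging if you want to use it.

Lucene has a synonym filter that can be applied to an analyzer, but you may need to build the map yourself.

Where you may find differences is in the scoring. When I run the Lucene query hey earth, document-3 and document-4 both come back with a score of 1.3862944. When I run the query in the following form: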

curl -XPOST http://localhost:9200/index/_search?pretty -d '{
  "query" : {
    "match" : {
      "body" : "hey earth"
    }
  }
}'

I get the same documents, but with a score of 1.219939. You can run an explain on each of them: in Lucene, by printing each document with

System.out.println(searcher.explain(query, scoreDoc.doc));

and in Elastic, by querying each document like this:

curl -XPOST http://localhost:9200/index/docs/3/_explain?pretty -d '{
  "query" : {
    "match" : {
      "body" : "hey earth"
    }
  }
}'

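I get some differences, but I cannot explain them exactly. I do get the value of 1.3862944 for the document, but the fieldLength is different, which affects the weight.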
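To see how fieldLength alone can move the score, here is a small worked sketch of the BM25 formula as implemented by Lucene 7's default BM25Similarity (k1 = 1.2, b = 0.75). The inputs are assumptions for illustration: 5 documents, a term that occurs once in exactly one document, and an average field length of 3. A fieldLength of 3 reproduces 1.3862944. A fieldLength of 4 is a guess that happens to give a value close to the 1.219939 reported above; the explain output does not confirm that this is what Elasticsearch actually used.

```java
import java.util.Locale;

final class Bm25Sketch {

    static final double K1 = 1.2;  // BM25Similarity default
    static final double B = 0.75;  // BM25Similarity default

    // idf = ln(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5))
    static double idf(long docCount, long docFreq) {
        return Math.log(1 + (docCount - docFreq + 0.5) / (docFreq + 0.5));
    }

    // tfNorm = freq * (k1 + 1) / (freq + k1 * (1 - b + b * fieldLength / avgFieldLength))
    static double tfNorm(double freq, double fieldLength, double avgFieldLength) {
        return (freq * (K1 + 1)) / (freq + K1 * (1 - B + B * fieldLength / avgFieldLength));
    }

    static double score(long docCount, long docFreq, double freq,
                        double fieldLength, double avgFieldLength) {
        return idf(docCount, docFreq) * tfNorm(freq, fieldLength, avgFieldLength);
    }

    public static void main(String[] args) {
        // Field length equal to the average: tfNorm is 1, so the score is just the idf, ln(4).
        System.out.println(String.format(Locale.ROOT, "%.6f", score(5, 1, 1, 3, 3))); // 1.386294
        // Same term statistics, but the document is treated as one token longer.
        System.out.println(String.format(Locale.ROOT, "%.6f", score(5, 1, 1, 4, 3))); // 1.219939
    }
}
```

Since only one of the two query terms matches each returned document, the document score is the score of that single term, which is why the idf value shows up directly as the Lucene score.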

【Discussion】: