Match all exact words in a query: answers

【Question title】: Match all exact words in a query
【Posted】: 2020-11-17 15:15:47
【Question】:

I want to use the Elasticsearch Java API to build a query that matches only (1) whole words and (2) only when every word of the search query is present. Here is an example:

Text:

hello wonderful world

These should match:

hello
hello wonderful
hello world
wonderful world
hello wonderful world
wonderful
world

These should not match:

hell
hello fniefsgbsugbs

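I tried the following parameters on the match query, but it still matches the two examples above.

This is the code that builds the query with the Elasticsearch 7.7.1 Java API: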

import org.elasticsearch.index.query.Operator;
import org.elasticsearch.index.query.QueryBuilders;
...

QueryBuilders.matchQuery(field, query)
            .autoGenerateSynonymsPhraseQuery(false)
            .fuzziness(0)
            .prefixLength(0)
            .fuzzyTranspositions(false)
            .operator(Operator.AND)
            .minimumShouldMatch("100%");

This generates the following query:

{
  "size": 100,
  "query": {
    "bool": {
      "filter": [
        {
          "match": {
            "searchableText": {
              "query": "hell",
              "operator": "AND",
              "fuzziness": "0",
              "prefix_length": 0,
              "max_expansions": 50,
              "minimum_should_match": "100%",
              "fuzzy_transpositions": false,
              "lenient": false,
              "zero_terms_query": "NONE",
              "auto_generate_synonyms_phrase_query": false,
              "boost": 1
            }
          }
        }
      ]
    }
  }
}

Can anyone help me find a good solution?

Edit: Here are the settings and mappings (I removed everything unrelated to searchableText to keep it as small as possible):

{
    "settings": {
      "analysis": {
        "normalizer": {
          "lowercase_normalizer": {
            "type": "custom",
            "filter": [
              "lowercase"
            ]
          }
        },
        "filter": {
          "german_stemmer": {
            "type": "stemmer",
            "language": "light_german"
          },
          "ngram_filter": {
            "type": "shingle",
            "max_shingle_size": 4,
            "min_shingle_size": 2,
            "output_unigrams": false,
            "output_unigrams_if_no_shingles": false
          }
        },
        "analyzer": {
          "german": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "german_synonyms",
              "german_stop",
              "german_keywords",
              "german_no_stemming",
              "german_stemmer"
            ]
          },
          "german_ngram": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "german_synonyms",
              "german_keywords",
              "german_no_stemming",
              "german_stemmer",
              "ngram_filter"
            ]
          }
        }
      }
    },
    "mappings": {
      "properties": {
        "description": {
          "type": "text",
          "copy_to": "searchableText",
          "analyzer": "german"
        },
        "name": {
          "type": "text",
          "copy_to": "searchableText",
          "analyzer": "german"
        },
        "userTags": {
          "type": "keyword",
          "copy_to": "searchableText",
          "normalizer": "lowercase_normalizer"
        },
        "searchableText": {
          "type": "text",
          "analyzer": "german",
          "fields": {
            "ngram": {
              "type": "text",
              "analyzer": "german_ngram"
            }
          }
        },
        "searches": {
          "type": "keyword",
          "copy_to": "searchableText",
          "normalizer": "lowercase_normalizer"
        }
      }
    }
  }

Edit 2: These are the filters mentioned:

"filter": {
    "german_stop": {
      "type": "stop",
      "stopwords": "_german_"
    },
    "german_stemmer": {
      "type": "stemmer",
      "language": "light_german"
    },
    "ngram_filter": {
      "type": "shingle",
      "max_shingle_size": 4,
      "min_shingle_size": 2,
      "output_unigrams": false,
      "output_unigrams_if_no_shingles": false
    }
}

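【Comments】:

Please add the index mapping for the searchableText field in particular, along with any settings relevant to that field.
Thanks for your reply, I added the settings. I hope this helps.
@Peter, I tried your mapping and sample document and it works the way you want. See my answer for more details.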

Tags: elasticsearch elasticsearch-java-api elasticsearch-7

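【Solution 1】:

I tried to create an index with your settings and mappings, but I got an error because the following filters were not provided. I created the index after removing them.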

"german_synonyms",
"german_stop",
"german_keywords",
"german_no_stemming",

I then indexed your sample document hello wonderful world and ran your search query. It works as you expect and returns no results for hell or hello fniefsgbsugbs. For example, when I send this query:

{
    "size": 100,
    "query": {
        "bool": {
            "filter": [
                {
                    "match": {
                        "searchableText": {
                            "query": "hello fniefsgbsugbs",
                            "operator": "AND",
                            "fuzziness": "0",
                            "prefix_length": 0,
                            "max_expansions": 50,
                            "minimum_should_match": "100%",
                            "fuzzy_transpositions": false,
                            "lenient": false,
                            "zero_terms_query": "NONE",
                            "auto_generate_synonyms_phrase_query": false,
                            "boost": 1
                        }
                    }
                }
            ]
        }
    }
}

it returns

"hits": {
        "total": {
            "value": 0,
            "relation": "eq"
        },
        "max_score": null,
        "hits": []
    }

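The result is the same for hell, whereas hello, hello wonderful, and the other terms that are expected to match do return results.

Edit: You are using the match query, which is analyzed. That is, it analyzes the search terms with the same analyzer that was applied to the field at index time, and then matches the search-time tokens against the index-time tokens.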
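The behaviour described above can be pictured with a small toy model. This is only a sketch and not Elasticsearch code: the class `WholeWordAndMatch` and its `analyze` method are hypothetical, and the stand-in analyzer merely lowercases and splits on whitespace (no stemming, synonyms, or stop words, all of which can change the tokens in a real index).

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Toy model only: mimics an analyzed match query with operator AND.
// The "analyzer" here is a simplified stand-in, not the german analyzer.
class WholeWordAndMatch {

    // Hypothetical stand-in analyzer: lowercase, then split on whitespace.
    static Set<String> analyze(String text) {
        Set<String> tokens = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // A document matches only if every query token equals some indexed token.
    static boolean matches(String indexedText, String query) {
        Set<String> indexTokens = analyze(indexedText);
        Set<String> queryTokens = analyze(query);
        return !queryTokens.isEmpty() && indexTokens.containsAll(queryTokens);
    }
}
```

In this model, `matches("hello wonderful world", "hell")` is false because tokens are compared as whole strings. If a real index behaves differently, the analyzer is probably emitting tokens other than the ones you expect.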

To debug issues like this properly, use the analyze API and inspect the tokens produced for your indexed document and for your search terms.
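As a rough illustration of such a check, the sketch below builds an `_analyze` request with the JDK's `java.net.http` classes (Java 11+). The class name `AnalyzeRequestSketch`, the base URL, and the index name are assumptions made for the example. Only the `_analyze` endpoint and its `field` and `text` body parameters come from Elasticsearch.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Sketch: build (but do not send) a request to the Elasticsearch _analyze API.
class AnalyzeRequestSketch {

    // Minimal JSON escaping for backslashes and double quotes only.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Body asking Elasticsearch to analyze `text` with the analyzer of `field`.
    static String body(String field, String text) {
        return "{\"field\":\"" + escape(field) + "\",\"text\":\"" + escape(text) + "\"}";
    }

    // baseUrl and index are placeholders supplied by the caller.
    static HttpRequest build(String baseUrl, String index, String field, String text) {
        return HttpRequest.newBuilder()
                .uri(URI.create(baseUrl + "/" + index + "/_analyze"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body(field, text)))
                .build();
    }
}
```

You could send the request with `HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString())`, once for hell and once for hello wonderful world, and compare the `tokens` arrays in the two responses.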

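【Comments】:

Thanks for your answer. My problem is that I do not want hell to match hello. Do you know a solution?
@Peter, hell does not match hello, I tried it again :). Can you provide the filters that are missing? My guess is that those filters produce different tokens, and that is what causes the problem.
@Peter, or you can simply try again with my settings and mappings, and you will see that it works as you require.
I think you are right, my settings are the reason it does not work. I have added the filters to my question as well. Could you take a look at them? Could it be the ngram_filter filter?
@Peter, I looked at your updated settings. Note that ngram_filter is not used in the german analyzer, which is the analyzer on searchableText. The german_ngram analyzer, which does use ngram_filter, is applied to the subfield named ngram, and you are not using that subfield in your query (at least not in the query you provided).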
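On the ngram_filter question in the comments above: despite its name, that filter is declared with type shingle, and a shingle filter joins whole tokens into word-level n-grams rather than cutting words into character fragments. The toy sketch below (hypothetical class `ShingleSketch`, plain whitespace splitting, not the Lucene implementation) shows what shingles of size 2 to 4 without unigrams look like for the sample text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy illustration of word-level shingles; not the Lucene ShingleFilter.
class ShingleSketch {

    // Returns every run of minSize..maxSize consecutive words, joined by a space.
    static List<String> shingles(String text, int minSize, int maxSize) {
        String[] words = text.trim().split("\\s+");
        List<String> out = new ArrayList<>();
        for (int start = 0; start < words.length; start++) {
            for (int size = minSize; size <= maxSize && start + size <= words.length; size++) {
                out.add(String.join(" ", Arrays.copyOfRange(words, start, start + size)));
            }
        }
        return out;
    }
}
```

For `shingles("hello wonderful world", 2, 4)` the sketch yields hello wonderful, hello wonderful world, and wonderful world. None of these is a partial word such as hell, so a shingle filter on its own would not explain hell matching hello.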

【Solution 2】:

For fields indexed as "keyword", I generally prefer the Query String query DSL over the match query. For example:

{
    "query" : {
        "query_string" : {
            "query" : "my_field:('hello', 'wonderful', 'world')"
        }
    }
}

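This will match all the combinations you listed as expected matches, and none of the ones you do not want. The words in parentheses relate to each other like SQL "IN", so any of the words appearing in the field will match the document. This format also gives you a lot of flexibility when building complex searches. Let me know if this helps.

【Comments】:

Thanks, I tried your suggestion, but hell still matches hello, which is not what I want. Do you know of any way to keep part of a word from matching?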