ElasticSearch: Partial/Exact Scoring with edge_ngram & fuzziness

【Question title】: ElasticSearch: Partial/Exact Scoring with edge_ngram & fuzziness
【Posted】: 2016-02-23 08:38:11
【Question】:

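In ElasticSearch, I am trying to get correct scoring using edge_ngram with fuzziness. I want exact matches to score highest and partial matches to score lower. My setup and the resulting scores are below.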

settings: {
          number_of_shards: 1,
          analysis: {
             filter: {
                ngram_filter: {
                   type: 'edge_ngram',
                   min_gram: 2,
                   max_gram: 20
                }
             },
             analyzer: {
                ngram_analyzer: {
                   type: 'custom',
                   tokenizer: 'standard',
                   filter: [
                      'lowercase',
                      'ngram_filter'
                   ]
                }
             }
          }
       },
    mappings: [{
          name: 'voter',
          _all: {
                'type': 'string',
                'index_analyzer': 'ngram_analyzer',
                'search_analyzer': 'standard'
             },
             properties: {
                last: {
                   type: 'string',
                   required : true,
                   include_in_all: true,
                   term_vector: 'yes',
                   index_analyzer: 'ngram_analyzer',
                   search_analyzer: 'standard'
                },
                first: {
                   type: 'string',
                   required : true,
                   include_in_all: true,
                   term_vector: 'yes',
                   index_analyzer: 'ngram_analyzer',
                   search_analyzer: 'standard'
                },

             }

       }]
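
To make the index-time behaviour of these settings concrete, here is a minimal Python sketch of the tokens that an edge_ngram filter with min_gram 2 and max_gram 20 would emit for "Michael". It is only an approximation: the helper names are hypothetical, and a whitespace split stands in for the standard tokenizer, which is an assumption that holds for simple single-word names.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the leading-edge n-grams of one token (hypothetical helper)."""
    upper = min(max_gram, len(token))
    return [token[:n] for n in range(min_gram, upper + 1)]


def analyze(text, min_gram=2, max_gram=20):
    """Rough stand-in for ngram_analyzer: split, lowercase, then edge n-gram.

    Assumption: whitespace splitting approximates the standard tokenizer.
    """
    grams = []
    for token in text.lower().split():
        grams.extend(edge_ngrams(token, min_gram, max_gram))
    return grams


print(analyze("Michael"))
# ['mi', 'mic', 'mich', 'micha', 'michae', 'michael']
```

Every prefix of the name becomes its own indexed term, so a query for "Mi" and a query for "Michael" each find an exactly matching term in the same document.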

After POSTing a document with the first name “Michael”, I run the query below, changing the query string to “Michael”, “Michae”, “Micha”, “Mich”, “Mic”, and “Mi” in turn.

GET voter/voter/_search
{
 "query": {
    "match": {
      "_all": {
        "query": "Michael",
        "fuzziness": 2,
        "prefix_length": 1
      }
    }
  }
}

My resulting scores are:

-"Michael": 0.19535106
-"Michae": 0.2242768
-"Micha": 0.24513611
-"Mich": 0.22340237
-"Mic": 0.21408978
-"Mi": 0.15438235

As you can see, the scores are not what I expected. I would like “Michael” to score highest and “Mi” to score lowest.

Any help would be appreciated!

【Comments】:

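Comparing scores across different queries is not meaningful (dig into the Lucene scoring function to see what query normalization does). Also, your fuzziness is probably muddying things, since every bigram is within two edits of every other bigram. Try removing the fuzziness and repeating your test.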
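
The point about edit distance can be illustrated with a small Python sketch. It uses plain Levenshtein distance and is not Lucene's implementation; the function names are hypothetical, and the list of indexed terms assumes the edge n-grams of "michael" described in the question. It lists which indexed terms fall within 2 edits of each query string while sharing the first character (prefix_length 1).

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def fuzzy_hits(query, indexed_terms, max_edits=2, prefix_length=1):
    """Indexed terms sharing the query's prefix and within max_edits of it."""
    q = query.lower()
    return [
        t for t in indexed_terms
        if t[:prefix_length] == q[:prefix_length]
        and levenshtein(q, t) <= max_edits
    ]


# Assumed index contents: the edge n-grams of "michael" (min_gram 2).
indexed = ["mi", "mic", "mich", "micha", "michae", "michael"]

# Any two bigrams are at most two substitutions apart.
print(levenshtein("mi", "xy"))

for query in ["Michael", "Michae", "Micha", "Mich", "Mic", "Mi"]:
    print(query, fuzzy_hits(query, indexed))
```

Under these assumptions each query string matches a different number of indexed terms, which is one reason the scores of the six queries are hard to compare directly.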

Tags: elasticsearch partial scoring exact-match

【Solution 1】:

One way to solve this is to add a raw version of the text to the mapping, like this:

                   last: {
                       type: 'string',
                       required : true,
                       include_in_all: true,
                       term_vector: 'yes',
                       index_analyzer: 'ngram_analyzer',
                       search_analyzer: 'standard',
                       "fields": {
                            "raw": { 
                               "type":  "string"  // indexed with the standard analyzer
                              }
                          }
                    },
                    first: {
                       type: 'string',
                       required : true,
                       include_in_all: true,
                       term_vector: 'yes',
                       index_analyzer: 'ngram_analyzer',
                       search_analyzer: 'standard',
                       "fields": {
                            "raw": { 
                               "type":  "string"  // indexed with the standard analyzer
                              }
                          }
                    },

You could also make the raw field exact by setting index: not_analyzed on it.

You can then query like this:

{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "_all": {
              "query": "Michael",
              "fuzziness": 2,
              "prefix_length": 1
            }
          }
        },
        {
          "match": {
            "last.raw": {
              "query": "Michael",
              "boost": 5
            }
          }
        },
        {
          "match": {
            "first.raw": {
              "query": "Michael",
              "boost": 5
            }
          }
        }
      ]
    }
  }
}

Documents that match more clauses get a higher score. You can set the boost according to your requirements.
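
If the boost needs tuning, it may help to generate the request body rather than edit it by hand. The following is a minimal sketch in Python that builds the same bool/should body as a plain dict; build_name_query and its parameters are hypothetical names for illustration, and nothing is sent to a cluster.

```python
import json


def build_name_query(name, exact_boost=5, fuzziness=2, prefix_length=1):
    """Build the bool/should body: one fuzzy _all clause plus boosted raw clauses."""
    fuzzy_clause = {
        "match": {
            "_all": {
                "query": name,
                "fuzziness": fuzziness,
                "prefix_length": prefix_length,
            }
        }
    }
    # One boosted clause per raw sub-field defined in the mapping.
    exact_clauses = [
        {"match": {field: {"query": name, "boost": exact_boost}}}
        for field in ("last.raw", "first.raw")
    ]
    return {"query": {"bool": {"should": [fuzzy_clause] + exact_clauses}}}


print(json.dumps(build_name_query("Michael"), indent=2))
```

Calling build_name_query("Michael", exact_boost=10) yields the same structure with a larger boost on the two raw clauses, which makes it easy to compare a few settings against the same documents.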

【Comments】:

Unfortunately, this did not fully work. It does give exact matches a higher score, but it does not rank partial matches sensibly.