【Question title】: Elasticsearch edge ngram not returning all expected results
【Posted】: 2021-01-18 01:00:37
【Question description】:

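I am struggling to understand some unexpected results from an Elasticsearch query. The following document is indexed in Elasticsearch.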

{
"group": "J00-I99", "codes": [
   { "id": "J15", "description": "hello world" },
   { "id": "J15.0", "description": "test one world" },
   { "id": "J15.1", "description": "test two world J15.0" },
   { "id": "J15.2", "description": "test two three world J15" },
   { "id": "J15.3", "description": "hello world J18 " },
    ............................ // Similar records here
   { "id": "J15.9", "description": "hello world new" },
   { "id": "J16.0", "description": "new description" }
]
}

My goal is to implement autocomplete, and for that I am using the n-gram approach. I do not want to use the completion suggester approach.

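I am currently facing two problems:

  1. Search query (on the id and description fields): J15

Expected result: all of the results above, including J15.
Actual result: only a few results (J15.0, J15.1, J15.8).

  2. Search query (on the id and description fields): test two

Expected result: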

{ "id": "J15.1", "description": "test two world J15.0" },
{ "id": "J15.2", "description": "test two three world J15" },

Actual result:

   { "id": "J15.0", "description": "test one world" },
   { "id": "J15.1", "description": "test two world J15.0" },
   { "id": "J15.2", "description": "test two three world J15" },

The mapping is defined as follows.

           {

                settings: {
                    number_of_shards: 1,
                    analysis: {
                        filter: {
                            ngram_filter: {
                                type: 'edge_ngram',
                                min_gram: 2,
                                max_gram: 20
                            }
                        },
                        analyzer: {
                            ngram_analyzer: {
                                type: 'custom',
                                tokenizer: 'standard',
                                filter: [
                                    'lowercase', 'ngram_filter'
                                ]
                            }
                        }
                    }
                },
                mappings: {
                    properties: {
                        group: {
                            type: 'text'
                        },
                        codes: {
                            type: 'nested',
                            properties: {
                                id: {
                                    type: 'text',
                                    analyzer: 'ngram_analyzer',
                                    search_analyzer: 'standard'
                                },
                                description: {
                                    type: 'text',
                                    analyzer: 'ngram_analyzer',
                                    search_analyzer: 'standard'
                                }
                            }
                        }
                    }
                }
            }

Search query:

GET myindex/_search
{
  "_source": {
    "excludes": [
      "codes"
    ]
  },
  "query": {
    "nested": {
      "path": "codes",
      "query": {
        "bool": {
          "should": [
            {
              "match": {
                "codes.description": "J15"
              }
            },
            {
              "match": {
                "codes.id": "J15"
              }
            }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}

Note: the document index will be large. Only sample data is shown here.

For the second problem, can I use multi_match with the AND operator, as below?

GET myindex/_search
{
  "_source": {
    "excludes": [
      "codes"
    ]
  },
  "query": {
    "nested": {
      "path": "codes",
      "query": {
        "bool": {
          "should": [
            {
              "multi_match": {
                    "query": "J15",
                    "fields": ["codes.id", "codes.description"],
                    "operator": "and"
                }
            }
          ]
        }
      },
      "inner_hits": {}
    }
  }
}

Any help would be greatly appreciated, as I am having a hard time solving this.

【Discussion】:

    Tags: elasticsearch autocomplete elasticsearch-query elasticsearch-mapping elasticsearch-analyzers


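    【Solution 1】:

    Adding another answer because this is a different issue; my first answer focused on the first problem.

    The reason your second query, test two, also returns test one world is that at index time you use ngram_analyzer, which uses the standard tokenizer to split the text into words, and your search analyzer is again standard. Because match uses the OR operator by default, a document matches as soon as any one of the search tokens is found. If you run the Analyze API on the indexed document and on the search term, you can see which tokens match: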

    {
       "text" : "test one world",
       "analyzer" : "standard"
    }
    

    which generates the tokens

    {
        "tokens": [
            {
                "token": "test",
                "start_offset": 0,
                "end_offset": 4,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "one",
                "start_offset": 5,
                "end_offset": 8,
                "type": "<ALPHANUM>",
                "position": 1
            },
            {
                "token": "world",
                "start_offset": 9,
                "end_offset": 14,
                "type": "<ALPHANUM>",
                "position": 2
            }
        ]
    }
    

    For your search term test two

    {
        "tokens": [
            {
                "token": "test",
                "start_offset": 0,
                "end_offset": 4,
                "type": "<ALPHANUM>",
                "position": 0
            },
            {
                "token": "two",
                "start_offset": 5,
                "end_offset": 8,
                "type": "<ALPHANUM>",
                "position": 1
            }
        ]
    }
    

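    As you can see, the test token is present in your document, so that document is returned. This can be solved by using the AND operator in the query, as shown below

    Search query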

    {
        "_source": {
            "excludes": [
                "codes"
            ]
        },
        "query": {
            "nested": {
                "path": "codes",
                "query": {
                    "bool": {
                        "must": {
                            "multi_match": {
                                "query": "test two",
                                "fields": [
                                    "codes.id",
                                    "codes.description"
                                ],
                                "operator" :"AND"
                            }
                        }
                    }
                },
                "inner_hits": {}
            }
        }
    }
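
    The OR versus AND behaviour described above can be reproduced with a small Python sketch. It is only an approximation under stated assumptions: a whitespace split stands in for the standard tokenizer, only one field is modelled, and the helper names (edge_ngrams, index_terms, matches) are hypothetical and not part of any Elasticsearch client.

```python
def edge_ngrams(token, min_gram=2, max_gram=20):
    """Return the leading-edge n-grams of one token, shortest first."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_terms(text):
    """Rough stand-in for ngram_analyzer: split, lowercase, edge n-grams."""
    terms = set()
    for word in text.lower().split():
        terms.update(edge_ngrams(word))
    return terms


def search_terms(text):
    """Rough stand-in for the standard search analyzer: split and lowercase."""
    return text.lower().split()


def matches(doc_text, query, operator="or"):
    """OR needs any query term in the indexed terms, AND needs all of them."""
    indexed = index_terms(doc_text)
    found = [term in indexed for term in search_terms(query)]
    return all(found) if operator == "and" else any(found)


docs = [
    "test one world",
    "test two world J15.0",
    "test two three world J15",
]

or_hits = [d for d in docs if matches(d, "test two", "or")]
and_hits = [d for d in docs if matches(d, "test two", "and")]

print(or_hits)   # all three descriptions, because each contains "test"
print(and_hits)  # only the two descriptions that also contain "two"
```

    With OR, test one world matches on the single token test; with AND it is dropped because two is not among its indexed terms.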
    

    And the search results

     "hits": [
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 2
                                        },
                                        "_score": 2.6901608,
                                        "_source": {
                                            "id": "J15.1",
                                            "description": "test two world J15.0"
                                        }
                                    },
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 3
                                        },
                                        "_score": 2.561376,
                                        "_source": {
                                            "id": "J15.2",
                                            "description": "test two three world J15"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
    

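    【Discussion】:

    • I understand the problem, but what I am looking for is a solution. I added to the question that I could use multi_match with the AND operator. Could you take a look at that (the last part)?
    • But I can't see your query; it should work. If you can add your query, I'll try to make it work.
    • It is already there, in the section "For the second problem, can I use multi_match with the AND operator, as below?". Please take a look.
    • Just tried it and it works fine, please see the update.
    • Yes, no doubt; as you can see, it returns the expected search results :) I used your mapping and data.
    【Solution 2】:

    The issue is that, by default, inner_hits returns only 3 matching documents, as stated in the official documentation: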

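    size

    The maximum number of hits to return per inner_hits. By default the top three matching hits are returned.

    Simply add the size parameter to your inner_hits to get all the search results.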

      "inner_hits": {
                    "size": 10 // note this
                }
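
    For completeness, here is a minimal Python sketch that builds the full request body with the size parameter in place. The function name nested_query and its arguments are hypothetical; the body simply mirrors the query from the question. As far as I know, from + size for inner hits is also capped by the index setting index.max_inner_result_window (default 100), so check the documentation for your version before asking for more than that.

```python
import json


def nested_query(term, inner_hits_size=3):
    """Build the nested search body; inner_hits_size=3 mirrors the default."""
    return {
        "_source": {"excludes": ["codes"]},
        "query": {
            "nested": {
                "path": "codes",
                "query": {
                    "bool": {
                        "should": [
                            {"match": {"codes.description": term}},
                            {"match": {"codes.id": term}},
                        ]
                    }
                },
                # Without an explicit size only the top 3 inner hits come back.
                "inner_hits": {"size": inner_hits_size},
            }
        },
    }


body = nested_query("J15", inner_hits_size=10)
print(json.dumps(body, indent=2))
```

    The printed JSON can be sent as the body of GET myindex/_search.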
    

    I tried this on your sample data. Below are the search results for your first query, which previously returned only 3 hits.

    First query search results

       "hits": [
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 2
                                        },
                                        "_score": 1.8687118,
                                        "_source": {
                                            "id": "J15.1",
                                            "description": "test two world J15.0"
                                        }
                                    },
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 3
                                        },
                                        "_score": 1.7934312,
                                        "_source": {
                                            "id": "J15.2",
                                            "description": "test two three world J15"
                                        }
                                    },
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 0
                                        },
                                        "_score": 0.29618382,
                                        "_source": {
                                            "id": "J15",
                                            "description": "hello world"
                                        }
                                    },
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 1
                                        },
                                        "_score": 0.29618382,
                                        "_source": {
                                            "id": "J15.0",
                                            "description": "test one world"
                                        }
                                    },
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 4
                                        },
                                        "_score": 0.29618382,
                                        "_source": {
                                            "id": "J15.3",
                                            "description": "hello world J18 "
                                        }
                                    },
                                    {
                                        "_index": "myindexedge64170045",
                                        "_type": "_doc",
                                        "_id": "1",
                                        "_nested": {
                                            "field": "codes",
                                            "offset": 5
                                        },
                                        "_score": 0.29618382,
                                        "_source": {
                                            "id": "J15.9",
                                            "description": "hello world new"
                                        }
                                    }
                                ]
                            }
                        }
                    }
                }
    

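    【Discussion】:

    • Thanks. Could you help me with the second problem mentioned?
    • @Vishnu Sure. Can you confirm whether your first query works now?
    • @opster-elastic-search-ninja Yes, I have verified it by adding size. A few questions: since the data is large, can I set a size greater than 100? Also, how can I add docvalue fields instead of fetching _source? Is that recommended?
    • @Vishnu You are asking many valuable follow-up questions that would get lost in these comments. I'd ask you to upvote and accept my answer, since it solved your problem, and to post the follow-ups as separate questions; once you give me the links, I'll answer them.
    • Actually, these points are not really worth another S.O. question, so could you explain them here, if you don't mind?
    【Solution 3】:

    Adding a working example with index mapping, search query and search results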

    Index mapping:

    {
      "settings": {
        "analysis": {
          "analyzer": {
            "my_analyzer": {
              "tokenizer": "my_tokenizer"
            }
          },
          "tokenizer": {
            "my_tokenizer": {
              "type": "edge_ngram",
              "min_gram": 2,
              "max_gram": 20,
              "token_chars": [
                "letter",
                "digit"
              ]
            }
          }
        },
        "max_ngram_diff": 50
      },
      "mappings": {
        "properties": {
          "group": {
            "type": "text"
          },
          "codes": {
            "type": "nested",
            "properties": {
              "id": {
                "type": "text",
                "analyzer": "my_analyzer"
              }
            }
          }
        }
      }
    }
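
    To see roughly which terms this tokenizer produces, here is a small Python approximation. It assumes that token_chars of letter and digit means the text is split on every other character, and that edge n-grams are then emitted per word; the function name edge_ngram_tokenizer is hypothetical and the regular expression is only a stand-in for Elasticsearch's character classes.

```python
import re


def edge_ngram_tokenizer(text, min_gram=2, max_gram=20):
    """Approximate an edge_ngram tokenizer restricted to letters and digits."""
    grams = []
    # Runs of letters/digits stand in for token_chars: ["letter", "digit"].
    for word in re.findall(r"[^\W_]+", text):
        upper = min(len(word), max_gram)
        for n in range(min_gram, upper + 1):
            grams.append(word[:n])
    return grams


print(edge_ngram_tokenizer("J15.0"))  # ['J1', 'J15']
print(edge_ngram_tokenizer("J16.0"))  # ['J1', 'J16']
```

    Note that the dot splits J15.0 into J15 and 0, and the single character 0 is dropped because it is shorter than min_gram. Under this approximation the J1 gram is shared by J15.0 and J16.0, and since this mapping defines no separate search_analyzer, the search text is n-grammed the same way.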
    

    Index data:

    {
        "group": "J00-I99", 
        "codes": [
            {
                "id": "J15",
                "description": "hello world"
            },
            {
                "id": "J15.0",
                "description": "test one world"
            },
            {
                "id": "J15.1",
                "description": "test two world J15.0"
            },
            {
                "id": "J15.2",
                "description": "test two three world J15"
            },
            {
                "id": "J15.3",
                "description": "hello world J18 "
            },
            {
                "id": "J15.9",
                "description": "hello world new"
            },
            {
                "id": "J16.0",
                "description": "new description"
            }
        ]
    }
    

    Search query:

    {
        "_source": {
            "excludes": [
                "codes"
            ]
        },
        "query": {
            "nested": {
                "path": "codes",
                "query": {
                    "bool": {
                        "should": [
                            {
                                "match": {
                                    "codes.description": "J15"
                                }
                            },
                            {
                                "match": {
                                    "codes.id": "J15"
                                }
                            }
                        ],
                        "must": {
                            "multi_match": {
                                "query": "test two",
                                "fields": [
                                    "codes.id",
                                    "codes.description"
                                ],
                                "type": "phrase"
                            }
                        }
                    }
                },
                "inner_hits": {}
            }
        }
    }
    

    Search results:

    "inner_hits": {
              "codes": {
                "hits": {
                  "total": {
                    "value": 2,
                    "relation": "eq"
                  },
                  "max_score": 3.2227304,
                  "hits": [
                    {
                      "_index": "stof_64170045",
                      "_type": "_doc",
                      "_id": "1",
                      "_nested": {
                        "field": "codes",
                        "offset": 3
                      },
                      "_score": 3.2227304,
                      "_source": {
                        "id": "J15.2",
                        "description": "test two three world J15"
                      }
                    },
                    {
                      "_index": "stof_64170045",
                      "_type": "_doc",
                      "_id": "1",
                      "_nested": {
                        "field": "codes",
                        "offset": 2
                      },
                      "_score": 2.0622847,
                      "_source": {
                        "id": "J15.1",
                        "description": "test two world J15.0"
                      }
                    }
                  ]
                }
              }
            }
          }
    

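    【Discussion】:

    • Please see my answer; the root cause here is different. His index settings are correct; the OP was not getting all the search results only because of the default size limit of 3.
    • @bhavya Could you explain the changes you made? That would be really helpful!
    • @Vishnu Do the search results given in my answer match your expected results?
    • @bhavya No, I could not match my expected results. It was as opster said: the inner_hits size.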