【Question title】: Elasticsearch Similar Text Query
【Posted】: 2020-07-17 06:20:47
【Question】:

Given the following documents in an index (let's call it addresses):

{
    ADDRESS: {
        ID: 1,
        LINE1: "steet 1",
        CITY: "kuala lumpur",
        COUNTRY: "MALAYSIA",
        ...
    } 
}
{
    ADDRESS: {
        ID: 2,
        LINE1: "steet 1",
        CITY: "kualalumpur city",
        COUNTRY: "MALAYSIA",
        ...
    }
}
{
    ADDRESS: {
        ID: 3,
        LINE1: "steet 1",
        CITY: "kualalumpur",        
        COUNTRY: "MALAYSIA",
        ...
    }
}
{
    ADDRESS: {
        ID: 4,
        LINE1: "steet 1",
        CITY: "kuala lumpur city",      
        COUNTRY: "MALAYSIA",
        ...
    }
}

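So far I have a query that, for the search text "kualalumpur", returns "kuala lumpur", "kualalumpur" and "kualalumpur city".
However, "kuala lumpur city" is missing from the results, even though it is almost identical to "kualalumpur city".

Here is my current query: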

{
  "query": {
    "bool": {
      "should": [
          {"match": {"ADDRESS.STREET": {"query": "street 1", "fuzziness": 1, "operator": "AND"}}},
          {
            "bool": {
              "should": [
                {"match": {"ADDRESS.CITY": {"query": "kualalumpur", "fuzziness": 1, "operator": "OR"}}},
                {"match": {"ADDRESS.CITY.keyword": {"query": "kualalumpur", "fuzziness": 1, "operator": "OR"}}}
              ]
            }
          }
        ],
      "filter": {
        "bool": {
          "must": [
            {"term": {"ADDRESS.COUNTRY.keyword": "MALAYSIA"}}
          ]
        }
      },
      "minimum_should_match": 2
    }
  }
}

Given these conditions, is it possible for Elasticsearch to return all four documents for the search text "kualalumpur"?
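
To see why document 4 drops out, it can help to redo the fuzzy comparison by hand. The sketch below is plain Python, not Elasticsearch code. It assumes ADDRESS.CITY is split on whitespace (roughly what the standard analyzer does for these values) and that ADDRESS.CITY.keyword holds the whole value as a single term. It uses plain Levenshtein distance, whereas Elasticsearch by default also counts a transposition as one edit; that difference does not change the outcome for these four values. The names levenshtein, CITIES and fuzzy_hits are made up for this illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


# CITY values of the four documents in the question, keyed by ID.
CITIES = {
    1: "kuala lumpur",
    2: "kualalumpur city",
    3: "kualalumpur",
    4: "kuala lumpur city",
}


def fuzzy_hits(query: str, max_edits: int = 1) -> dict:
    """Approximate which documents the two CITY clauses would match."""
    hits = {}
    for doc_id, city in CITIES.items():
        # Text field (assumption): each whitespace-separated word is a term.
        by_token = any(levenshtein(query, tok) <= max_edits for tok in city.split())
        # Keyword subfield (assumption): the whole value is one term.
        by_keyword = levenshtein(query, city) <= max_edits
        hits[doc_id] = by_token or by_keyword
    return hits


print(fuzzy_hits("kualalumpur"))
# {1: True, 2: True, 3: True, 4: False}
```

Under these assumptions, document 1 matches only through the keyword subfield (one inserted space), documents 2 and 3 match through the "kualalumpur" token, and document 4 is more than one edit away on both fields.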

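【Comments】:

  • It would be great if you could let me know whether it solved your problem.
  • Hey! It did! Thanks. One follow-up question: what is the advantage of choosing edge n-gram over n-gram in this case?
  • Glad it helped, and thanks for the upvote and for accepting the answer. You need a prefix-style search, e.g. kualalumpur rather than ualal, which would be an infix search and is expensive. Edge n-gram also creates far fewer tokens, so it performs better and suits your use case.
  • Please go through my detailed answer at stackoverflow.com/a/60584211/4039431, where I have also linked my blog. If you like the answer, don't forget to upvote :)
  • Awesome! Thanks again :D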
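
The trade-off mentioned in the comments above can be made concrete by counting tokens. This is a rough Python sketch of what the two tokenizers emit for a single word, not the Elasticsearch implementation; edge_ngrams and ngrams are hypothetical helper names, and the min_gram/max_gram defaults simply mirror the values used in the answer.

```python
def edge_ngrams(word: str, min_gram: int = 1, max_gram: int = 40) -> list:
    """Prefixes only: every token starts at the first character."""
    longest = min(max_gram, len(word))
    return [word[:n] for n in range(min_gram, longest + 1)]


def ngrams(word: str, min_gram: int = 1, max_gram: int = 40) -> list:
    """Every substring whose length lies between min_gram and max_gram."""
    longest = min(max_gram, len(word))
    return [
        word[start:start + n]
        for n in range(min_gram, longest + 1)
        for start in range(len(word) - n + 1)
    ]


word = "kualalumpur"
print(len(edge_ngrams(word)))        # 11
print(len(ngrams(word)))             # 66
print("ualal" in ngrams(word))       # True  (infix token exists)
print("ualal" in edge_ngrams(word))  # False (prefixes only)
```

For this 11-character word the edge n-gram variant produces 11 tokens against 66 for the full n-gram variant, and the gap widens with word length, which is the index-size and performance argument made in the comment.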

Tags: elasticsearch fuzzy-search elasticsearch-query


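【Solution 1】:

You can use an edge n-gram tokenizer on the field to get all four documents. In the working example below, which I tried locally, the field is called country and holds the city values.

Create a custom analyzer and apply it to your field: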

{
    "settings": {
        "index": {
            "analysis": {
                "analyzer": {
                    "ngram_analyzer": {
                        "type": "custom",
                        "filter": [
                            "lowercase"
                        ],
                        "tokenizer": "edgeNGramTokenizer"
                    }
                },
                "tokenizer": {
                    "edgeNGramTokenizer": {
                        "token_chars": [
                            "letter",
                            "digit"
                        ],
                        "min_gram": "1",
                        "type": "edge_ngram",
                        "max_gram": "40"
                    }
                }
            },
            "max_ngram_diff": "50"
        }
    },
    "mappings": {
        "properties": {
            "country": {
                "type": "text",
                "analyzer" : "ngram_analyzer"
            }
        }
    }
}

Index all four sample documents, for example:

{
  "country" : "kuala lumpur"
}

A search query for the term kualalumpur matches all four documents:

{
    "query": {
        "match" : {
            "country" : "kualalumpur"
        }
    }
}

 "hits": [
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "3",
        "_score": 5.0003963,
        "_source": {
          "country": "kualalumpur"
        }
      },
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "2",
        "_score": 4.4082437,
        "_source": {
          "country": "kualalumpur city"
        }
      },
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.5621849,
        "_source": {
          "country": "kuala lumpur"
        }
      },
      {
        "_index": "fuzzy",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.4956103,
        "_source": {
          "country": "kuala lumpur city"
        }
      }
    ]
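
Why all four documents match, and why documents 3 and 2 score well above 1 and 4, can be approximated by looking at which tokens the query and each document share. The sketch below imitates the ngram_analyzer defined above in plain Python (split on anything that is not a letter or digit, lowercase, emit prefixes). It assumes the same analyzer is applied at search time, since the mapping sets no separate search_analyzer, and that the match query uses its default OR operator. The names analyze, DOCS and shared_tokens are made up for this illustration, and real scoring involves more than counting shared tokens.

```python
import re


def analyze(text: str, min_gram: int = 1, max_gram: int = 40) -> list:
    """Rough imitation of the answer's ngram_analyzer."""
    tokens = []
    # token_chars = letter, digit: any other character ends a word.
    for word in re.findall(r"[^\W_]+", text):
        word = word.lower()
        longest = min(max_gram, len(word))
        tokens.extend(word[:n] for n in range(min_gram, longest + 1))
    return tokens


# Values indexed into the "country" field in the answer's example.
DOCS = {
    1: "kuala lumpur",
    2: "kualalumpur city",
    3: "kualalumpur",
    4: "kuala lumpur city",
}


def shared_tokens(query: str, value: str) -> list:
    """Tokens produced by both the query and the indexed value."""
    return sorted(set(analyze(query)) & set(analyze(value)), key=len)


for doc_id, value in DOCS.items():
    print(doc_id, len(shared_tokens("kualalumpur", value)))
# 1 5
# 2 11
# 3 11
# 4 5

# Caveat: with min_gram = 1, a value that only shares the first letter
# also has a token in common with the query.
print(shared_tokens("kualalumpur", "kedah"))  # ['k']
```

Documents 1 and 4 match through the prefixes "k" to "kuala", while 2 and 3 share all eleven prefixes, which is consistent with the score gap in the hits above. The last line shows the cost of this approach: under the stated assumptions, any value beginning with "k" would also be returned, so a larger min_gram or an AND operator may be worth testing if the results are too broad.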

 
