ElasticSearch Suggester full-text search: answers

【Question title】: ElasticSearch Suggester full-text search
【Posted】: 2021-01-24 14:40:47
【Question description】:

I am using django_elasticsearch_dsl.

My document:

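import django_elasticsearch_dsl
from django_elasticsearch_dsl import fields
from django_elasticsearch_dsl.fields import TextField
from elasticsearch_dsl import analyzer

html_strip = analyzer(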
    'html_strip',
    tokenizer='standard',
    filter=["lowercase", "stop", "snowball"],
    char_filter=["html_strip"]
)

class Document(django_elasticsearch_dsl.Document):
    name = TextField(
        analyzer=html_strip,
        fields={
            'raw': fields.KeywordField(),
            'suggest': fields.CompletionField(),
        }
    )
    ...

My request:

_search = Document.search().suggest("suggestions", text=query, completion={'field': 'name.suggest'}).execute()

I have indexed the following document "names":

"This is a test"
"this is my test"
"this test"
"Test this"

Now, if I search for This is my test, I receive only

"this is my test"

However, if I search for test, all I get is

"Test this"

even though I want every document whose name contains test.

What am I missing?

【Comments】:

Did you get a chance to go through my answers? I look forward to your feedback.

Tags: python django elasticsearch elasticsearch-dsl

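【Solution 1】:

Based on the comments from the user, I am adding another answer that uses n-grams.

Below is a working example with the index mapping, index data, search query, and search result.

Index mapping: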

{
  "settings": {
    "analysis": {
      "filter": {
        "ngram_filter": {
          "type": "ngram",
          "min_gram": 4,
          "max_gram": 20
        }
      },
      "analyzer": {
        "ngram_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "ngram_filter"
          ]
        }
      }
    },
    "max_ngram_diff": 50
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "ngram_analyzer",
        "search_analyzer": "standard"
      }
    }
  }
}

Index data:

{
  "name": [
    "Test this"
  ]
}

{
  "name": [
    "This is a test"
  ]
}

{
  "name": [
    "this is my test"
  ]
}

{
  "name": [
    "this test"
  ]
}

Analyze API:

POST /stof_64281341/_analyze

{
  "analyzer" : "ngram_analyzer",
  "text" : "this is my test"
}

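The following tokens are generated (is and my are shorter than min_gram, so the n-gram filter emits nothing for them):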

{
  "tokens": [
    {
      "token": "this",
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "test",
      "start_offset": 11,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
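To see why this analyzer makes every document match test, here is a minimal Python sketch that imitates it. It is only an approximation: the helper name ngram_analyze is hypothetical, and the \w+ regular expression is a rough stand-in for the standard tokenizer, not Elasticsearch's actual implementation.

```python
import re


def ngram_analyze(text, min_gram=4, max_gram=20):
    """Approximate ngram_analyzer: split into words, lowercase, emit n-grams."""
    terms = set()
    # \w+ is a rough stand-in for the standard tokenizer (an assumption).
    for word in re.findall(r"\w+", text.lower()):
        # Words shorter than min_gram yield an empty range, so they emit nothing.
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            for start in range(len(word) - size + 1):
                terms.add(word[start:start + size])
    return terms


names = ["This is a test", "this is my test", "this test", "Test this"]
index = {name: ngram_analyze(name) for name in names}

# The search analyzer is "standard", so the query "test" stays a single term.
matches = [name for name, terms in index.items() if "test" in terms]
print(matches)
```

Because the term test is produced for each of the four names regardless of where the word sits, a match query on test can reach all of them. Longer words would also contribute inner n-grams (for example, testing would yield test, esti, and so on).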

Search query:

{
    "query": {
        "match": {
           "name": "test"
        }
    }
}

Search result:

"hits": [
      {
        "_index": "stof_64281341",
        "_type": "_doc",
        "_id": "4",
        "_score": 0.2876821,
        "_source": {
          "name": [
            "Test this"
          ]
        }
      },
      {
        "_index": "stof_64281341",
        "_type": "_doc",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "name": [
            "this is my test"
          ]
        }
      },
      {
        "_index": "stof_64281341",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "name": [
            "This is a test"
          ]
        }
      },
      {
        "_index": "stof_64281341",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "name": [
            "this test"
          ]
        }
      }
    ]

For fuzzy search, you can use the following search query (tst is used in place of test):

{
  "query": {
    "fuzzy": {
      "name": {
        "value": "tst"
      }
    }
  }
}
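The reason tst can still find test is that the fuzzy query compares the given value against indexed terms within an allowed edit distance. The sketch below models that with a plain Levenshtein distance and the length thresholds documented for AUTO fuzziness. It is a simplification: Elasticsearch also counts a transposition of two adjacent characters as one edit by default, which this sketch does not model, and the function names are hypothetical.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute), row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def auto_fuzziness(term):
    """Allowed edits under AUTO: 0 for 1-2 chars, 1 for 3-5 chars, 2 beyond."""
    if len(term) <= 2:
        return 0
    if len(term) <= 5:
        return 1
    return 2


def fuzzy_match(value, indexed_terms):
    """Return the indexed terms within the allowed edit distance of value."""
    allowed = auto_fuzziness(value)
    return sorted(t for t in indexed_terms if levenshtein(value, t) <= allowed)


# The terms the n-gram analyzer produced for the sample names.
print(fuzzy_match("tst", {"this", "test"}))
```

With a three-character value, one edit is allowed, so tst reaches test (one insertion) but not this.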

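【Discussion】:

@Shezan Kazi I added another answer (adding it to the answer above would have made that answer too long) that uses n-grams for your use case. Please go through my answer and let me know if it resolves your issue.

【Solution 2】:

The completion suggester matches only on prefixes of the indexed input, which is why searching for test returns only "Test this". For suggestions that match in the middle of a field, an n-gram filter is the best option.

You can also use multiple suggesters: one that is prefix-based, and one that uses a regular expression to match in the middle of the field.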
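The difference between the two suggesters can be pictured with a small Python model. This is only an illustration of the matching rules, not how Elasticsearch implements them internally; lowercasing stands in for the analyzer, and the function names are hypothetical.

```python
import re

NAMES = ["This is a test", "this is my test", "this test", "Test this"]


def prefix_suggest(names, prefix):
    """Keep names whose lowercased text starts with the lowercased prefix."""
    needle = prefix.lower()
    return [name for name in names if name.lower().startswith(needle)]


def regex_suggest(names, pattern):
    """Keep names whose whole lowercased text matches the pattern."""
    compiled = re.compile(pattern)
    return [name for name in names if compiled.fullmatch(name.lower())]


print(prefix_suggest(NAMES, "test"))     # only the name that starts with "test"
print(regex_suggest(NAMES, ".*test.*"))  # every name that contains "test"
```

The prefix rule explains the behaviour in the question, and the leading and trailing .* in the pattern is what lets the regex variant reach names where test is not the first word.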

I am not familiar with django_elasticsearch_dsl, so I have added a working example with the index mapping, data, search query, and search result.

Index mapping:

{
  "mappings": {
    "properties": {
      "name": {
        "type": "completion"
      }
    }
  }
}

Index data:

{
  "name": {
    "input": ["Test this"]
  }
}
{
  "name": {
    "input": ["this is my test"]
  }
}
{
  "name": {
    "input": ["This is a test"]
  }
}
{
  "name": {
    "input": ["this test"]
  }
}

Search query:

    {
        "suggest": {
            "suggest-exact": {
                "prefix": "test",
                "completion": {
                    "field": "name",
                    "skip_duplicates": true
                }
            },
            "suggest-regex": {
                "regex": ".*test.*",
                "completion": {
                    "field": "name",
                    "skip_duplicates": true
                }
            }
        }
    }
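If the query text comes from user input, the request body above can be assembled in Python as a plain dict. The sketch below is one hedged way to do that. The helper names are hypothetical, the reserved-character list is an assumption based on Lucene's regular expression syntax, and lowercasing the query assumes the field's analyzer lowercases its input. How the body is sent (for example through a low-level client or a query-builder library) depends on your setup and is not shown.

```python
import json

# Characters escaped before embedding user text in the regex. This list is an
# assumption based on Lucene's regular expression syntax; adjust as needed.
RESERVED = set('.?+*|{}[]()"\\#@&<>~')


def escape_regex(text):
    """Backslash-escape characters that would otherwise act as operators."""
    return "".join("\\" + ch if ch in RESERVED else ch for ch in text)


def build_suggest_body(query, field="name"):
    """Build a body with one prefix suggester and one regex suggester."""
    completion = {"field": field, "skip_duplicates": True}
    return {
        "suggest": {
            "suggest-exact": {"prefix": query, "completion": dict(completion)},
            "suggest-regex": {
                # Assumes indexed input is lowercased by the analyzer.
                "regex": ".*" + escape_regex(query.lower()) + ".*",
                "completion": dict(completion),
            },
        }
    }


print(json.dumps(build_suggest_body("test"), indent=2))
```

For the mapping in the question, the field argument would be name.suggest rather than name.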

Search result:

"suggest": {
    "suggest-exact": [
      {
        "text": "test",
        "offset": 0,
        "length": 4,
        "options": [
          {
            "text": "Test this",
            "_index": "stof_64281341",
            "_type": "_doc",
            "_id": "4",
            "_score": 1.0,
            "_source": {
              "name": {
                "input": [
                  "Test this"
                ]
              }
            }
          }
        ]
      }
    ],
    "suggest-regex": [
      {
        "text": ".*test.*",
        "offset": 0,
        "length": 8,
        "options": [
          {
            "text": "Test this",
            "_index": "stof_64281341",
            "_type": "_doc",
            "_id": "4",
            "_score": 1.0,
            "_source": {
              "name": {
                "input": [
                  "Test this"
                ]
              }
            }
          },
          {
            "text": "This is a test",
            "_index": "stof_64281341",
            "_type": "_doc",
            "_id": "1",
            "_score": 1.0,
            "_source": {
              "name": {
                "input": [
                  "This is a test"
                ]
              }
            }
          },
          {
            "text": "this is my test",
            "_index": "stof_64281341",
            "_type": "_doc",
            "_id": "2",
            "_score": 1.0,
            "_source": {
              "name": {
                "input": [
                  "this is my test"
                ]
              }
            }
          },
          {
            "text": "this test",
            "_index": "stof_64281341",
            "_type": "_doc",
            "_id": "3",
            "_score": 1.0,
            "_source": {
              "name": {
                "input": [
                  "this test"
                ]
              }
            }
          }
        ]
      }
    ]
}

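【Discussion】:

@Shezan Kazi The query above works fine, but regular expressions are expensive to run. If you like, I can also provide a solution that uses n-grams. Please go through my answer and let me know if it resolves your issue.
This works like a charm. The problem is that elasticsearch-dsl does not support regex in search(). It would be great if you could post a solution that uses ngrams.
The only issue I can see now is that typos are not handled, because fuzzy is not an option for regex suggestions. Is there a workaround?
@ShezanKazi To handle typos, please see my other answer, and let me know if it helps you resolve the issue.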