【Question title】: django-haystack autocomplete returns too wide results
【Posted】: 2015-03-12 11:37:09
【Question】:

I created an index with a title_auto field:

from haystack import indexes

class GameIndex(indexes.SearchIndex, indexes.Indexable):
    text = indexes.CharField(document=True, model_attr='title')
    title = indexes.CharField(model_attr='title')
    title_auto = indexes.NgramField(model_attr='title')

The Elasticsearch settings look like this:

ELASTICSEARCH_INDEX_SETTINGS = {
    'settings': {
        "analysis": {
            "analyzer": {
                "ngram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_ngram"],
                    "token_chars": ["letter", "digit"]
                },
                "edgengram_analyzer": {
                    "type": "custom",
                    "tokenizer": "lowercase",
                    "filter": ["haystack_edgengram"]
                }
            },
            "tokenizer": {
                "haystack_ngram_tokenizer": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 15,
                },
                "haystack_edgengram_tokenizer": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15,
                    "side": "front"
                }
            },
            "filter": {
                "haystack_ngram": {
                    "type": "nGram",
                    "min_gram": 1,
                    "max_gram": 15
                },
                "haystack_edgengram": {
                    "type": "edgeNGram",
                    "min_gram": 1,
                    "max_gram": 15
                }
            }
        }
    }
}

I tried running an autocomplete search. It works, but it returns too many irrelevant results:

qs = SearchQuerySet().models(Game).autocomplete(title_auto=search_phrase)

qs = SearchQuerySet().models(Game).filter(title_auto=search_phrase)

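Both of them produce the same output.

If search_phrase is "monopoly", the first results contain "Monopoly" in their titles. However, although there are only 2 relevant items, 51 are returned. The others have nothing to do with "monopoly" at all.

So my question is: how do I change the relevance of the results?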

【Comments】:

    Tags: django autocomplete elasticsearch django-haystack


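    【Solution 1】:

    It's hard to say for sure since I haven't seen your full mapping, but I suspect the problem is that the analyzer (one of them) is being used for both indexing and searching. When you index a document, many ngram terms are created and indexed. If your search text is analyzed the same way, many search terms are generated as well. Since your smallest ngram is a single letter, nearly any query will match a lot of documents.

    We wrote a blog post about using ngrams for autocomplete (http://blog.qbox.io/multi-field-partial-word-autocomplete-in-elasticsearch-using-ngrams) that you might find helpful. But I'll give you a simpler example to illustrate what I mean. I'm not very familiar with haystack, so I may not be able to help you there, but I can explain the ngram issue in Elasticsearch.

    First, I'll set up an index that uses an ngram analyzer for both indexing and searching: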

    PUT /test_index
    {
       "settings": {
           "number_of_shards": 1,
          "analysis": {
             "filter": {
                "nGram_filter": {
                   "type": "nGram",
                   "min_gram": 1,
                   "max_gram": 15,
                   "token_chars": [
                      "letter",
                      "digit",
                      "punctuation",
                      "symbol"
                   ]
                }
             },
             "analyzer": {
                "nGram_analyzer": {
                   "type": "custom",
                   "tokenizer": "whitespace",
                   "filter": [
                      "lowercase",
                      "asciifolding",
                      "nGram_filter"
                   ]
                }
             }
          }
       },
       "mappings": {
            "doc": {
                "properties": {
                    "title": {
                        "type": "string", 
                        "analyzer": "nGram_analyzer"
                    }
                }
            }
       }
    }
    

    And add some documents:

    PUT /test_index/_bulk
    {"index":{"_index":"test_index","_type":"doc","_id":1}}
    {"title":"monopoly"}
    {"index":{"_index":"test_index","_type":"doc","_id":2}}
    {"title":"oligopoly"}
    {"index":{"_index":"test_index","_type":"doc","_id":3}}
    {"title":"plutocracy"}
    {"index":{"_index":"test_index","_type":"doc","_id":4}}
    {"title":"theocracy"}
    {"index":{"_index":"test_index","_type":"doc","_id":5}}
    {"title":"democracy"}
    

    Then run a simple match search for "poly":

    POST /test_index/_search
    {
        "query": {
            "match": {
               "title": "poly"
            }
        }
    }
    

    It returns all five documents:

    {
       "took": 3,
       "timed_out": false,
       "_shards": {
          "total": 1,
          "successful": 1,
          "failed": 0
       },
       "hits": {
          "total": 5,
          "max_score": 4.729521,
          "hits": [
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "2",
                "_score": 4.729521,
                "_source": {
                   "title": "oligopoly"
                }
             },
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "1",
                "_score": 4.3608603,
                "_source": {
                   "title": "monopoly"
                }
             },
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "3",
                "_score": 1.0197333,
                "_source": {
                   "title": "plutocracy"
                }
             },
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "4",
                "_score": 0.31496215,
                "_source": {
                   "title": "theocracy"
                }
             },
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "5",
                "_score": 0.31496215,
                "_source": {
                   "title": "democracy"
                }
             }
          ]
       }
    }
    

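    This is because the search term "poly" gets tokenized into the terms "p", "o", "l", and "y" (along with longer ngrams such as "po" and "pol"). Since the "title" field of every document was also tokenized into single-letter terms, the query matches every document.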
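
To make the effect concrete, here is a rough pure-Python simulation of what happens when the same ngram analysis is applied at index time and at search time. It is only a sketch: the `ngrams` helper is a hypothetical stand-in for the nGram filter (min_gram 1, max_gram 15), and real Elasticsearch scoring is not modeled at all.

```python
def ngrams(token, min_gram=1, max_gram=15):
    """Return every substring of `token` with a length in [min_gram, max_gram]."""
    token = token.lower()
    grams = set()
    for start in range(len(token)):
        for size in range(min_gram, max_gram + 1):
            if start + size > len(token):
                break
            grams.add(token[start:start + size])
    return grams


titles = ["monopoly", "oligopoly", "plutocracy", "theocracy", "democracy"]

# Index time: every title is expanded into its ngrams.
index = {title: ngrams(title) for title in titles}

# Search time with the SAME analyzer: the query is expanded too.
query_terms = ngrams("poly")

# A match query ORs its terms, so a single shared ngram is enough for a hit.
hits = [title for title in titles if index[title] & query_terms]

print(sorted(query_terms))
print(hits)
```

Every title shares at least one single-letter ngram such as "o" or "y" with the query, so all five come back.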

    If we instead rebuild the index with this mapping (same analyzer and documents):

    "mappings": {
      "doc": {
         "properties": {
            "title": {
               "type": "string",
               "index_analyzer": "nGram_analyzer",
               "search_analyzer": "standard"
            }
         }
      }
    }
    

    the query returns what we expect:

    POST /test_index/_search
    {
        "query": {
            "match": {
               "title": "poly"
            }
        }
    }
    ...
    {
       "took": 1,
       "timed_out": false,
       "_shards": {
          "total": 1,
          "successful": 1,
          "failed": 0
       },
       "hits": {
          "total": 2,
          "max_score": 1.5108256,
          "hits": [
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "1",
                "_score": 1.5108256,
                "_source": {
                   "title": "monopoly"
                }
             },
             {
                "_index": "test_index",
                "_type": "doc",
                "_id": "2",
                "_score": 1.5108256,
                "_source": {
                   "title": "oligopoly"
                }
             }
          ]
       }
    }
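
The asymmetric setup can be sketched in plain Python as well. This is an illustration under assumptions, not Elasticsearch itself: `ngrams` again stands in for the index-time nGram filter, and `search_whole_terms` is a hypothetical approximation of a search analyzer that only lowercases and splits on whitespace.

```python
def ngrams(token, min_gram=1, max_gram=15):
    """Index-time analysis: all substrings with a length in [min_gram, max_gram]."""
    token = token.lower()
    return {
        token[start:start + size]
        for start in range(len(token))
        for size in range(min_gram, max_gram + 1)
        if start + size <= len(token)
    }


titles = ["monopoly", "oligopoly", "plutocracy", "theocracy", "democracy"]
index = {title: ngrams(title) for title in titles}


def search_whole_terms(query):
    """Search-time analysis: lowercase and split only, no ngram expansion."""
    terms = query.lower().split()
    return [t for t in titles if any(term in index[t] for term in terms)]


print(search_whole_terms("poly"))
```

Because the query term stays whole, only titles whose indexed ngrams include the full string "poly" are returned.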
    

    Edge ngrams work similarly, except that only terms starting at the beginning of the word are used.
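
A small sketch of the edge ngram case, under the same caveats as before: `edge_ngrams` is a hypothetical helper imitating an edgeNGram filter anchored at the front of the word, and the query is matched as a whole term.

```python
def edge_ngrams(token, min_gram=1, max_gram=15):
    """Return only the prefixes of `token` with a length in [min_gram, max_gram]."""
    token = token.lower()
    longest = min(len(token), max_gram)
    return {token[:size] for size in range(min_gram, longest + 1)}


titles = ["monopoly", "oligopoly", "plutocracy", "theocracy", "democracy"]
index = {title: edge_ngrams(title) for title in titles}


def search(query):
    """Match the whole (lowercased) query against the indexed prefixes."""
    return [t for t in titles if query.lower() in index[t]]


print(search("mono"))  # prefix of "monopoly"
print(search("poly"))  # not a prefix of any title
```

With prefixes only, "mono" finds "monopoly", while "poly" finds nothing because it never appears at the start of a title. That behavior is usually what search-as-you-type needs.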

    Here is the code I used for this example:

    http://sense.qbox.io/gist/b24cbc531b483650c085a42963a49d6a23fa5579

    【Comments】:

    • Thanks, this explained a lot to me.
    【Solution 2】:

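    Unfortunately, there currently seems to be no way (short of implementing a custom backend) to configure the search analyzer and the index analyzer separately through Django-Haystack. If Django-Haystack autocomplete returns results that are too broad, you can use the score value provided with each search result to refine the output.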

    if search_query != "":
        # Use autocomplete query or filter
        # with results_filtered being a SearchQuerySet()
        results_filtered = results_filtered.filter(text=search_query)

    # Remove objects with a low score
    for result in results_filtered:
        if result.score < SEARCH_SCORE_THRESHOLD:
            results_filtered = results_filtered.exclude(id=result.id)
    

    This worked well for me, without having to define my own backend and schema building.
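
The same score cutoff can also be applied in a single pass over results that have already been fetched. The sketch below is hedged accordingly: `FakeResult` is a hypothetical stand-in for the search result objects (only a `score` attribute is assumed, as in the snippet above), the threshold value is made up, and the scores are copied from the first example response in Solution 1.

```python
from collections import namedtuple

# Hypothetical stand-in for a search result; only `.score` is relied upon.
FakeResult = namedtuple("FakeResult", ["title", "score"])

SEARCH_SCORE_THRESHOLD = 1.5  # assumed value; tune it against real scores


def drop_low_scores(results, threshold=SEARCH_SCORE_THRESHOLD):
    """Keep only the results whose score reaches the threshold."""
    return [result for result in results if result.score >= threshold]


results = [
    FakeResult("oligopoly", 4.729521),
    FakeResult("monopoly", 4.3608603),
    FakeResult("plutocracy", 1.0197333),
    FakeResult("theocracy", 0.31496215),
    FakeResult("democracy", 0.31496215),
]

kept = drop_low_scores(results)
print([result.title for result in kept])
```

Filtering in Python returns a plain list rather than a queryset, which is a trade-off. Scores also depend on the query and the index contents, so a fixed threshold may need adjusting per use case.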

    【Comments】:
