ElasticSearch: Suggestion Completion Multi Search

【Question title】: ElasticSearch: Suggestion Completion Multi Search
【Posted】: 2015-12-08 23:31:28
【Question】:

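I'm using the suggest API in ES with the completion suggester. My implementation works (code below), but I would like to search for multiple words in one query. In the example below, if I search for "word", it finds "wordpress" and outputs "Found". What I'm trying to accomplish is to query with something like "word blog magazine", which are all tags, and still get "Found" as the output. Any help would be appreciated!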

Mapping:

curl -XPUT "http://localhost:9200/test_index/" -d'
    {
   "mappings": {
      "product": {
         "properties": {
            "description": {
               "type": "string"
            },
            "tags": {
               "type": "string"
            },
            "title": {
               "type": "string"
            },
            "tag_suggest": {
               "type": "completion",
               "index_analyzer": "simple",
               "search_analyzer": "simple",
               "payloads": false
            }
         }
      }
   }
}'

Add a document:

curl -XPUT "http://localhost:9200/test_index/product/1" -d'
    {
   "title": "Product1",
   "description": "Product1 Description",
   "tags": [
      "blog",
      "magazine",
      "responsive",
      "two columns",
      "wordpress"
   ],
   "tag_suggest": {
      "input": [
         "blog",
         "magazine",
         "responsive",
         "two columns",
         "wordpress"
      ],
      "output": "Found"
   }
}'

_suggest query:

curl -XPOST "http://localhost:9200/test_index/_suggest" -d'
    {
    "product_suggest":{
        "text":"word",
        "completion": {
            "field" : "tag_suggest"
        }
    }
}'
The results are as we would expect:
    {
    "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "product_suggest": [
      {
         "text": "word",
         "offset": 0,
         "length": 4,
         "options": [
            {
               "text": "Found",
               "score": 1
            }
         ]
      }
   ]
}
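
For context on why the multi-word query comes back empty: the completion suggester matches the query text as a prefix of the indexed inputs, so "word" is a prefix of "wordpress", but "word blog magazine" is not a prefix of any single input. The sketch below imitates that behaviour in plain Python. It is a toy model only: `suggest` is a hypothetical helper, not the Elasticsearch API, and it ignores analysis details beyond lowercasing.

```python
# Toy model of prefix-based completion; NOT the real Elasticsearch implementation.
# The inputs and output are copied from the document indexed above.
INPUTS = ["blog", "magazine", "responsive", "two columns", "wordpress"]
OUTPUT = "Found"


def suggest(text, inputs=INPUTS, output=OUTPUT):
    """Return [output] if the whole query text is a prefix of any input."""
    query = text.lower()
    if any(candidate.lower().startswith(query) for candidate in inputs):
        return [output]
    return []


print(suggest("word"))                # ['Found']
print(suggest("word blog magazine"))  # []
```

Under this model a multi-word query would only match an input that itself starts with those words in that order, which is why the question below turns to n-grams.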

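【Comments】:

Would you be willing to use an ngram solution instead of the completion suggester?
Actually, I implemented edge ngrams with fuzziness before, but my scores were all messed up, and I was advised to use the suggest API for faster queries over large amounts of data. What are your thoughts on the two? A key requirement for me is multiple search terms separated by spaces.
That last part is easy with the ngram solution. I'm not sure about the scoring, though. And I'm not sure about doing a multi-term completion suggest; I'd have to look into it. I assume you want an OR search, not AND, right?
Which part are you referring to? I'll give you an example to help explain my requirement. I have a user type with first, middle, last and dob fields. I want a search where a single query can handle any number of the fields in any order. For example: last, first, dob, middle; or first, dob; or just dob, and it returns the user. The front end will be a single text box for entering the information. Thanks a lot for your help!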

Tags: search elasticsearch autocomplete search-suggestion

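【Solution 1】:

If you're willing to switch to edge ngrams (or full ngrams, if you need them), I think it will solve your problem.

I wrote up a fairly detailed explanation of how to do this in this blog post: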

https://qbox.io/blog/an-introduction-to-ngrams-in-elasticsearch

But I'll give you a quick and dirty version here. The trick is to use ngrams together with the _all field and a match query with the and operator.

So with this mapping:

PUT /test_index
{
   "settings": {
      "analysis": {
         "filter": {
            "ngram_filter": {
               "type": "edge_ngram",
               "min_gram": 2,
               "max_gram": 20
            }
         },
         "analyzer": {
            "ngram_analyzer": {
               "type": "custom",
               "tokenizer": "standard",
               "filter": [
                  "lowercase",
                  "ngram_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "_all": {
            "type": "string",
            "analyzer": "ngram_analyzer",
            "search_analyzer": "standard"
         },
         "properties": {
            "word": {
               "type": "string",
               "include_in_all": true
            },
            "definition": {
               "type": "string",
               "include_in_all": true
            }
         }
      }
   }
}

And a few documents:

PUT /test_index/_bulk
{"index":{"_index":"test_index","_type":"doc","_id":1}}
{"word":"democracy", "definition":"government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."}
{"index":{"_index":"test_index","_type":"doc","_id":2}}
{"word":"republic", "definition":"a state in which the supreme power rests in the body of citizens entitled to vote and is exercised by representatives chosen directly or indirectly by them."}
{"index":{"_index":"test_index","_type":"doc","_id":3}}
{"word":"oligarchy", "definition":"a form of government in which all power is vested in a few persons or in a dominant class or clique; government by the few."}
{"index":{"_index":"test_index","_type":"doc","_id":4}}
{"word":"plutocracy", "definition":"the rule or power of wealth or of the wealthy."}
{"index":{"_index":"test_index","_type":"doc","_id":5}}
{"word":"theocracy", "definition":"a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."}
{"index":{"_index":"test_index","_type":"doc","_id":6}}
{"word":"monarchy", "definition":"a state or nation in which the supreme power is actually or nominally lodged in a monarch."}
{"index":{"_index":"test_index","_type":"doc","_id":7}}
{"word":"capitalism", "definition":"an economic system in which investment in and ownership of the means of production, distribution, and exchange of wealth is made and maintained chiefly by private individuals or corporations, especially as contrasted to cooperatively or state-owned means of wealth."}
{"index":{"_index":"test_index","_type":"doc","_id":8}}
{"word":"socialism", "definition":"a theory or system of social organization that advocates the vesting of the ownership and control of the means of production and distribution, of capital, land, etc., in the community as a whole."}
{"index":{"_index":"test_index","_type":"doc","_id":9}}
{"word":"communism", "definition":"a theory or system of social organization based on the holding of all property in common, actual ownership being ascribed to the community as a whole or to the state."}
{"index":{"_index":"test_index","_type":"doc","_id":10}}
{"word":"feudalism", "definition":"the feudal system, or its principles and practices."}
{"index":{"_index":"test_index","_type":"doc","_id":11}}
{"word":"monopoly", "definition":"exclusive control of a commodity or service in a particular market, or a control that makes possible the manipulation of prices."}
{"index":{"_index":"test_index","_type":"doc","_id":12}}
{"word":"oligopoly", "definition":"the market condition that exists when there are few sellers, as a result of which they can greatly influence price and other market factors."}

I can apply partial matching across both fields (this works with any number of fields) like this:

POST /test_index/_search
{
    "query": {
        "match": {
           "_all": {
               "query": "theo go",
               "operator": "and"
           }
        }
    }
}

which in this case returns:

{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 1,
      "max_score": 0.7601639,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "5",
            "_score": 0.7601639,
            "_source": {
               "word": "theocracy",
               "definition": "a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."
            }
         }
      ]
   }
}
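
To make it easier to see why only document 5 comes back, here is a rough Python imitation of what the mapping does. It is a sketch under stated assumptions, not Elasticsearch itself: `standard_tokens` is only an approximation of the standard tokenizer plus the lowercase filter, `match_all_and` is a hypothetical helper, and scoring is ignored entirely. Three of the documents above are reused.

```python
import re

# Three of the documents indexed above, keyed by _id.
DOCS = {
    1: {"word": "democracy", "definition": "government by the people; a form of government in which the supreme power is vested in the people and exercised directly by them or by their elected agents under a free electoral system."},
    4: {"word": "plutocracy", "definition": "the rule or power of wealth or of the wealthy."},
    5: {"word": "theocracy", "definition": "a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."},
}


def standard_tokens(text):
    """Rough stand-in for the standard tokenizer followed by lowercase."""
    return re.findall(r"\w+", text.lower())


def edge_ngrams(token, min_gram=2, max_gram=20):
    """Prefixes of the token, min_gram to max_gram characters long."""
    return [token[:n] for n in range(min_gram, min(len(token), max_gram) + 1)]


def index_terms(doc):
    """Index-time terms of _all: edge n-grams of every token in every field."""
    terms = set()
    for value in doc.values():
        for token in standard_tokens(value):
            terms.update(edge_ngrams(token))
    return terms


def match_all_and(query, docs):
    """IDs of docs whose _all terms contain every query token (operator: and)."""
    wanted = standard_tokens(query)  # search side: no n-grams are generated
    return [doc_id for doc_id, doc in docs.items()
            if all(term in index_terms(doc) for term in wanted)]


print(edge_ngrams("theocracy")[:3])       # ['th', 'the', 'theo']
print(match_all_and("theo go", DOCS))     # [5]
```

In this model "go" on its own also matches document 1 (through "government"), but the and operator requires "theo" as well, and only "theocracy" produces that prefix.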

Here is the code I used (there is more in the blog post):

http://sense.qbox.io/gist/e4093c25a8257499f54ced5a09f35b1eb48e4e3c

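Hope that helps.

【Discussion】:

Thanks, I had actually seen your blog before and I think it's great! In your opinion, why would you lean towards the n-gram route rather than the suggest API for this case? Have you also seen scoring get out of whack when combining n-grams with fuzziness?
I like ngrams because you don't need redundant data. With large data sets that can become significant. Scoring is definitely an issue. My feeling is that there are ways around it, but I don't know how offhand.
Thanks. Why did you use "analyzer": "ngram_analyzer", "search_analyzer": "standard", rather than just "analyzer": "ngram_analyzer"?
Because you don't want your search terms analyzed the same way as the document text. You would end up with a lot of irrelevant results, because all the ngrams of your search term would also match ngrams from many other documents, potentially at least. See the "index_analyzer vs. search_analyzer" section of the blog post for more. (Also, "index_analyzer" is not used in 2.0; only "analyzer" is.)
Ah, okay! That makes sense, since I was getting an error and I'm using 2.1. Thanks for your help :)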
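
The point about "search_analyzer" can be illustrated with a small Python model. This is a sketch, not Elasticsearch: the tokenizer is approximated with a regular expression, `search` is a hypothetical helper, and it uses "or" semantics (the match query's default operator), where the effect is easiest to see. Three of the answer's documents are reused.

```python
import re

# Three of the documents from the answer, keyed by _id.
DOCS = {
    4: {"word": "plutocracy", "definition": "the rule or power of wealth or of the wealthy."},
    5: {"word": "theocracy", "definition": "a form of government in which God or a deity is recognized as the supreme civil ruler, the God's or deity's laws being interpreted by the ecclesiastical authorities."},
    10: {"word": "feudalism", "definition": "the feudal system, or its principles and practices."},
}


def analyze_standard(text):
    """Rough stand-in for the standard analyzer: lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def analyze_ngram(text, min_gram=2, max_gram=20):
    """Rough stand-in for ngram_analyzer: edge n-grams of each token."""
    return [token[:n]
            for token in analyze_standard(text)
            for n in range(min_gram, min(len(token), max_gram) + 1)]


def index_terms(doc):
    """Terms stored at index time: every field goes through ngram_analyzer."""
    terms = set()
    for value in doc.values():
        terms.update(analyze_ngram(value))
    return terms


def search(query, docs, query_analyzer):
    """IDs of docs sharing at least one term with the analyzed query."""
    query_terms = set(query_analyzer(query))
    return [doc_id for doc_id, doc in docs.items()
            if query_terms & index_terms(doc)]


print(search("theo", DOCS, analyze_standard))  # [5]
print(search("theo", DOCS, analyze_ngram))     # [4, 5, 10]
```

When the query itself is split into n-grams, "theo" also becomes "th" and "the", and those short prefixes match any document containing the word "the". Keeping the standard analyzer on the search side leaves "theo" as a single term.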