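Elasticsearch context suggester stopwords

【Question title】: Elasticsearch context suggester stopwords
【Posted】: 2015-03-12 20:45:49
【Question】:

Is there a way to analyze the field that is passed to the context suggester? Say, for example, I have this in my mapping: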

mappings: {
    myitem: {
        title: {type: 'string'},
        content: {type: 'string'},
        user: {type: 'string', index: 'not_analyzed'},
        suggest_field: {
            type: 'completion',
            payloads: false,
            context: {
                user: {
                    type: 'category',
                    path: 'user'
                },
            }
        }
    }
}

And I index this document:

POST /myindex/myitem/1
{
    title: "The Post Title",
    content: ...,
    user: 123,
    suggest_field: {
        input: "The Post Title",
        context: {
            user: 123
        }
    }
}

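I would like the input to be analyzed first: split into individual words and run through lowercase and stopword filters, so that the context suggester actually receives: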

    suggest_field: {
        input: ["post", "title"],
        context: {
            user: 123
        }
    }

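I know I can pass an array to the suggest field, but I would like to avoid lowercasing the text, splitting it, and running a stopword filter in my application before passing it to ES. If possible, I would rather have ES do this for me. I did try adding an index_analyzer to the field mapping, but that did not seem to have any effect.

Alternatively, is there another way to get autocomplete suggestions for single words?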

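【Comments】:

Here is another approach: you can use ngrams (and then do all the analysis you want), although it is more involved: blog.qbox.io/…
That said, I think there is also a way to do what you want with the completion suggester (we wrote a blog post about that too: blog.qbox.io/…). I'll see if I can get what you're trying to do to work.
Sloan, that article is very good and helped me a lot when I was starting out with ES. However, as the article says: "typing 'disn' should return results containing 'Disney'". I don't want results containing "Disney". I want "Disney", and that's it! I don't care which result it belongs to.
Have you tried using a terms aggregation (or facets)?
Hmm, no. I had no idea such a thing existed! Let me check it out and get back to you.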

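Tags: elasticsearch autosuggest

【Solution 1】:

OK, so this is fairly involved, but I think it does more or less what you want. I'm not going to explain the whole thing, since that would take quite a while. I will say that I started with this blog post and added a stop token filter. The "title" field has sub-fields (formerly called multi_field) that use different analyzers, or none at all. The query contains a couple of terms aggregations. Also note that the aggregation results are filtered by the match query, so only results relevant to the text query are returned.

Here are the index settings (take some time to look them over; if you have specific questions I will try to answer them, but I encourage you to read the blog post first):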

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0,
      "analysis": {
         "filter": {
            "nGram_filter": {
               "type": "nGram",
               "min_gram": 2,
               "max_gram": 20,
               "token_chars": [
                  "letter",
                  "digit",
                  "punctuation",
                  "symbol"
               ]
            },
            "stop_filter": {
               "type": "stop"
            }
         },
         "analyzer": {
            "nGram_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "stop_filter",
                  "nGram_filter"
               ]
            },
            "whitespace_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "lowercase",
                  "asciifolding",
                  "stop_filter"
               ]
            },
            "stopword_only_analyzer": {
               "type": "custom",
               "tokenizer": "whitespace",
               "filter": [
                  "asciifolding",
                  "stop_filter"
               ]
            }
         }
      }
   },
   "mappings": {
      "doc": {
         "properties": {
            "title": {
               "type": "string",
               "index_analyzer": "nGram_analyzer",
               "search_analyzer": "whitespace_analyzer",
               "fields": {
                  "raw": {
                     "type": "string",
                     "index": "not_analyzed"
                  },
                  "stopword_only": {
                     "type": "string",
                     "analyzer": "stopword_only_analyzer"
                  }
               }
            }
         }
      }
   }
}
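To make the index settings above easier to follow, here is a rough Python simulation of how the two analyzers on "title" interact. It is only a sketch, not Elasticsearch code: the stopword set is a small assumed subset of the _english_ list, asciifolding is left out, and the function names are made up for illustration.

```python
# Assumed subset of the _english_ stopword list, for illustration only.
STOPWORDS = {"a", "an", "and", "of", "the"}


def ngrams(token, min_gram=2, max_gram=20):
    """Return every substring of token with a length from min_gram to max_gram."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


def index_terms(title):
    """Roughly mimic nGram_analyzer: whitespace split, lowercase, stop, n-grams."""
    terms = set()
    for token in title.lower().split():
        if token not in STOPWORDS:
            terms.update(ngrams(token))
    return terms


def search_terms(text):
    """Roughly mimic whitespace_analyzer: the same chain without n-grams."""
    return [t for t in text.lower().split() if t not in STOPWORDS]


def matches(title, query):
    """An "or" match: any search term equal to any indexed n-gram."""
    indexed = index_terms(title)
    return any(term in indexed for term in search_terms(query))
```

Because n-grams are produced only at index time, a partial word such as "mer" is compared as-is against the grams stored for "mermaid", which is why a prefix (or any inner fragment of at least two characters) finds the document.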

Then I added a few documents:

PUT /test_index/_bulk
{"index": {"_index":"test_index", "_type":"doc", "_id":1}}
{"title": "The Lion King"}
{"index": {"_index":"test_index", "_type":"doc", "_id":2}}
{"title": "Beauty and the Beast"}
{"index": {"_index":"test_index", "_type":"doc", "_id":3}}
{"title": "Alladin"}
{"index": {"_index":"test_index", "_type":"doc", "_id":4}}
{"title": "The Little Mermaid"}
{"index": {"_index":"test_index", "_type":"doc", "_id":5}}
{"title": "Lady and the Tramp"}

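Now I can search for documents by word prefix (or by complete word, capitalized or not) and use aggregations to return both the intact titles of the matching documents and the intact (non-lowercased) words, minus stopwords: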

POST /test_index/_search?search_type=count
{
    "query": {
      "match": {
         "title": {
            "query": "mer king",
            "operator": "or"
         }
      }
   }, 
    "aggs": {
        "word_tokens": {
            "terms": { "field": "title.stopword_only" }
        },
        "intact_titles": {
            "terms": { "field": "title.raw" }
        }
    }
}
...
{
   "took": 2,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0,
      "hits": []
   },
   "aggregations": {
      "intact_titles": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "The Lion King",
               "doc_count": 1
            },
            {
               "key": "The Little Mermaid",
               "doc_count": 1
            }
         ]
      },
      "word_tokens": {
         "doc_count_error_upper_bound": 0,
         "sum_other_doc_count": 0,
         "buckets": [
            {
               "key": "The",
               "doc_count": 2
            },
            {
               "key": "King",
               "doc_count": 1
            },
            {
               "key": "Lion",
               "doc_count": 1
            },
            {
               "key": "Little",
               "doc_count": 1
            },
            {
               "key": "Mermaid",
               "doc_count": 1
            }
         ]
      }
   }
}
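The word_tokens buckets in a response like the one above cover every token of every matching document, not only the words that start with what was typed. If only those words are wanted, one option is to narrow the buckets on the client. The sketch below assumes the response has already been parsed into Python data; the helper name is hypothetical and nothing here is part of the Elasticsearch API.

```python
def buckets_matching_prefixes(buckets, query):
    """Keep bucket keys that start with any whitespace-separated query prefix."""
    prefixes = [p.lower() for p in query.split()]
    return [
        bucket["key"]
        for bucket in buckets
        if any(bucket["key"].lower().startswith(p) for p in prefixes)
    ]


# The word_tokens buckets copied from the response shown above.
word_tokens = [
    {"key": "The", "doc_count": 2},
    {"key": "King", "doc_count": 1},
    {"key": "Lion", "doc_count": 1},
    {"key": "Little", "doc_count": 1},
    {"key": "Mermaid", "doc_count": 1},
]

completions = buckets_matching_prefixes(word_tokens, "mer king")
```

This keeps the query and mapping untouched, at the cost of shipping a few extra buckets to the client.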

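Notice that "The" is returned. This seems to be because the default _english_ stopwords only contain the lowercase "the", and stopword_only_analyzer does not lowercase tokens before the stop filter runs. I didn't immediately find a way around this.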
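The effect is easy to reproduce outside Elasticsearch. The following Python sketch only imitates the behaviour: the stopword set is an assumed subset of _english_, the function names are invented, and the ignore_case argument mirrors the option of the same name documented for the stop token filter (worth checking against the version in use before relying on it).

```python
# Assumed subset of the _english_ stopword list, for illustration only.
ENGLISH_STOPWORDS = {"a", "an", "and", "the"}


def stop_filter(tokens, ignore_case=False):
    """Drop stopwords; the comparison is case-sensitive unless ignore_case is set."""
    kept = []
    for token in tokens:
        candidate = token.lower() if ignore_case else token
        if candidate not in ENGLISH_STOPWORDS:
            kept.append(token)
    return kept


def stopword_only(text):
    """Whitespace split followed by a case-sensitive stop filter."""
    return stop_filter(text.split())


def lowercase_then_stop(text):
    """Whitespace split, lowercase, then the stop filter."""
    return stop_filter([token.lower() for token in text.split()])
```

With a case-sensitive comparison "The" survives while "the" and "and" are removed; lowercasing first removes both but also lowercases the remaining words, whereas a case-insensitive comparison would keep "Lion" and "King" capitalized.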

Here is the code I used:

http://sense.qbox.io/gist/2fbb8a16b2cd35370f5d5944aa9ea7381544be79

Let me know if this helps you solve your problem.

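【Comments】:

"The" vs. "the" is not a problem, since I can lowercase in stopword_only_analyzer. The problem is that for "kin" it returns lion (in addition to king, which is fine). For autocomplete, I only need the kin* words. The same goes for "merm": it also returns little, when I only need mermaid. For now I do this in two passes: first I send the text to /test_index/_analyze and get the tokens back, then I save those tokens as the suggester's input array. While this works, I would love to do it in one go.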

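The two-pass workaround described in the comment above can be sketched in Python as follows. Only the second pass is shown: the sample _analyze response is hand-written for illustration (not captured from a real cluster), and the helper names are hypothetical. The document shape follows the one from the question.

```python
def suggest_input_from_analyze(analyze_response):
    """Collect the unique tokens, in order, from a parsed _analyze response."""
    seen = set()
    tokens = []
    for entry in analyze_response.get("tokens", []):
        token = entry["token"]
        if token not in seen:
            seen.add(token)
            tokens.append(token)
    return tokens


def build_document(title, user, analyze_response):
    """Build the document body, using the analyzed tokens as suggester input."""
    return {
        "title": title,
        "user": user,
        "suggest_field": {
            "input": suggest_input_from_analyze(analyze_response),
            "context": {"user": user},
        },
    }


# Hand-written sample of what an analyzer with lowercase and stop filters
# might return for "The Post Title" (an assumption, not real output).
sample_response = {
    "tokens": [
        {"token": "post", "start_offset": 4, "end_offset": 8, "type": "word"},
        {"token": "title", "start_offset": 9, "end_offset": 14, "type": "word"},
    ]
}

document = build_document("The Post Title", 123, sample_response)
```

The first pass (sending the title to /test_index/_analyze) and the final indexing request are left to whatever HTTP client the application already uses.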
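【Solution 2】:

You can set up an analyzer that does this for you.

If you follow the tutorial called "You Complete Me", there is a section on stopwords.

Since that article was written, the way Elasticsearch works has changed. The standard analyzer no longer removes stopwords, so you need to use the stop analyzer instead.

Mapping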

curl -X DELETE localhost:9200/hotels
# Note: unlike in the article, index_analyzer and search_analyzer are both set to "stop".
curl -X PUT localhost:9200/hotels -d '
{
  "mappings": {
    "hotel" : {
      "properties" : {
        "name" : { "type" : "string" },
        "city" : { "type" : "string" },
        "name_suggest" : {
          "type" :            "completion",
          "index_analyzer" :  "stop",
          "search_analyzer" : "stop",
          "preserve_position_increments": false,
          "preserve_separators": false
        }
      } 
    }
  }
}'

Getting suggestions

curl -X POST localhost:9200/hotels/_suggest -d '
{
  "hotels" : {
    "text" : "m",
    "completion" : {
      "field" : "name_suggest"
    }
  }
}'

Hope this helps. I spent a long time looking for this answer myself.

【Comments】: