Elasticsearch More Like This Query: Answers

【Question title】: Elasticsearch More Like This Query
【Posted】: 2015-04-03 04:43:27
【Question】:

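I'm trying to wrap my head around how the more like this query works, but I seem to be missing something. I read the documentation, but the ES documentation is often a bit... lacking.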

The goal is to be able to limit results by term frequency, as attempted here.

So I set up a simple index, with term vectors enabled for debugging, and then added two simple documents.

DELETE /test_index

PUT /test_index
{
   "settings": {
      "number_of_shards": 1,
      "number_of_replicas": 0
   },
   "mappings": {
      "doc": {
         "properties": {
            "text": {
               "type": "string",
               "term_vector": "yes"
            }
         }
      }
   }
}

PUT /test_index/doc/1
{
    "text": "apple, apple, apple, apple, apple"
}

PUT /test_index/doc/2
{
    "text": "apple, apple"
}

When I look at the term vectors, I see what I expect:

GET /test_index/doc/1/_termvector
...
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "1",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text": {
         "field_statistics": {
            "sum_doc_freq": 2,
            "doc_count": 2,
            "sum_ttf": 7
         },
         "terms": {
            "apple": {
               "term_freq": 5
            }
         }
      }
   }
}

GET /test_index/doc/2/_termvector
{
   "_index": "test_index",
   "_type": "doc",
   "_id": "2",
   "_version": 1,
   "found": true,
   "term_vectors": {
      "text": {
         "field_statistics": {
            "sum_doc_freq": 2,
            "doc_count": 2,
            "sum_ttf": 7
         },
         "terms": {
            "apple": {
               "term_freq": 2
            }
         }
      }
   }
}

When I run the following query with "min_term_freq": 1, I get both documents back:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple",
         "min_term_freq": 1,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 2,
      "max_score": 0.5816214,
      "hits": [
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "1",
            "_score": 0.5816214,
            "_source": {
               "text": "apple, apple, apple, apple, apple"
            }
         },
         {
            "_index": "test_index",
            "_type": "doc",
            "_id": "2",
            "_score": 0.5254995,
            "_source": {
               "text": "apple, apple"
            }
         }
      ]
   }
}

However, if I increase "min_term_freq" to 2 (or more), I get nothing back, even though I would expect both documents to be returned:

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple",
         "min_term_freq": 2,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}
...
{
   "took": 1,
   "timed_out": false,
   "_shards": {
      "total": 1,
      "successful": 1,
      "failed": 0
   },
   "hits": {
      "total": 0,
      "max_score": null,
      "hits": []
   }
}

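Why? What am I missing?

If I want a query that returns only the document in which "apple" appears 5 times, and not the one in which it appears 2 times, is there a better way to do it?

For convenience, here is the code: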

http://sense.qbox.io/gist/341f9f77a6bd081debdcaa9e367f5a39be9359cc

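【Question comments】:

Tags: elasticsearch morelikethis

【Solution 1】:

The minimum term frequency and minimum document frequency are applied to the input before MLT runs. Since apple occurs only once in your input text, it never qualifies as an MLT term, because min_term_freq is set to 2. If you change the input to "apple apple", as below, everything works: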

POST /test_index/_search
{
   "query": {
      "more_like_this": {
         "fields": [
            "text"
         ],
         "like_text": "apple apple",
         "min_term_freq": 2,
         "percent_terms_to_match": 1,
         "min_doc_freq": 1
      }
   }
}

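The same goes for the minimum document frequency. apple is found in at least 2 documents, so with min_doc_freq set to anything up to 2 it still qualifies as an MLT term from the input text.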
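
To make this concrete, here is a minimal Python sketch of the idea. It is a toy model, not Elasticsearch's actual implementation: the function name `select_query_terms`, the regex tokenizer, and the `doc_freqs` dictionary (term to number of indexed documents containing it) are all assumptions made for illustration. It only shows that both thresholds are checked against the terms of the input text, never against the term frequencies of the documents being matched.

```python
import re
from collections import Counter


def select_query_terms(like_text, doc_freqs, min_term_freq=2, min_doc_freq=5):
    """Toy model: pick the terms of the INPUT text that survive both thresholds."""
    # Hypothetical tokenizer: lowercase the text and split on non-letters.
    tokens = re.findall(r"[a-z]+", like_text.lower())
    term_freqs = Counter(tokens)  # term frequency inside the input text only
    return sorted(
        term
        for term, tf in term_freqs.items()
        if tf >= min_term_freq and doc_freqs.get(term, 0) >= min_doc_freq
    )


# In the question's index, "apple" occurs in 2 documents.
doc_freqs = {"apple": 2}

print(select_query_terms("apple", doc_freqs, min_term_freq=1, min_doc_freq=1))
print(select_query_terms("apple", doc_freqs, min_term_freq=2, min_doc_freq=1))
print(select_query_terms("apple apple", doc_freqs, min_term_freq=2, min_doc_freq=1))
```

In this model the first and third calls keep apple as a query term, while the second call drops it and leaves nothing to search with, which matches the empty result in the question. Once apple survives, any document containing it can match, regardless of how often it appears there.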

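【Comments】:

Thanks, Vineeth. That works, though I still don't understand why. If I search with {... "like_text": "apple apple apple", "min_term_freq": 3,...} I still get both results, even though "apple" appears fewer than 3 times in one of the documents. So how do I limit the results to documents in which the term appears at or above the minimum frequency?
I don't think you can use MLT for this. Both the minimum term frequency and the minimum document frequency constraints are applied to the input text, not to the documents being compared. An alternative is to implement this on the filter side with a script filter, using the scripting plugin - stackoverflow.com/questions/28296320/…
Got it. Thanks for your help.
I don't think the mlt query supports "percent_terms_to_match"; at least it doesn't work with ES 2.2.
Would MLT work when the property value is not text but an array of numbers? If not, is there anything that achieves this effect? I need to take a document's tags and use them to retrieve other documents with the most matching tags (numbers).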

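【Solution 2】:

Like the poster of this question, I was also trying to wrap my head around the more_like_this query...

I had a hard time finding good sources of information on the web, but (for the most part) the documentation seemed the most helpful, so here is the link to the documentation, along with some of the more important terms (and/or the ones that are harder to understand, for which I added my own interpretation):

max_query_terms - The maximum number of query terms that will be selected (from each input document). Increasing this value gives greater accuracy at the expense of query execution speed. Defaults to 25.

min_term_freq - The minimum term frequency below which terms from the input document are ignored. Defaults to 2.

If a term appears fewer than 2 (the default) times in the input document, it is ignored, i.e. it will not be searched for in other candidate more_like_this documents.

min_doc_freq - The minimum document frequency below which terms from the input document are ignored. Defaults to 5.

This one took me a second, so here is my interpretation:

The number of documents a term from the input document must appear in before it can be selected as a query term.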
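
A small Python sketch of that interpretation, for illustration only: the helper names `document_frequency` and `passes_min_doc_freq` and the regex tokenizer are my own assumptions, not anything from Elasticsearch. It counts how many documents contain a term and compares that count with min_doc_freq, using the two documents from the question.

```python
import re


def document_frequency(term, documents):
    """Number of documents that contain the term at least once."""
    # Hypothetical tokenizer: lowercase the text and split on non-letters.
    return sum(
        1 for text in documents if term in re.findall(r"[a-z]+", text.lower())
    )


def passes_min_doc_freq(term, documents, min_doc_freq=5):
    """True if the term appears in enough documents to be kept as a query term."""
    return document_frequency(term, documents) >= min_doc_freq


docs = ["apple, apple, apple, apple, apple", "apple, apple"]

print(document_frequency("apple", docs))                    # documents containing "apple"
print(passes_min_doc_freq("apple", docs))                   # with the default of 5
print(passes_min_doc_freq("apple", docs, min_doc_freq=1))   # as set in the question
```

If this reading is right, a tiny test index like the one in the question would return nothing with the default of 5, since apple appears in only 2 documents. That is presumably why the question sets min_doc_freq to 1.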

That's it. I hope I saved someone a few minutes of their life. :)

Cheers!

【Comments】: