【Question title】: Why do two identical documents score differently?
【Posted】: 2013-01-29 00:06:08
【Question】:

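I'm currently working with the tire gem (I'm also new to elasticsearch and lucene) and trying a few things out. I need to do some (probably non-trivial) scoring, so I'm trying to get a grip on it. I read everything I could find online about the scoring formula and tried to match what I found against the explained queries.

If I'm reading the output correctly, the two documents titled "foo foo foo foo" get different scores, which is certainly not what I expected. I suspect I'm missing a step during or after indexing, but I can't figure out which.

My code is below. I'm not using the tire DSL quite the way it is intended, because I want to understand what is going on first; things may look more tire-like later.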

require 'tire'
require 'pp'

class Model
  INDEX = 'myindex'
  TYPE = 'company'

  class << self
    def delete_index
      Tire.index(INDEX) { delete }
    end

    def create_mapping
      Tire.index INDEX do
        create mappings: {
          TYPE => {
            properties: {
              title: { type: 'string' }
            }
          }
        }
      end
    end

    def refresh_index
      Tire.index INDEX do
        refresh
      end
    end
  end

  def initialize(attributes = {})
    @attributes = attributes.merge(:_id => object_id) #use oid as id, just for testing
  end

  def _type
    TYPE
  end

  def id
    object_id.to_s #convert to string because tire compares to object_id!
  end

  def index
    item = self
    Tire.index INDEX do
      store item
    end
  end

  def to_indexed_json
    @attributes.to_json
  end

  ENTITIES = [
    new(title: "foo foo foo foo"),
    new(title: "foo"),
    new(title: "bar"),
    new(title: "foo bar"),
    new(title: "xxx"),
    new(title: "foo foo foo foo"),
    new(title: "foo foo"),
    new(title: "foo bar baz")
  ]

  QUERIES = {
    :foo => { query_string: { query: "foo" } },
    :all => { match_all: {} }
  }

  def self.custom_explained_search(q)
    Tire.search(Model::INDEX, :wrapper => Model, :explain => true) do |search|
      search.query do |query|
        query.send :instance_variable_set, :@value, q
      end
    end
  end
end

class Tire::Results::Collection
  def explained
    @response["hits"]["hits"].map do |hit|
      {
        "_id" => hit["_id"],
        "_explanation" => hit["_explanation"],
        "title" => hit["_source"]["title"]
      }
    end
  end
end

Model.delete_index
Model.create_mapping
Model::ENTITIES.each &:index
Model.refresh_index
s = Model.custom_explained_search(Model::QUERIES[:foo])
pp s.results.explained

The printed result looks like this:

[{"_id"=>"2169251840",
  "_explanation"=>
   {"value"=>0.54932046,
    "description"=>"fieldWeight(_all:foo in 0), product of:",
    "details"=>
     [{"value"=>1.4142135,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]},
  "title"=>"foo foo foo foo"},
 {"_id"=>"2169251720",
  "_explanation"=>
   {"value"=>0.54932046,
    "description"=>"fieldWeight(_all:foo in 1), product of:",
    "details"=>
     [{"value"=>0.70710677,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>1.0, "description"=>"fieldNorm(field=_all, doc=1)"}]},
  "title"=>"foo"},
 {"_id"=>"2169250520",
  "_explanation"=>
   {"value"=>0.48553526,
    "description"=>"fieldWeight(_all:foo in 2), product of:",
    "details"=>
     [{"value"=>1.0,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>1.0, "description"=>"tf(phraseFreq=1.0)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=2)"}]},
  "title"=>"foo foo"},
 {"_id"=>"2169251320",
  "_explanation"=>
   {"value"=>0.44194174,
    "description"=>"fieldWeight(_all:foo in 1), product of:",
    "details"=>
     [{"value"=>0.70710677,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>1.0, "description"=>"idf(_all:  foo=1)"},
      {"value"=>0.625, "description"=>"fieldNorm(field=_all, doc=1)"}]},
  "title"=>"foo bar"},
 {"_id"=>"2169250380",
  "_explanation"=>
   {"value"=>0.27466023,
    "description"=>"fieldWeight(_all:foo in 3), product of:",
    "details"=>
     [{"value"=>0.70710677,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>0.70710677, "description"=>"tf(phraseFreq=0.5)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.7768564, "description"=>"idf(_all:  foo=4)"},
      {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=3)"}]},
  "title"=>"foo bar baz"},
 {"_id"=>"2169250660",
  "_explanation"=>
   {"value"=>0.2169777,
    "description"=>"fieldWeight(_all:foo in 0), product of:",
    "details"=>
     [{"value"=>1.4142135,
       "description"=>"btq, product of:",
       "details"=>
        [{"value"=>1.4142135, "description"=>"tf(phraseFreq=2.0)"},
         {"value"=>1.0, "description"=>"allPayload(...)"}]},
      {"value"=>0.30685282, "description"=>"idf(_all:  foo=1)"},
      {"value"=>0.5, "description"=>"fieldNorm(field=_all, doc=0)"}]},
  "title"=>"foo foo foo foo"}]

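Am I misreading the numbers? Or misusing tire? Or am I just missing some "reindex the whole collection" step?

【Comments】:

  • I turned on logging and extracted the script as a series of curl calls, then replayed them and experimented. It seems to make a difference whether I use a long _id like curl -X PUT "http://localhost:9200/myindex/company/2229231160" -d '{"title":"foo foo foo foo","_id":2229231160}' or a short _id like curl -X PUT "http://localhost:9200/myindex/company/6" -d '{"title":"foo foo foo foo","_id":6}'. That looks like a bug to me.
  • What version of elasticsearch are you using?

Tags: elasticsearch tire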


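【Answer 1】:

AFAIK, if no explicit sort field is defined, sorting defaults to (a variant of) tf * idf (http://en.wikipedia.org/wiki/Tf*idf).

Literally: term frequency * inverse document frequency.

From Wikipedia:

Term frequency (term count): the term count in a given document is simply the number of times a given term appears in that document.

Inverse document frequency is a measure of whether the term is common or rare across all documents. It is obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient.

In this case, the "term frequency" component of the sort is most likely what causes "foo foo foo foo" to score higher than the other documents when searching for "foo".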
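The two definitions above can be sketched in a few lines of Ruby. This is only an illustration, assuming the index uses Lucene's classic default similarity, where tf is the square root of the term frequency and idf is 1 + ln(numDocs / (docFreq + 1)). The method names and the document counts passed in are assumptions made for the example; the explain output in the question does not print the document count, so the counts below were chosen because they reproduce the idf values shown there.

```ruby
# Sketch of the classic Lucene tf and idf factors.
# Assumption: the default (classic TF/IDF) similarity is in use.

# Term frequency factor: square root of the raw frequency.
def tf(freq)
  Math.sqrt(freq)
end

# Inverse document frequency: 1 + ln(num_docs / (doc_freq + 1)).
# num_docs is an assumed per-shard document count, not a value
# taken from the explain output.
def idf(doc_freq, num_docs)
  1.0 + Math.log(num_docs.to_f / (doc_freq + 1))
end

puts tf(2.0).round(4)    # compare with tf(phraseFreq=2.0) in the explain output
puts tf(0.5).round(4)    # compare with tf(phraseFreq=0.5)
puts idf(4, 4).round(4)  # compare with idf(_all: foo=4)
puts idf(1, 1).round(4)  # compare with the idf(_all: foo=1) valued 0.30685282
puts idf(1, 2).round(4)  # compare with the idf(_all: foo=1) valued 1.0
```

If these assumed counts are right, the same term gets a different idf depending on how many documents happen to sit next to it, which is worth keeping in mind when comparing scores.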

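Also, regarding the effect you see when changing the id: I'm not sure, but I guess it has to do with ES internally storing documents ordered by id (again, I'm not sure) ...

If so, 2 documents with the same sort score would be ordered by id as a tie-breaker. You can of course define multiple sorts to change this behavior (e.g. sort=sorta+desc,sortb+desc. In this case sortb is used as the tie-breaker for all documents that score the same on sorta).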
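The multi-level sort mentioned above could also be expressed in the JSON search body rather than in the URL. The following is a minimal sketch, not taken from the question: `sorta` and `sortb` are the placeholder field names from the example, and `search_body_with_tiebreaker` is a hypothetical helper introduced only for illustration.

```ruby
require 'json'

# Build a search body whose results are ordered by sorta first and,
# for documents that tie on sorta, by sortb.
# sorta and sortb are placeholder field names, not fields of the
# mapping shown in the question.
def search_body_with_tiebreaker(query)
  {
    query: query,
    sort: [
      { "sorta" => "desc" },
      { "sortb" => "desc" }
    ]
  }
end

body = search_body_with_tiebreaker({ query_string: { query: "foo" } })
puts JSON.pretty_generate(body)
```

The order of the entries in the `sort` array matters: later entries are only consulted when all earlier ones compare equal.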

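【Comments】:

  • Hmm, I think I misread your post, since you're saying that the 2 documents titled "foo foo foo foo" score differently? If that's the case, I don't know where the score difference comes from.
  • tf*idf is correct, but it's worth noting that unless you use dfs queries, this is the document frequency on the local shard, not over the whole index...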
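
The point about shard-local document frequency can be checked against the explain output in the question. The sketch below simply multiplies the three factors reported for each of the two "foo foo foo foo" hits and lists which factors differ between them. All numbers are copied from that output; the variable and method names are made up for the example.

```ruby
# Factors copied from the explain output for the two hits titled
# "foo foo foo foo". Nothing here is computed by Elasticsearch;
# this only replays the arithmetic.
hits = {
  "2169251840" => { tf: 1.4142135, idf: 0.7768564,  field_norm: 0.5 },
  "2169250660" => { tf: 1.4142135, idf: 0.30685282, field_norm: 0.5 }
}

# fieldWeight is reported as the product of tf, idf and fieldNorm.
def field_weight(factors)
  factors[:tf] * factors[:idf] * factors[:field_norm]
end

hits.each do |id, factors|
  puts "#{id}: #{field_weight(factors).round(7)}"
end

# List the factors whose values are not identical across the two hits.
differing = hits.values.first.keys.select do |key|
  hits.values.map { |factors| factors[key] }.uniq.size > 1
end
puts "differing factors: #{differing.inspect}"
```

Only the idf factor differs, while tf and fieldNorm are identical, which is consistent with the two documents living on different shards that each compute idf from their own local documents. Elasticsearch offers the `search_type=dfs_query_then_fetch` request parameter to gather term statistics across shards before scoring; whether that removes the difference in this particular setup is not confirmed by the thread.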