【Question title】: elasticsearch disable term frequency scoring
【Posted】: 2015-11-18 11:59:54
【Question】:

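I want to change the scoring in elasticsearch so that multiple occurrences of a term are not counted. For example, I want:

"Texas Texas"

"Texas"

to come out with the same score. I found this mapping, which according to the elasticsearch documentation disables term frequency counting, but my search results still do not score the same:

{
  "mappings": {
    "business": {
      "properties": {
        "name": {
          "type": "string",
          "index_options": "docs",
          "norms": { "enabled": false }
        }
      }
    }
  }
}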

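Any help would be greatly appreciated; I have not been able to find much information on this.

Edit:

I am adding my search code and what is returned when I use explain.

My search code: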

Settings settings = ImmutableSettings.settingsBuilder()
        .put("cluster.name", "escluster")
        .build();
Client client = new TransportClient(settings)
        .addTransportAddress(new InetSocketTransportAddress("127.0.0.1", 9300));

SearchRequest request = Requests.searchRequest("businesses")
        .source(SearchSourceBuilder.searchSource()
                .query(QueryBuilders.boolQuery()
                        .should(QueryBuilders.matchQuery("name", "Texas")
                                .minimumShouldMatch("1"))))
        .searchType(SearchType.DFS_QUERY_THEN_FETCH);

    ExplainRequest request2 = client.prepareIndex("businesses", "business")

When I search with explain, I get:

  "took" : 14,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.0,
    "hits" : [ {
      "_shard" : 1,
      "_node" : "BTqBPVDET5Kr83r-CYPqfA",
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9U5KBks4zEorv9YI4n",
      "_score" : 1.0,
      "_source":{
"name" : "texas"
}
,
      "_explanation" : {
        "value" : 1.0,
        "description" : "weight(_all:texas in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 1.0,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0"
            } ]
          }, {
            "value" : 1.0,
            "description" : "idf(docFreq=2, maxDocs=3)"
          }, {
            "value" : 1.0,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]
      }
    }, {
      "_shard" : 1,
      "_node" : "BTqBPVDET5Kr83r-CYPqfA",
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9U5K6Ks4zEorv9YI4o",
      "_score" : 0.8660254,
      "_source":{
"name" : "texas texas texas"
}
,
      "_explanation" : {
        "value" : 0.8660254,
        "description" : "weight(_all:texas in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 0.8660254,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.7320508,
            "description" : "tf(freq=3.0), with freq of:",
            "details" : [ {
              "value" : 3.0,
              "description" : "termFreq=3.0"
            } ]
          }, {
            "value" : 1.0,
            "description" : "idf(docFreq=2, maxDocs=3)"
          }, {
            "value" : 0.5,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]
      }
    } ]
  }

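It looks like it is still taking term frequency and document frequency into account. Any ideas? Sorry for the bad formatting; I don't know why it looks so odd.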
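The numbers in the explain output above can be checked by hand. What follows is a minimal sketch, assuming the classic Lucene TF-IDF similarity in which tf is the square root of the term frequency (this agrees with tf(freq=3.0) = 1.7320508 in the output). The class and method names are illustrative only and are not part of any Elasticsearch API; the idf and fieldNorm inputs are copied from the explain output rather than derived.

```java
// Illustrative check of the explain output; names are hypothetical.
final class FieldWeightCheck {

    // tf as reported by explain: the square root of the raw term frequency.
    static double tf(double termFreq) {
        return Math.sqrt(termFreq);
    }

    // fieldWeight is reported as the product of tf, idf and fieldNorm.
    static double fieldWeight(double termFreq, double idf, double fieldNorm) {
        return tf(termFreq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        // idf = 1.0 and the fieldNorm values are taken from the explain output.
        System.out.println(fieldWeight(1.0, 1.0, 1.0)); // document "texas"
        System.out.println(fieldWeight(3.0, 1.0, 0.5)); // document "texas texas texas"
    }
}
```

The second product comes out at roughly 0.866, matching the reported score. Both termFreq=3.0 and fieldNorm=0.5 indicate that frequencies and norms were still in effect for the field that was scored, and the explain output names that field as _all rather than name.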

Edit 2:

The result of the browser search http://localhost:9200/businesses/business/_search?pretty=true&qname=texas is:

    {
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 4,
    "max_score" : 1.0,
    "hits" : [ {
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YcCKjKvtg8NgyozGK",
      "_score" : 1.0,
      "_source":{"business" : {
"name" : "texas texas texas texas" }
}
    }, {
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YateBKvtg8Ngyoy-p",
      "_score" : 1.0,
      "_source":{
"name" : "texas" }

    }, {
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YavVnKvtg8Ngyoy-4",
      "_score" : 1.0,
      "_source":{
"name" : "texas texas texas" }

    }, {
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9Yb7NgKvtg8NgyozFf",
      "_score" : 1.0,
      "_source":{"business" : {
"name" : "texas texas texas" }
}
    } ]
  }
}

It finds all 4 objects I have in there, and they all have the same score. When I run my Java API search with explain, I get:

    {
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 3,
    "successful" : 3,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.287682,
    "hits" : [ {
      "_shard" : 1,
      "_node" : "BTqBPVDET5Kr83r-CYPqfA",
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YateBKvtg8Ngyoy-p",
      "_score" : 1.287682,
      "_source":{
"name" : "texas" }
,
      "_explanation" : {
        "value" : 1.287682,
        "description" : "weight(name:texas in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 1.287682,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.0,
            "description" : "tf(freq=1.0), with freq of:",
            "details" : [ {
              "value" : 1.0,
              "description" : "termFreq=1.0"
            } ]
          }, {
            "value" : 1.287682,
            "description" : "idf(docFreq=2, maxDocs=4)"
          }, {
            "value" : 1.0,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]
      }
    }, {
      "_shard" : 1,
      "_node" : "BTqBPVDET5Kr83r-CYPqfA",
      "_index" : "businesses",
      "_type" : "business",
      "_id" : "AU9YavVnKvtg8Ngyoy-4",
      "_score" : 1.1151654,
      "_source":{
"name" : "texas texas texas" }
,
      "_explanation" : {
        "value" : 1.1151654,
        "description" : "weight(name:texas in 0) [PerFieldSimilarity], result of:",
        "details" : [ {
          "value" : 1.1151654,
          "description" : "fieldWeight in 0, product of:",
          "details" : [ {
            "value" : 1.7320508,
            "description" : "tf(freq=3.0), with freq of:",
            "details" : [ {
              "value" : 3.0,
              "description" : "termFreq=3.0"
            } ]
          }, {
            "value" : 1.287682,
            "description" : "idf(docFreq=2, maxDocs=4)"
          }, {
            "value" : 0.5,
            "description" : "fieldNorm(doc=0)"
          } ]
        } ]
      }
    } ]
  }
}
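The idf values in this output follow from docFreq and maxDocs. The sketch below assumes the classic Lucene formula idf = 1 + ln(maxDocs / (docFreq + 1)), which reproduces both idf values reported in this thread (1.0 for maxDocs=3 and 1.287682 for maxDocs=4). It also prints the score both documents would share if the term frequency counted as 1 and the norm as 1.0, which is the outcome the mapping was meant to produce. All names are illustrative.

```java
// Illustrative only; reproduces the idf and score values from the explain output.
final class IdfCheck {

    // Classic Lucene idf (assumed here): 1 + ln(maxDocs / (docFreq + 1)).
    static double idf(long docFreq, long maxDocs) {
        return 1.0 + Math.log((double) maxDocs / (docFreq + 1));
    }

    // Score of a single-term match: sqrt(termFreq) * idf * fieldNorm.
    static double score(double termFreq, double idf, double fieldNorm) {
        return Math.sqrt(termFreq) * idf * fieldNorm;
    }

    public static void main(String[] args) {
        double idf = idf(2, 4);
        System.out.println("idf                = " + idf);
        // Observed: frequencies and norms still applied to "texas texas texas".
        System.out.println("observed (3 terms) = " + score(3.0, idf, 0.5));
        // Intended: frequency treated as 1 and norm as 1.0 for every document.
        System.out.println("intended (any doc) = " + score(1.0, idf, 1.0));
    }
}
```

With frequencies and norms out of the picture, every matching document would score exactly the idf value, so "texas" and "texas texas texas" would tie.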

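【Comments on the question】:

  • The mismatch probably has more to do with document frequency than with term frequency. Does using search_type=dfs_query_then_fetch help? If not, try setting explain=true on the query to see the score breakdown.
  • I switched it to dfs_query_then_fetch, but that did not work. I will post my code and the explain results a bit later.
  • Could you post the query as well?
  • Sorry, what do you mean? I am just executing the SearchRequest from above: ActionFuture af = client.search(request);
  • Thanks for the formatting edit!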

Tags: elasticsearch frequency java term scoring


【Solution 1】:

The field type must be text.

You also have to reindex: create a new index with the mapping below, then index the documents into it.

"mappings": {
    "properties": {
      "text": {
        "type": "text",
        "index_options": "docs"
      }
    }
  }

https://www.elastic.co/guide/en/elasticsearch/reference/current/index-options.html

【Discussion】:

【Solution 2】:

It seems that the index_options of a field cannot be overridden once the field has been set in the mapping.

Example:

    put test
    put test/business/_mapping
    {
    
          "properties": {
             "name": {
                "type": "string",
               "index_options": "freqs",
                "norms": {
                   "enabled": false
                }
             }
          }
    
    }
    put test/business/_mapping
    {
    
          "properties": {
             "name": {
                "type": "string",
                "index_options": "docs",
                "norms": {
                   "enabled": false
                }
             }
          }
    
    }
    get  test/business/_mapping
    
       {
       "test": {
          "mappings": {
             "business": {
                "properties": {
                   "name": {
                      "type": "string",
                      "norms": {
                         "enabled": false
                      },
                      "index_options": "freqs"
                   }
                }
             }
          }
       }
    }
    

You have to recreate the index for the new mapping to take effect.
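Recreating the index means sending the mapping together with the index-creation request, before any document is indexed. A minimal sketch using the JDK's java.net.http classes (Java 11 or later) to build that request; it only constructs the request and does not send it. The index name businesses_v2 is a hypothetical placeholder, the host is the localhost:9200 address used elsewhere in this thread, and the body uses the question's mapping format ("string" type under the "business" document type), which newer Elasticsearch versions no longer accept.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Illustrative helper; class and method names are hypothetical.
final class RecreateIndex {

    // Mapping supplied at index creation, in the format used in the question.
    static String mappingBody() {
        return "{\"mappings\":{\"business\":{\"properties\":{\"name\":{"
                + "\"type\":\"string\","
                + "\"index_options\":\"docs\","
                + "\"norms\":{\"enabled\":false}"
                + "}}}}}";
    }

    // Builds (but does not send) a PUT request that creates the index.
    static HttpRequest createIndexRequest(String baseUrl, String index) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/" + index))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(mappingBody()))
                .build();
    }

    public static void main(String[] args) {
        HttpRequest request = createIndexRequest("http://localhost:9200", "businesses_v2");
        System.out.println(request.method() + " " + request.uri());
        System.out.println(mappingBody());
        // To send it, pass the request to HttpClient.newHttpClient().send(...).
    }
}
```

After the new index exists, the documents have to be indexed into it again; documents indexed under the old mapping keep their stored frequencies and norms.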

【Discussion】:

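• This is embarrassing, it was my own mistake. I was only testing in my browser with the command localhost:9200/businesses/…, and after I changed it to "qname=texas" it works and the scores are the same. So why does it not work for my Java API search, where I seem to be searching the name field?
• Can you paste the whole snippet, or better yet the response with explain set in the Java client?
• Sorry, I am not sure how to set that in the Java API; it does not seem to be an option on SearchRequest. I will update my OP with the code.
• I changed to SearchResponse to be able to use explain, updated the OP again, and overwrote the previous edit. It looks like when I use the Java API, it does not pick up the setting that is supposed to ignore frequency.
• Strange. Can you try this in the browser: http://localhost:9200/businesses/business/_search?pretty=true&q=name:texas&search_type=dfs_query_then_fetch&explain=true and see whether you still get the same scores? I have a feeling the mapping may not have been applied, or was applied after the documents were indexed.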
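The URL suggested in the last comment differs from the earlier browser test in one detail: it passes the query as q=name:texas rather than qname=texas. One plausible reading, stated here as an assumption rather than a confirmed diagnosis, is that qname is not a parameter the search endpoint recognizes, so the earlier request ran without a query; that would be consistent with all four documents coming back with a score of 1.0. The sketch below simply assembles the suggested URL from its parts; the helper names are illustrative.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

// Illustrative helper; names are hypothetical.
final class ExplainUrl {

    // Joins parameters in insertion order. Values are not URL-encoded,
    // which is fine for the simple values used here.
    static URI build(String base, Map<String, String> params) {
        StringJoiner query = new StringJoiner("&");
        params.forEach((key, value) -> query.add(key + "=" + value));
        return URI.create(base + "?" + query);
    }

    // The URL suggested in the comment above.
    static URI suggested() {
        Map<String, String> params = new LinkedHashMap<>();
        params.put("pretty", "true");
        params.put("q", "name:texas");              // query restricted to the name field
        params.put("search_type", "dfs_query_then_fetch");
        params.put("explain", "true");              // include the score breakdown
        return build("http://localhost:9200/businesses/business/_search", params);
    }

    public static void main(String[] args) {
        System.out.println(suggested());
    }
}
```

In the response, the description line of each explanation shows which field was scored (for example weight(name:texas ...)), which makes it easy to confirm that the query reached the intended field.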