Exact document matching with ElasticSearch: answers

【Question title】: Exact document matching with ElasticSearch
【Posted】: 2013-03-10 22:51:28
【Question】:

I need to query a set of "short documents" precisely. Example:

Documents:

{"name": "John Doe", "alt": "John W Doe"}
{"name": "My friend John Doe", "alt": "John A Doe"}
{"name": "John", "alt": "Susy"}
{"name": "Jack", "alt": "John Doe"}

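Expected results:

If I search for "John Doe", I want document 1 to score much higher than documents 2 and 4
If I search for "John Doé", same as above
If I search for "John", I want to get document 3 (a full match is better than a repeated name and alt)

Is this possible with ES? How can I achieve it? I tried boosting "name", but I can't find how to match a document field exactly, rather than searching within it.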

【Question comments】:

Tags: lucene elasticsearch

【Solution 1】:

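What you describe is exactly how a search engine works by default. A search for "John Doe" becomes a search for the terms "john" and "doe". For each term, it looks up the documents that contain that term, then assigns each document a _score based on:

how common the term is across all documents (more common == less relevant)
how common the term is in the document's field (more common == more relevant)
how long the document's field is (longer == less relevant)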
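
The three factors above can be sketched in a few lines of Python. This is a simplified illustration in the spirit of classic TF-IDF scoring, not Lucene's exact implementation: the function names (`tokenize`, `score`) are made up for this example, the tokenizer is a plain whitespace split, and real scoring applies further normalization that is left out here.

```python
import math

def tokenize(text):
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    return text.lower().split()

def score(query, field_text, all_field_texts):
    """Toy relevance score combining the three factors from the list above."""
    doc_terms = tokenize(field_text)
    n_docs = len(all_field_texts)
    total = 0.0
    for term in tokenize(query):
        freq = doc_terms.count(term)
        if freq == 0:
            continue
        # Factor 1: terms that appear in many documents count for less.
        doc_freq = sum(1 for t in all_field_texts if term in tokenize(t))
        idf = 1.0 + math.log(n_docs / (doc_freq + 1))
        # Factor 2: terms that appear often in this field count for more.
        tf = math.sqrt(freq)
        # Factor 3: longer fields count for less.
        length_norm = 1.0 / math.sqrt(len(doc_terms))
        total += tf * idf * idf * length_norm
    return total

names = ["John Doe", "My friend John Doe", "John", "Jack"]
scores_john = [score("john", n, names) for n in names]
scores_john_doe = [score("john doe", n, names) for n in names]

for name, s1, s2 in zip(names, scores_john, scores_john_doe):
    print(f"{name!r}: john={s1:.3f}  john doe={s2:.3f}")
```

With these four names, the one-word field "John" ranks first for the query "john", and "John Doe" ranks first for the query "john doe", because the shorter field is penalized less.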

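The reason you aren't seeing clear-cut results is that Elasticsearch is distributed, and you are testing with a small amount of data. By default, an index has 5 primary shards, and your documents are indexed on different shards. Each shard keeps its own document frequency counts, so the scores are skewed.

When you add real-world amounts of data, the frequencies even out across the shards, but to test with small amounts of data you need to do one of two things:

create an index with just one primary shard, or
specify search_type=dfs_query_then_fetch, which first fetches the frequencies from every shard and then runs the query using the global frequencies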
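
To see why per-shard frequencies skew the scores on a tiny data set, here is a small simulation. The assignment of names to shards is invented for illustration (Elasticsearch decides placement by hashing the document ID), and the `idf` formula is the same simplified one assumed in TF-IDF-style scoring, not a guaranteed match for what the server computes.

```python
import math

def idf(n_docs, doc_freq):
    # Simplified inverse document frequency: rarer terms weigh more.
    return 1.0 + math.log(n_docs / (doc_freq + 1))

def doc_freq(term, docs):
    # Number of documents whose text contains the term.
    return sum(1 for d in docs if term in d.lower().split())

# Hypothetical placement of the four "name" values on three non-empty shards.
shards = [
    ["John Doe", "Jack"],
    ["My friend John Doe"],
    ["John"],
]

# Each shard only sees its own documents, so "doe" gets a different weight on each.
local_idfs = [idf(len(s), doc_freq("doe", s)) for s in shards]

# With global frequencies (the effect of dfs_query_then_fetch or a single shard),
# every document is scored with the same weight.
all_docs = [d for s in shards for d in s]
global_idf = idf(len(all_docs), doc_freq("doe", all_docs))

print("per-shard idf for 'doe':", [round(x, 3) for x in local_idfs])
print("global idf for 'doe':   ", round(global_idf, 3))
```

The per-shard values disagree with each other and with the global value, which is why the same term can contribute differently to two documents that happen to live on different shards.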

To demonstrate, first index your data:

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
{
   "alt" : "John W Doe",
   "name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
{
   "alt" : "John A Doe",
   "name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
{
   "alt" : "Susy",
   "name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
{
   "alt" : "John Doe",
   "name" : "Jack"
}
'

Now search for "john doe", remembering to specify dfs_query_then_fetch:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
   "query" : {
      "match" : {
         "name" : "john doe"
      }
   }
}
'

Doc 1 comes first in the results:

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "alt" : "John W Doe",
#                "name" : "John Doe"
#             },
#             "_score" : 1.0189849,
#             "_index" : "test",
#             "_id" : "1",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John A Doe",
#                "name" : "My friend John Doe"
#             },
#             "_score" : 0.81518793,
#             "_index" : "test",
#             "_id" : "2",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "Susy",
#                "name" : "John"
#             },
#             "_score" : 0.3066778,
#             "_index" : "test",
#             "_id" : "3",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 1.0189849,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 8
# }

When you search for just "john":

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
   "query" : {
      "match" : {
         "name" : "john"
      }
   }
}
'

Doc 3 comes first:

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "alt" : "Susy",
#                "name" : "John"
#             },
#             "_score" : 1,
#             "_index" : "test",
#             "_id" : "3",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John W Doe",
#                "name" : "John Doe"
#             },
#             "_score" : 0.625,
#             "_index" : "test",
#             "_id" : "1",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John A Doe",
#                "name" : "My friend John Doe"
#             },
#             "_score" : 0.5,
#             "_index" : "test",
#             "_id" : "2",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 1,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 5
# }

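Ignoring accents

The second issue is matching "John Doé". This is a question of analysis. To make full text more searchable, we analyze it, splitting it into terms (or tokens) that are stored in the index. So that a user searching for john matches john, John and JOHN alike, each term/token is passed through a number of token filters, which put it into a standard form.

When we run a full-text search, the search terms go through the same process. So if we have a document containing John, it is indexed as john, and if the user searches for JOHN, we actually search for john.

To make Doé match doe, we need a filter that strips accents, and we need to apply it both to the text being indexed and to the search terms. The easiest way to do this is to use the ASCII folding token filter.

We can define a custom analyzer when we create the index, and in the mapping we can specify that a particular field should use that analyzer at both index time and search time.

First, delete the old index: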
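
As a rough illustration of what lowercasing plus accent folding does to both the indexed text and the search terms, here is a Python approximation. It uses Unicode decomposition from the standard library, which is an assumption made for this sketch: the real ASCII folding filter has its own character mapping table and covers characters (for example "ø" or "ß") that this approach leaves untouched. The names `fold_to_ascii` and `analyze` are hypothetical.

```python
import unicodedata

def fold_to_ascii(text):
    # Decompose accented characters ("é" -> "e" + combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def analyze(text):
    # Whitespace tokenizer, then lowercase and accent folding per token.
    return [fold_to_ascii(token).lower() for token in text.split()]

indexed_terms = analyze("John Doe")   # what would be stored at index time
query_terms = analyze("John Doé")     # what would be searched at query time

print(indexed_terms, query_terms)
```

Because the same function runs on both sides, the accented query ends up as the same terms as the unaccented document, which is the property the analyzer needs to provide.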

curl -XDELETE 'http://127.0.0.1:9200/test/?pretty=1'

Then create the index, specifying the custom analyzer and the mapping:

curl -XPUT 'http://127.0.0.1:9200/test/?pretty=1'  -d '
{
   "settings" : {
      "analysis" : {
         "analyzer" : {
            "no_accents" : {
               "filter" : [
                  "standard",
                  "lowercase",
                  "asciifolding"
               ],
               "type" : "custom",
               "tokenizer" : "standard"
            }
         }
      }
   },
   "mappings" : {
      "test" : {
         "properties" : {
            "name" : {
               "type" : "string",
               "analyzer" : "no_accents"
            }
         }
      }
   }
}
'

Reindex the data:

curl -XPUT 'http://127.0.0.1:9200/test/test/1?pretty=1'  -d '
{
   "alt" : "John W Doe",
   "name" : "John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/2?pretty=1'  -d '
{
   "alt" : "John A Doe",
   "name" : "My friend John Doe"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/3?pretty=1'  -d '
{
   "alt" : "Susy",
   "name" : "John"
}
'
curl -XPUT 'http://127.0.0.1:9200/test/test/4?pretty=1'  -d '
{
   "alt" : "John Doe",
   "name" : "Jack"
}
'

Now test the search:

curl -XGET 'http://127.0.0.1:9200/test/test/_search?pretty=1&search_type=dfs_query_then_fetch'  -d '
{
   "query" : {
      "match" : {
         "name" : "john doé"
      }
   }
}
'

# {
#    "hits" : {
#       "hits" : [
#          {
#             "_source" : {
#                "alt" : "John W Doe",
#                "name" : "John Doe"
#             },
#             "_score" : 1.0189849,
#             "_index" : "test",
#             "_id" : "1",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "John A Doe",
#                "name" : "My friend John Doe"
#             },
#             "_score" : 0.81518793,
#             "_index" : "test",
#             "_id" : "2",
#             "_type" : "test"
#          },
#          {
#             "_source" : {
#                "alt" : "Susy",
#                "name" : "John"
#             },
#             "_score" : 0.3066778,
#             "_index" : "test",
#             "_id" : "3",
#             "_type" : "test"
#          }
#       ],
#       "max_score" : 1.0189849,
#       "total" : 3
#    },
#    "timed_out" : false,
#    "_shards" : {
#       "failed" : 0,
#       "successful" : 5,
#       "total" : 5
#    },
#    "took" : 6
# }

【Comments】:

【Solution 2】:

I think that if you map the field as a multi field and boost the non-analyzed sub-field, you will achieve what you need:

 "name": {
            "type": "multi_field",
            "fields": {
                "untouched": {
                    "type": "string",
                    "index": "not_analyzed",
                    "boost": "1.1"
                },
                "name": {
                    "include_in_all": true,
                    "type": "string",
                    "index": "analyzed",
                    "search_analyzer": "someanalyzer",
                    "index_analyzer": "someanalyzer"
                }
            }
        }
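
The difference between the two sub-fields can be sketched in Python. This is only a model of the idea, under the assumption that an analyzed field is indexed as lowercase word tokens while a not_analyzed field is indexed as one term holding the whole original value; the helper names are hypothetical.

```python
def analyzed_terms(value):
    # Analyzed field: the value is split into lowercase tokens.
    return value.lower().split()

def not_analyzed_terms(value):
    # not_analyzed field: the whole value is indexed as one exact term.
    return [value]

names = {1: "John Doe", 2: "My friend John Doe", 3: "John", 4: "Jack"}

def matching_ids(term, term_fn):
    # IDs of the documents whose indexed terms contain the given term.
    return sorted(doc_id for doc_id, name in names.items() if term in term_fn(name))

analyzed_hits = matching_ids("john", analyzed_terms)
exact_hits = matching_ids("John", not_analyzed_terms)

print("analyzed field matches:", analyzed_hits)
print("untouched field matches:", exact_hits)
```

Only the document whose entire name is "John" matches on the untouched sub-field, so a boost on that sub-field lifts exact matches above partial ones. Note that in this model the untouched sub-field is case- and accent-sensitive, since nothing normalizes it.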

If you need flexibility, you can also boost at query time rather than at index time, by using the '^' notation in a query_string:

{
    "query_string" : {
        "fields" : ["name", "name.untouched^5"],
        "query" : "this AND that OR thus"
    }
}
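
If the boosts need to vary per request, the request body can be assembled in code. The sketch below only builds and prints the JSON; it does not talk to a server. The helpers `boosted_fields` and `query_string_body` are hypothetical names for this example, and wrapping the clause in a top-level "query" key is an assumption about how it would be sent to the search endpoint.

```python
import json

def boosted_fields(boosts):
    # Turn {"field": boost} into the "field^boost" notation; a boost of 1 is omitted.
    fields = []
    for name, boost in boosts.items():
        fields.append(name if boost == 1 else f"{name}^{boost}")
    return fields

def query_string_body(query, boosts):
    # Each field is its own list element; a single "a, b^5" string would be wrong.
    return {
        "query": {
            "query_string": {
                "fields": boosted_fields(boosts),
                "query": query,
            }
        }
    }

body = query_string_body("john doe", {"name": 1, "name.untouched": 5})
print(json.dumps(body, indent=4))
```

Generating the body with a JSON library also avoids hand-written syntax slips such as a trailing comma after the last key.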

【Comments】: