elasticsearch fuzzy matching max_expansions & min_similarity: answers

【Question title】: elasticsearch fuzzy matching max_expansions & min_similarity
【Posted】: 2011-11-01 04:39:04
【Question】:

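I use fuzzy matching in my project mainly to catch misspellings and alternative spellings of the same names. I need to understand exactly how Elasticsearch's fuzzy matching works and how it uses the two parameters mentioned in the title.

As I understand it, min_similarity is the percentage by which the query string must match the string in the database. I could not find an exact description of how this value is calculated.

As I understand it, max_expansions is the Levenshtein distance within which the search should be performed. If it really were the Levenshtein distance, that would be the ideal solution for me. In any case, it does not work that way. For example, I have the word "Samvel":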

queryStr      max_expansions           matches?
samvel        0                        Error: should not be 0 (but a Levenshtein distance can be 0!)
samvel        1                        Yes
samvvel       1                        Yes
samvvell      1                        Yes (but it shouldn't have)
samvelll      1                        Yes (but it shouldn't have)
saamvelll     1                        No (but for some weird reason it matches Samvelian)
saamvelll     anything bigger than 1   No

The documentation says something that I do not actually understand:

Add max_expansions to the fuzzy query allowing to control the maximum number 
of terms to match. Default to unbounded (or bounded by the max clause count in 
boolean query).

So could anyone please explain to me exactly how these parameters affect the search results?

【Question comments】:

Tags: elasticsearch fuzzy-search fuzzy-logic fuzzy-comparison

【Solution 1】:

min_similarity is a value between 0 and 1. From the Lucene documentation:

For example, for a minimumSimilarity of 0.5 a term of the same length 
as the query term is considered similar to the query term if the edit 
distance between both terms is less than length(term)*0.5

The "edit distance" referred to here is the Levenshtein distance.
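To make the quoted rule concrete, here is a minimal Python sketch. It assumes the similarity is computed as 1 - distance / min(len(query), len(term)) and compared strictly against min_similarity (default 0.5), with no prefix length. That formula is an assumption based on how the older Lucene fuzzy implementation appears to behave; the quote above does not spell it out, and the function names are made up for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(query: str, term: str) -> float:
    # ASSUMPTION: the distance is normalised by the shorter of the two strings.
    shortest = min(len(query), len(term))
    if shortest == 0:
        return 1.0 if query == term else 0.0
    return 1.0 - levenshtein(query, term) / shortest


def is_fuzzy_match(query: str, term: str, min_similarity: float = 0.5) -> bool:
    # ASSUMPTION: a term matches only if it is strictly above the threshold.
    return similarity(query, term) > min_similarity


if __name__ == "__main__":
    for query in ["samvel", "samvvel", "samvelll", "saamvelll"]:
        print(query, levenshtein(query, "samvel"), is_fuzzy_match(query, "samvel"))
```

Under these assumptions, "samvelll" is two edits away from "samvel" but still has a similarity of 1 - 2/6, about 0.67, so it passes a threshold of 0.5. "saamvelll" is three edits away, giving a similarity of exactly 0.5, so it does not pass. That would be consistent with the table in the question.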

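Internally, the query works like this:

1. It finds all terms in the index that could match the search term, taking min_similarity into account.
2. It then searches for all of those terms.

You can imagine how heavy this query can be!

To deal with this, you can set the max_expansions parameter to specify the maximum number of matching terms that should be considered.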
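The two steps above and the effect of max_expansions can be sketched in Python as follows. This is a toy model, not Lucene's or Elasticsearch's actual code: the index contents, the function names, the similarity formula, and the choice to keep the most similar terms when truncating are all assumptions made for illustration.

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming over two rows."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        previous = current
    return previous[-1]


def expand_terms(query, index_terms, min_similarity=0.5, max_expansions=None):
    """Step 1: collect the index terms that are similar enough to the query."""
    scored = []
    for term in index_terms:
        shortest = min(len(query), len(term))
        if shortest == 0:
            continue
        # ASSUMPTION: similarity = 1 - distance / length of the shorter string.
        sim = 1.0 - levenshtein(query, term) / shortest
        if sim > min_similarity:
            scored.append((sim, term))
    # ASSUMPTION: the most similar terms are kept first (ties broken by term).
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    terms = [term for _, term in scored]
    # max_expansions caps the term list, not the final hit list.
    return terms if max_expansions is None else terms[:max_expansions]


def search(query, inverted_index, **kwargs):
    """Step 2: look up every expanded term and merge the document ids."""
    doc_ids = set()
    for term in expand_terms(query, inverted_index.keys(), **kwargs):
        doc_ids.update(inverted_index[term])
    return sorted(doc_ids)


# Hypothetical inverted index: term -> ids of the documents containing it.
TOY_INDEX = {
    "samvel": [1],
    "samvelian": [2],
    "samuel": [3],
    "manvel": [4],
    "daniel": [5],
}

if __name__ == "__main__":
    print(search("samvel", TOY_INDEX))
    print(search("samvel", TOY_INDEX, max_expansions=2))
```

In this model, max_expansions never changes which terms are considered similar. It only cuts the candidate term list short before the second step runs, which is why it says nothing about edit distance.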

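【Comments】:

Ah, so max_expansions and min_similarity are meant to be used together. So the actual distance limiting is done by min_similarity, and max_expansions works like MySQL's LIMIT clause? It just limits the number of potential results?
Yes, it works like a LIMIT clause, applied not to the final query that is run but to the interim query used to find the list of terms to search for in the final query.
Thanks a lot :) That was very helpful :)
@DrTech You said "it finds all terms in the index that could match the search term". Isn't that exactly what a fuzzy search is? You said "it then searches for all of those terms". Wasn't that already done in the first step? Isn't finding terms in the index that could match the search term the whole purpose of the search?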