Smartcase searches/highlights with ElasticSearch

【Question title】: Smartcase searches/highlights with ElasticSearch
【Posted】: 2016-04-30 21:53:52
【Question description】:

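Context

I'm trying to support smart-case search in our application, which uses Elasticsearch. The use case I want to support is partial matching on any blob of text with smart-case semantics. I managed to configure my index in a way that simulates smart-case search. It uses ngrams with a maximum length of 8 to keep storage requirements from getting out of hand.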

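The way it works is that every document gets a generated case-sensitive field and a case-insensitive field, populated via copy_to, each with its own indexing strategy. When searching for a given input, I split the input into parts. How it is split depends on the ngram length, whitespace, and double-quote escaping. Each part is checked for capital letters. If a part contains a capital letter, a match filter is generated for that part against the case-sensitive field; otherwise the case-insensitive field is used.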
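The splitting and field selection described above could be sketched as follows. This is only a minimal illustration, not the actual application code: the function names, the regular expression used for double-quoted phrases, and the way long pieces are cut into overlapping chunks are all assumptions. Only the field names case_sensitive and case_insensitive and the maximum ngram length of 8 come from the description.

```python
import re

MAX_GRAM = 8  # assumed to match the maximum ngram length used by the index


def split_parts(user_input, max_gram=MAX_GRAM):
    """Split on whitespace, keep double-quoted phrases together, and cut
    every piece into chunks of at most max_gram characters."""
    parts = []
    for quoted, bare in re.findall(r'"([^"]*)"|(\S+)', user_input):
        piece = quoted or bare
        for i in range(0, len(piece), max_gram):
            # Assumption: let the last chunk overlap the previous one so it
            # is never shorter than necessary (very short chunks could fall
            # below the minimum ngram length and never match).
            start = max(0, min(i, len(piece) - max_gram))
            parts.append(piece[start:start + max_gram])
    return parts


def build_query(user_input):
    """Build a bool query: parts containing an uppercase letter go to the
    case-sensitive field, all other parts to the case-insensitive field."""
    must = []
    for part in split_parts(user_input):
        has_upper = any(ch.isupper() for ch in part)
        field = "case_sensitive" if has_upper else "case_insensitive"
        must.append({"match": {field: {"query": part, "type": "boolean"}}})
    return {"bool": {"must": must}}
```

For the input tHis test this yields one match clause on case_sensitive for tHis and one on case_insensitive for test.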

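This has proven to work very well, but I can't get highlighting to work the way I'd like. To explain the problem better, I've added an overview of my test setup below.

Setup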

curl -X DELETE localhost:9200/custom
curl -X PUT    localhost:9200/custom -d '
{
  "settings": {
    "analysis": {
      "filter": {
        "default_min_length": {
          "type": "length",
          "min": 1
        },
        "squash_spaces": {
          "type": "pattern_replace",
          "pattern": "\\s{2,}",
          "replacement": " "
        }
      },
      "tokenizer": {
        "ngram_tokenizer": {
          "type": "nGram",
          "min_gram": "2",
          "max_gram": "8"
        }
      },
      "analyzer": {
        "index_raw": {
          "type": "custom",
          "filter": ["lowercase","squash_spaces","trim","default_min_length"],
          "tokenizer": "keyword"
        },
        "index_case_insensitive": {
          "type": "custom",
          "filter": ["lowercase","squash_spaces","trim","default_min_length"],
          "tokenizer": "ngram_tokenizer"
        },
        "search_case_insensitive": {
          "type": "custom",
          "filter": ["lowercase","squash_spaces","trim"],
          "tokenizer": "keyword"
        },
        "index_case_sensitive": {
          "type": "custom",
          "filter": ["squash_spaces","trim","default_min_length"],
          "tokenizer": "ngram_tokenizer"
        },
        "search_case_sensitive": {
          "type": "custom",
          "filter": ["squash_spaces","trim"],
          "tokenizer": "keyword"
        }
      }
    }
  },
  "mappings": {
    "_default_": {
      "_all": { "enabled": false },
      "date_detection": false,
      "dynamic_templates": [
        {
          "case_insensitive": {
            "match_mapping_type": "string",
            "match": "case_insensitive",
            "mapping": {
              "type": "string",
              "analyzer": "index_case_insensitive",
              "search_analyzer": "search_case_insensitive"
            }
          }
        },
        {
          "case_sensitive": {
            "match_mapping_type": "string",
            "match": "case_sensitive",
            "mapping": {
              "type": "string",
              "analyzer": "index_case_sensitive",
              "search_analyzer": "search_case_sensitive"
            }
          }
        },
        {
          "text": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "string",
              "analyzer": "index_raw",
              "copy_to": ["case_insensitive","case_sensitive"],
              "fields": {
                "case_insensitive": {
                  "type": "string",
                  "analyzer": "index_case_insensitive",
                  "search_analyzer": "search_case_insensitive",
                  "term_vector": "with_positions_offsets"
                },
                "case_sensitive": {
                  "type": "string",
                  "analyzer": "index_case_sensitive",
                  "search_analyzer": "search_case_sensitive",
                  "term_vector": "with_positions_offsets"
                }
              }
            }
          }
        }
      ]
    }
  }
}
'

Data

curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis .is a! Test" }'

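Query

The user searches for: tHis test. This gets split into two parts, since the ngrams have a maximum length of 8: (1) tHis and (2) test. Part (1) is matched against the case-sensitive field and part (2) against the case-insensitive field.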

curl -X POST "http://localhost:9200/_search" -d '
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "case_sensitive": {
              "query": "tHis",
              "type": "boolean"
            }
          }
        },
        {
          "match": {
            "case_insensitive": {
              "query": "test",
              "type": "boolean"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<em>"
    ],
    "post_tags": [
      "</em>"
    ],
    "number_of_fragments": 0,
    "require_field_match": false,
    "fields": {
      "*": {}
    }
  }
}
'

Response

{
  "took": 10,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.057534896,
    "hits": [
      {
        "_index": "custom",
        "_type": "test",
        "_id": "1",
        "_score": 0.057534896,
        "_source": {
          "text": "tHis .is a! Test"
        },
        "highlight": {
          "text.case_sensitive": [
            "<em>tHis</em> .is a! Test"
          ],
          "text.case_insensitive": [
            "tHis .is a!<em> Test</em>"
          ]
        }
      }
    ]
  }
}

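Problem: highlighting

As you can see, the response shows that the smart-case search works well. However, I also want to give feedback to the user through highlighting. My current setup uses "term_vector": "with_positions_offsets" to generate the highlights. This does give correct highlights. However, the highlights are returned separately for the case-sensitive and the case-insensitive field.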

"highlight": {
  "text.case_sensitive": [
    "<em>tHis</em> .is a! Test"
  ],
  "text.case_insensitive": [
    "tHis .is a!<em> Test</em>"
  ]
}

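This requires me to manually zip multiple highlights on the same field into one combined highlight before returning it to the user. This becomes very painful when highlights get more complex and can overlap.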
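To make the manual merge concrete, here is a rough client-side sketch. It assumes number_of_fragments is 0, so every highlight string is the whole field text with tags inserted, and that the source text itself never contains the tag strings. The function names are hypothetical.

```python
def extract_spans(fragment, pre="<em>", post="</em>"):
    """Return (start, end) offsets into the untagged text for every
    highlighted region of one highlight string."""
    spans, plain_len, i, start = [], 0, 0, None
    while i < len(fragment):
        if fragment.startswith(pre, i):
            start = plain_len
            i += len(pre)
        elif start is not None and fragment.startswith(post, i):
            spans.append((start, plain_len))
            start = None
            i += len(post)
        else:
            plain_len += 1
            i += 1
    return spans


def merge_highlights(source, fragments, pre="<em>", post="</em>"):
    """Union the highlighted spans of all fragments and re-render them on
    top of the original source text."""
    spans = sorted(s for f in fragments for s in extract_spans(f, pre, post))
    merged = []
    for start, end in spans:
        if merged and start <= merged[-1][1]:
            # Overlapping or touching spans collapse into one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    out, pos = [], 0
    for start, end in merged:
        out.append(source[pos:start] + pre + source[start:end] + post)
        pos = end
    out.append(source[pos:])
    return "".join(out)
```

Even in this simple form the merge has to track offsets and overlaps by hand, which is the bookkeeping I would like to avoid.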

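Question

Is there an alternative setup that actually returns the combined highlight? I.e., I would like to have this as part of my response: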

"highlight": {
  "text": [
    "<em>tHis</em> .is a!<em> Test</em>"
  ]
}

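【Comments】:

What if you use a separate query for the highlighting part and highlight the matches using the "insensitive" field? I'll post the query in an answer.

Tags: elasticsearch full-text-search full-text-indexing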

【Solution 1】:

Attempt

Using a highlight query to get a merged result:

curl -XPOST 'http://localhost:9200/_search' -d '
{
  "size": 1,
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "case_sensitive": {
              "query": "tHis",
              "type": "boolean"
            }
          }
        },
        {
          "match": {
            "case_insensitive": {
              "query": "test",
              "type": "boolean"
            }
          }
        }
      ]
    }
  },
  "highlight": {
    "pre_tags": [
      "<em>"
    ],
    "post_tags": [
      "</em>"
    ],
    "number_of_fragments": 0,
    "require_field_match": false,
    "fields": {
      "*.case_insensitive": {
        "highlight_query": {
          "bool": {
            "must": [
              {
                "match": {
                  "*.case_insensitive": {
                    "query": "tHis",
                    "type": "boolean"
                  }
                }
              },
              {
                "match": {
                  "*.case_insensitive": {
                    "query": "test",
                    "type": "boolean"
                  }
                }
              }
            ]
          }
        }
      }
    }
  }
}
'

Response

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.9364339,
    "hits": [
      {
        "_index": "custom",
        "_type": "test",
        "_id": "1",
        "_score": 0.9364339,
        "_source": {
          "text": "tHis .is a! Test"
        },
        "highlight": {
          "text.case_insensitive": [
            "<em>tHis</em> .is a!<em> Test</em>"
          ]
        }
      }
    ]
  }
}
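If the request body is generated in application code, the same structure could be assembled along these lines. This is a sketch only: the function names are made up, and the parts list is assumed to come from whatever input splitting the application already does.

```python
def match_clause(field, text):
    """One boolean match clause, in the same shape as the request above."""
    return {"match": {field: {"query": text, "type": "boolean"}}}


def build_search_body(parts, size=1):
    """Smart-case main query plus a highlight_query that sends every part
    to the case-insensitive sub-field, so all highlights land on one field."""
    query = {"bool": {"must": [
        match_clause(
            "case_sensitive" if any(c.isupper() for c in p) else "case_insensitive",
            p,
        )
        for p in parts
    ]}}
    highlight_query = {"bool": {"must": [
        match_clause("*.case_insensitive", p) for p in parts
    ]}}
    return {
        "size": size,
        "query": query,
        "highlight": {
            "pre_tags": ["<em>"],
            "post_tags": ["</em>"],
            "number_of_fragments": 0,
            "require_field_match": False,
            "fields": {"*.case_insensitive": {"highlight_query": highlight_query}},
        },
    }
```

Serializing build_search_body(["tHis", "test"]) with json.dumps should give a body equivalent to the one posted with curl above.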

Caveat

When ingesting the following, note the additional lowercase this keyword:

curl -X POST "http://localhost:9200/custom/test" -d '{ "text" : "tHis this .is a! Test" }'

The response to the same query becomes:

{
  "took": 5,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.9364339,
    "hits": [
      {
        "_index": "custom",
        "_type": "test",
        "_id": "1",
        "_score": 0.9364339,
        "_source": {
          "text": "tHis this .is a! Test"
        },
        "highlight": {
          "text.case_insensitive": [
            "<em>tHis</em><em> this</em> .is a!<em> Test</em>"
          ]
        }
      }
    ]
  }
}

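As you can see, the highlight now also includes the lowercase this. For a test example like this one we don't mind. However, for complex queries, users may (and probably will) get confused about when and how smart case takes effect. Especially when the lowercase match includes a field that matches only in lowercase.

Conclusion

This solution merges all highlights into one, but may include unwanted results.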
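The effect can be reproduced without Elasticsearch using plain substring search as a stand-in for ngram matching. This is only an approximation for illustration (the real highlighter works on ngram offsets, which is why it also marks the leading space), and find_spans is a hypothetical helper.

```python
def find_spans(text, part, case_sensitive):
    """Return (start, end) offsets of every occurrence of part in text,
    optionally ignoring case."""
    haystack = text if case_sensitive else text.lower()
    needle = part if case_sensitive else part.lower()
    spans = []
    start = haystack.find(needle)
    while start != -1:
        spans.append((start, start + len(needle)))
        start = haystack.find(needle, start + 1)
    return spans


text = "tHis this .is a! Test"
# Smart-case semantics: tHis contains a capital, so only the exact form matches.
strict = find_spans(text, "tHis", case_sensitive=True)
# Highlight query semantics: everything is matched case-insensitively.
relaxed = find_spans(text, "tHis", case_sensitive=False)
```

Here strict contains only the first word, while relaxed also contains the lowercase this, which is the extra highlight seen in the response.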

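【Discussion】:

This was indeed one of the things I tried, and it gave me the best results. However, there is still one problem with this approach: it gives you all the highlights you want, but may also give you more. I added the explanation as an edit to your answer.
Hmm. You're right. Text that doesn't match under smart case may still match in lowercase.
Accepting your answer, since it's the only one and it was helpful.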