[Question title]: Multifield wildcard search in ElasticSearch
[Posted]: 2020-07-21 18:20:09
[Question]:

Consider this very basic T-SQL query:

select * from Users
where FirstName like '%dm0e776467@mail.com%' 
or LastName like '%dm0e776467@mail.com%' 
or Email like '%dm0e776467@mail.com%'

How would I write this in Lucene?

I have tried the following:

  1. Query approach (does not work at all, returns no results):

    {
      "query": {
        "bool": {
          "should": [
            { "wildcard": { "firstName": "dm0e776467@mail.com" } },
            { "wildcard": { "lastName": "dm0e776467@mail.com" } },
            { "wildcard": { "email": "dm0e776467@mail.com" } }
          ]
        }
      }
    }

  2. Multi-match approach (returns anything that contains mail.com):

    {
      "query": {
        "multi_match": {
          "query": "dm0e776467@mail.com",
          "fields": [ "firstName", "lastName", "email" ]
        }
      }
    }

  3. Third attempt (returns the expected result, but returns nothing if I enter only "mail"):

    {
      "query": {
        "query_string": {
          "query": "\"dm0e776467@mail.com\"",
          "fields": [ "firstName", "lastName", "email" ],
          "default_operator": "OR",
          "allow_leading_wildcard": true
        }
      }
    }

It seems to me that there is no way to force Elasticsearch to treat the input string as ONE substring?

[Question comments]:

    Tags: tsql elasticsearch full-text-search


    [Answer 1]:

    The standard (default) analyzer will tokenize this email as follows:

    GET _analyze
    {
      "text": "dm0e776467@mail.com",
      "analyzer": "standard"
    }
    

    yielding

    {
      "tokens" : [
        {
          "token" : "dm0e776467",
          ...
        },
        {
          "token" : "mail.com",
          ...
        }
      ]
    }
    

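    This explains why the multi-match query matches anything ending in mail.com and why the wildcard query fails: neither indexed token equals the full address.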
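
    As a rough illustration of that behaviour, here is a small Python sketch that works on the two tokens from the `_analyze` output. The helper names `wildcard_hits` and `match_hits` and the second document's terms are made up for this example, and `fnmatchcase` is only a stand-in for Lucene's `*` and `?` wildcards, so treat it as a model of the idea rather than of Elasticsearch itself.

```python
from fnmatch import fnmatchcase

# Terms the standard analyzer indexes for "dm0e776467@mail.com"
# (taken from the _analyze output in this answer).
INDEXED_TERMS = ["dm0e776467", "mail.com"]


def wildcard_hits(terms, pattern):
    """Term-level wildcard: the pattern must match one whole indexed term."""
    # Assumption: fnmatchcase approximates Lucene wildcard matching here.
    return any(fnmatchcase(term, pattern) for term in terms)


def match_hits(terms, query_terms):
    """Full-text match with the default OR operator: any shared term is a hit."""
    return any(q in terms for q in query_terms)


# The full address is not a single indexed term, so the wildcard finds nothing.
print(wildcard_hits(INDEXED_TERMS, "dm0e776467@mail.com"))

# The multi-match query text is analyzed into the same two terms, and one
# shared term is enough, so a hypothetical document that only contains
# "mail.com" matches as well.
print(match_hits(["someone", "mail.com"], ["dm0e776467", "mail.com"]))
```

    The first call prints `False` and the second prints `True`, which mirrors attempts 1 and 2 from the question.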


    Inspired by this answer, I'd suggest the following modification to your mapping:

    PUT users
    {
      "settings": {
        "analysis": {
          "filter": {
            "email": {
              "type": "pattern_capture",
              "preserve_original": true,
              "patterns": [
                "([^@]+)",
                "(\\p{L}+)",
                "(\\d+)",
                "@(.+)",
                "([^-@]+)"
              ]
            }
          },
          "analyzer": {
            "email": {
              "tokenizer": "uax_url_email",
              "filter": [
                "email",
                "lowercase",
                "unique"
              ]
            }
          }
        }
      },
      "mappings": {
        "properties": {
          "email": {
            "type": "text",
            "analyzer": "email"
          },
          "firstName": {
            "type": "text",
            "fields": {
              "as_email": {
                "type": "text",
                "analyzer": "email"
              }
            }
          },
          "lastName": {
            "type": "text",
            "fields": {
              "as_email": {
                "type": "text",
                "analyzer": "email"
              }
            }
          }
        }
      }
    }
    

    Note that I've used .as_email sub-fields for your firstName and lastName fields; you may not want to force them to be mapped as emails by default.

    Then index a few samples:

    POST _bulk
    {"index":{"_index":"users"}}
    {"firstName":"abc","lastName":"adm0e776467@mail.coms","email":"dm0e776467@mail.com"}
    {"index":{"_index":"users"}}
    {"firstName":"xyz","lastName":"opr","email":"dm0e776467@mail.com"}
    {"index":{"_index":"users"}}
    {"firstName":"zyx","lastName":"dm0e776467@mail.com","email":"qwe"}
    {"index":{"_index":"users"}}
    {"firstName":"abc","lastName":"efg","email":"ijk"}
    

    After that, the wildcard queries work just fine:

    GET users/_search
    {
      "query": {
        "bool": {
          "should": [
            {
              "wildcard": {
                "email": "dm0e776467@mail.com"
              }
            },
            {
              "wildcard": {
                "lastName.as_email": "dm0e776467@mail.com"
              }
            },
            {
              "wildcard": {
                "firstName.as_email": "dm0e776467@mail.com"
              }
            }
          ]
        }
      }
    }
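
    If the field list varies, the same request body could be generated instead of written by hand. A minimal Python sketch, assuming a hypothetical helper named `multifield_wildcard_query` (it only builds the JSON body and does not talk to a cluster):

```python
import json


def multifield_wildcard_query(value, fields):
    """Build a bool/should query with one wildcard clause per field."""
    clauses = [{"wildcard": {field: value}} for field in fields]
    return {"query": {"bool": {"should": clauses}}}


# Same fields and value as in the search request of this answer.
body = multifield_wildcard_query(
    "dm0e776467@mail.com",
    ["email", "lastName.as_email", "firstName.as_email"],
)
print(json.dumps(body, indent=2))
```

    The printed JSON can be sent as the body of `GET users/_search`.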
    

    Do check how this analyzer works under the hood to avoid "surprising" query results:

    GET users/_analyze
    {
      "text": "dm0e776467@mail.com",
      "field": "email"
    }
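
    For a feel of what to expect without a running cluster, the filter chain can be approximated offline. The Python sketch below is an assumption-laden model, not the real analyzer: it assumes the `uax_url_email` tokenizer hands the whole address over as one token, it substitutes `[^\W\d_]` for `\p{L}` because Python's `re` module has no `\p{L}`, and the function name `email_terms` is invented for this example. The order of the terms will likely differ from the real `_analyze` response.

```python
import re

# Patterns from the "email" pattern_capture filter in the mapping.
# Assumption: [^\W\d_] (roughly "any letter") stands in for \p{L}.
PATTERNS = [r"([^@]+)", r"([^\W\d_]+)", r"(\d+)", r"@(.+)", r"([^-@]+)"]


def email_terms(token):
    """Approximate pattern_capture (preserve_original) + lowercase + unique."""
    terms = [token]  # preserve_original keeps the whole token
    for pattern in PATTERNS:
        # Every match of every pattern contributes its capture group.
        for match in re.finditer(pattern, token):
            terms.append(match.group(1))
    seen, unique = set(), []
    for term in (t.lower() for t in terms):
        if term not in seen:  # "unique" filter: drop repeated terms
            seen.add(term)
            unique.append(term)
    return unique


terms = email_terms("dm0e776467@mail.com")
print(terms)
```

    In this model the full address is kept as a term of its own, which is why the wildcard clause without any `*` can match it, and shorter pieces such as `mail` are indexed as well.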
    
