How to use standard tokenizer with preserve_original?

【Question title】: How to use standard tokenizer with preserve_original?
【Posted】: 2018-10-19 21:21:14
【Question】:

I created 2 custom analyzers as shown below, but neither works the way I want. Here is what I want in my inverted index: for example, for the word reb-tn2000xxxl, I need reb, tn2000xxxl, and reb-tn2000xxxl.

{  
   "analysis":{  
      "filter":{  
         "my_word_delimiter":{  
            "split_on_numerics":"true",
            "generate_word_parts":"true",
            "preserve_original":"true",
            "generate_number_parts":"true",
            "catenate_all":"true",
            "split_on_case_change":"true",
            "type":"word_delimiter"
         }
      },
      "analyzer":{  
         "my_analyzer":{  
            "filter":[  
               "standard",
               "lowercase",
               "my_word_delimiter"
            ],
            "type":"custom",
            "tokenizer":"whitespace"
         },
         "standard_caseinsensitive":{  
            "filter":[  
               "standard",
               "lowercase"
            ],
            "type":"custom",
            "tokenizer":"keyword"
         },
         "my_delimiter":{  
            "filter":[  
               "lowercase",
               "my_word_delimiter"
            ],
            "type":"custom",
            "tokenizer":"standard"
         }
      }
   }
}

If I use my_analyzer, which is built on the whitespace tokenizer, and check it with curl, the result looks like this:

  curl -XGET "index/_analyze?analyzer=my_analyzer&pretty=true" -d "reb-tn2000xxxl"
{
  "tokens" : [ {
    "token" : "reb-tn2000xxxl",
    "start_offset" : 0,
    "end_offset" : 14,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "reb",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "rebtn2000xxxl",
    "start_offset" : 0,
    "end_offset" : 14,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "tn",
    "start_offset" : 4,
    "end_offset" : 6,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "2000",
    "start_offset" : 6,
    "end_offset" : 10,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "xxxl",
    "start_offset" : 10,
    "end_offset" : 14,
    "type" : "word",
    "position" : 3
  } ]
}

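So here I am missing the tn2000xxxl split. I can get it if I use the standard tokenizer instead of whitespace, but the problem is that once I use standard, as in the my_delimiter custom analyzer, the original value is no longer in the inverted index. It seems that the standard tokenizer and the preserve_original filter option do not work together. I read somewhere that this is because the standard tokenizer has already split the original before the filter is applied, so the filter never sees the original string. How can I keep the original while still splitting the way the standard tokenizer does?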

curl -XGET "index/_analyze?analyzer=my_delimiter&pretty=true" -d "reb-tn2000xxxl"
{  
   "tokens":[  
      {  
         "token":"reb",
         "start_offset":0,
         "end_offset":3,
         "type":"<ALPHANUM>",
         "position":0
      },
      {  
         "token":"tn2000xxxl",
         "start_offset":4,
         "end_offset":14,
         "type":"<ALPHANUM>",
         "position":1
      },
      {  
         "token":"tn",
         "start_offset":4,
         "end_offset":6,
         "type":"<ALPHANUM>",
         "position":1
      },
      {  
         "token":"tn2000xxxl",
         "start_offset":4,
         "end_offset":14,
         "type":"<ALPHANUM>",
         "position":1
      },
      {  
         "token":"2000",
         "start_offset":6,
         "end_offset":10,
         "type":"<ALPHANUM>",
         "position":2
      },
      {  
         "token":"xxxl",
         "start_offset":10,
         "end_offset":14,
         "type":"<ALPHANUM>",
         "position":3
      }
   ]
}
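
One way to confirm where the original token is lost is to run the standard tokenizer on its own, with no filters. The request body below is a minimal sketch for the _analyze endpoint; it assumes an Elasticsearch version that accepts a JSON body for _analyze, which is a newer form than the query-string parameters used in the curl calls above.

```json
{
  "tokenizer": "standard",
  "text": "reb-tn2000xxxl"
}
```

Judging by the my_delimiter output above, the tokenizer should already emit reb and tn2000xxxl as separate tokens. The word_delimiter filter therefore never receives the hyphenated string, and preserve_original can only preserve each piece.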

【Comments】:

Tags: elasticsearch

【Answer 1】:

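In Elasticsearch you can define multi-fields in your mapping. The behavior you describe is actually quite common. You can have your main text field analyzed with the standard analyzer and add a keyword sub-field alongside it. Here is an example mapping that uses multi-fields, taken from the documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html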

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { 
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

In this example, the "city" field is analyzed with the standard analyzer, and "city.raw" is a keyword that is not analyzed. In other words, "city.raw" holds the original string.
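
Applied to the question, the same idea might look like the sketch below, written as an index-creation request body. The main field keeps the my_delimiter analyzer and filter from the question, and a sub-field holds the whole string. The field name product_code, the sub-field name original, and the analyzer name lowercase_keyword are hypothetical, since the question does not name its field. Lowercasing the sub-field is an assumption about what the asker wants, and the _doc mapping type simply follows the example above, so it may need adjusting for other Elasticsearch versions.

```json
{
  "settings": {
    "analysis": {
      "filter": {
        "my_word_delimiter": {
          "type": "word_delimiter",
          "split_on_numerics": "true",
          "generate_word_parts": "true",
          "generate_number_parts": "true",
          "catenate_all": "true",
          "split_on_case_change": "true",
          "preserve_original": "true"
        }
      },
      "analyzer": {
        "my_delimiter": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "my_word_delimiter"]
        },
        "lowercase_keyword": {
          "type": "custom",
          "tokenizer": "keyword",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "product_code": {
          "type": "text",
          "analyzer": "my_delimiter",
          "fields": {
            "original": {
              "type": "text",
              "analyzer": "lowercase_keyword"
            }
          }
        }
      }
    }
  }
}
```

With such a mapping, product_code would carry reb, tn2000xxxl and the smaller parts, while product_code.original would carry reb-tn2000xxxl as a single lowercase token.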

【Comments】:

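So it can't be done without a second field? At query time you then have to query both fields and combine the results with some "should" and "and" operators, and I believe that could give different results.
You can query both fields at once. You can use a multi_match, query_string, or bool query with multiple must or should clauses. There are many options in the Elasticsearch documentation.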
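
As a sketch of the first option mentioned in this comment, the search body below sends one query string to both the analyzed field and its keyword sub-field. The field names city and city.raw come from the answer's example mapping; the query text is borrowed from the question purely for illustration.

```json
{
  "query": {
    "multi_match": {
      "query": "reb-tn2000xxxl",
      "fields": ["city", "city.raw"]
    }
  }
}
```

A document matches if either field matches, so this behaves like an "or" across the two fields rather than an "and".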
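Is raw a keyword there, or can you use any other word? When I use something else, the ES indexer returns an error. I mean, if I need a third version of a field, can I call it .raw2?
"city.raw" is of type keyword, but it can be named anything. If you need help with your mapping, feel free to post another question or take another look at the multi-fields documentation. Sometimes it takes three or four reads. :)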
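
To illustrate the naming point, here is a hedged sketch of the answer's mapping with a second sub-field added. The name raw2 is taken from the comment above and is arbitrary; the value of "type" is not, and has to be an existing field type such as keyword or text. Using the built-in whitespace analyzer for raw2 is only an assumption made to show a third variant of the field.

```json
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            },
            "raw2": {
              "type": "text",
              "analyzer": "whitespace"
            }
          }
        }
      }
    }
  }
}
```

The extra version would then be addressed as city.raw2 in queries.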