Elasticsearch multi-word synonyms structure explanation

【Question title】: Elasticsearch multi-word synonyms structure explanation
【Posted】: 2019-09-12 11:16:55
【Question】:

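I'm using the synonym_graph feature in Elasticsearch, and it seems to work fine.

I'm trying to get an intuitive understanding of how the new synonym_graph works and how it splits words by testing the analyzer directly: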

GET my_index/_analyze
{
  "text": "I really love eating lots and lots of fried cheese",
  "analyzer": "my_analyzer"
}

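I'd like to know what the analyzer's output means.

In this example, the term "fried cheese" has several synonyms defined, some of them multi-word and some single-word: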

fried cheese => fried cheese, mozzarellasticks, Queso Frito, cheesecurd, friedmozzarella

The analyzer's output is:

{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "realli",
      "start_offset" : 2,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "love",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "eat",
      "start_offset" : 14,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "lot",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "lot",
      "start_offset" : 30,
      "end_offset" : 34,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "friedchees",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 4
    },
    {
      "token" : "fri",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7
    },
    {
      "token" : "mozzarellastick",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 4
    },
    {
      "token" : "queso",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 2
    },
    {
      "token" : "cheesecurd",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 4
    },
    {
      "token" : "friedmozzarella",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 4
    },
    {
      "token" : "fri",
      "start_offset" : 38,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7,
      "positionLength" : 3
    },
    {
      "token" : "chees",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 8,
      "positionLength" : 3
    },
    {
      "token" : "frito",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 9,
      "positionLength" : 2
    },
    {
      "token" : "chees",
      "start_offset" : 44,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

I'm trying to understand the parameters of the synonym tokens in this result. Take the synonym "Queso Frito" as an example:

{
  "token" : "frito",
  "start_offset" : 38,
  "end_offset" : 50,
  "type" : "SYNONYM",
  "position" : 9,
  "positionLength" : 2
}
{
  "token" : "queso",
  "start_offset" : 38,
  "end_offset" : 50,
  "type" : "SYNONYM",
  "position" : 7,
  "positionLength" : 2
}

What do all the additional parameters mean: "start_offset", "end_offset", "position", "positionLength"?


Tags: elasticsearch analyzer synonym

【Answer 1】:

What do all the additional parameters mean: "start_offset", "end_offset", "position", "positionLength"?

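start_offset is the character offset in the original text where the token starts, i.e. character 38 (fried starts at character 38 of the whole sentence).

end_offset is the character offset where the token ends, i.e. character 50 (cheese ends at character 50 of the whole sentence). Synonym tokens carry the offsets of the whole matched phrase, which is why queso and frito both report 38 to 50.

position is the position of the token in the token stream. Notice how it keeps increasing, starting from 0 for i.

positionLength is the number of positions the token spans. When the field is absent from the output, the token spans a single position. For example, queso starts at position 7 and spans 2 positions, so it ends at position 9, which is exactly where frito starts.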
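To make the role of position and positionLength more concrete, here is a minimal Python sketch that treats each token from the output above as an edge from `position` to `position + positionLength` and lists every path through the "fried cheese" part of the graph. The tuple layout and the helper names `build_graph` and `paths` are illustrative assumptions, not part of any Elasticsearch API; the numbers are copied from the analyzer output in the question.

```python
from collections import defaultdict

# (token, position, positionLength) copied from the _analyze output.
# positionLength is taken as 1 where the output omits the field.
tokens = [
    ("friedchees", 7, 4),
    ("fri", 7, 1),            # SYNONYM token
    ("mozzarellastick", 7, 4),
    ("queso", 7, 2),
    ("cheesecurd", 7, 4),
    ("friedmozzarella", 7, 4),
    ("fri", 7, 3),            # original <ALPHANUM> token
    ("chees", 8, 3),          # SYNONYM token
    ("frito", 9, 2),
    ("chees", 10, 1),         # original <ALPHANUM> token
]


def build_graph(tokens):
    """Map each start position to the (token, end position) edges leaving it."""
    edges = defaultdict(list)
    for text, position, position_length in tokens:
        edges[position].append((text, position + position_length))
    return edges


def paths(edges, start, end):
    """Return every token sequence that leads from start to end."""
    if start == end:
        return [[]]
    result = []
    for text, next_position in edges.get(start, []):
        for rest in paths(edges, next_position, end):
            result.append([text] + rest)
    return result


phrases = [" ".join(path) for path in paths(build_graph(tokens), 7, 11)]
for phrase in phrases:
    print(phrase)
```

Each printed line is one alternative reading of the same span of text: the single-token synonyms jump straight from position 7 to 11, while "queso frito" and the two "fri chees" variants take two hops each.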

There isn't much documentation on this. The closest thing I could find is from the Elastic docs:

token.position (int, read-only)
        The position of the current token
token.positionIncrement (int, read-only)
        The position increment of the current token
token.positionLength (int, read-only)
        The position length of the current token
token.startOffset (int, read-only)
        The start offset of the current token
token.endOffset (int, read-only)
        The end offset of the current token
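As a rough illustration of how these attributes relate to the question's output, the Python sketch below slices the original sentence with each token's offsets and derives a position increment as the difference between consecutive positions. It assumes positions accumulate from a starting value of -1, as they do in Lucene token streams; the `rows` structure is only for demonstration, and the token values are the first eight entries of the analyzer output.

```python
text = "I really love eating lots and lots of fried cheese"

# (token, start_offset, end_offset, position), in the order returned by _analyze.
tokens = [
    ("i", 0, 1, 0),
    ("realli", 2, 8, 1),
    ("love", 9, 13, 2),
    ("eat", 14, 20, 3),
    ("lot", 21, 25, 4),
    ("lot", 30, 34, 6),
    ("friedchees", 38, 50, 7),
    ("fri", 38, 50, 7),
]

rows = []
previous_position = -1  # assumed starting value, so the first token lands on 0
for token, start, end, position in tokens:
    source = text[start:end]                   # text the offsets point back to
    increment = position - previous_position   # derived, not reported by _analyze
    rows.append((token, source, increment))
    previous_position = position

for token, source, increment in rows:
    print(f"{token:12} <- {source!r:16} increment={increment}")
```

The second lot has an increment of 2, presumably because a token between the two was removed (most likely "and", if the analyzer has a stop filter), while the second token at position 7 has an increment of 0 because it is stacked on the same position as the one before it.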

Some additional reading, if you're interested.
