[Question title]: Elasticsearch multi-word synonyms structure explanation
[Posted]: 2019-09-12 11:16:55
[Question description]:

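I am using the synonym_graph feature in Elasticsearch, and it seems to be working fine.

I am trying to get an intuitive understanding of how the new synonym_graph works and how it splits words, by testing the analyzer directly: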

GET my_index/_analyze
{
  "text": "I really love eating lots and lots of fried cheese",
  "analyzer": "my_analyzer"
}

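I would like to know what the analyzer's output means.

In this example, the term "fried cheese" has several synonyms defined, some of them multi-word and some single-word: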

fried cheese => fried cheese, mozzarellasticks, Queso Frito, cheesecurd, friedmozzarella

The analyzer's output is:

{
  "tokens" : [
    {
      "token" : "i",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "realli",
      "start_offset" : 2,
      "end_offset" : 8,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "love",
      "start_offset" : 9,
      "end_offset" : 13,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "eat",
      "start_offset" : 14,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "lot",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "lot",
      "start_offset" : 30,
      "end_offset" : 34,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "friedchees",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 4
    },
    {
      "token" : "fri",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7
    },
    {
      "token" : "mozzarellastick",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 4
    },
    {
      "token" : "queso",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 2
    },
    {
      "token" : "cheesecurd",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 4
    },
    {
      "token" : "friedmozzarella",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 4
    },
    {
      "token" : "fri",
      "start_offset" : 38,
      "end_offset" : 43,
      "type" : "<ALPHANUM>",
      "position" : 7,
      "positionLength" : 3
    },
    {
      "token" : "chees",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 8,
      "positionLength" : 3
    },
    {
      "token" : "frito",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 9,
      "positionLength" : 2
    },
    {
      "token" : "chees",
      "start_offset" : 44,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

I am trying to understand the parameters of the synonym tokens in this result. Take the synonym "Queso Frito" as an example:

{
      "token" : "frito",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 9,
      "positionLength" : 2
    }
{
      "token" : "queso",
      "start_offset" : 38,
      "end_offset" : 50,
      "type" : "SYNONYM",
      "position" : 7,
      "positionLength" : 2
    }

What do all the additional parameters mean: "start_offset", "end_offset", "position", "positionLength"?

[Comments]:

    Tags: elasticsearch analyzer synonym


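    [Answer 1]:

    What do all the additional parameters mean: "start_offset", "end_offset", "position", "positionLength"?

    start_offset is the character offset in the original text where the token begins, here 38 ("fried" starts at character 38 of the sentence, counting from 0).

    end_offset is the character offset where the token ends, here 50 (the offset just past the last character of "cheese").

    position is the token's position in the token stream. Note how it keeps increasing, starting from 0 for "i".

    positionLength is the number of positions the token spans.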
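
The offsets described above can be checked by slicing the original sentence. This is a minimal sketch in plain Python, not anything Elasticsearch itself runs; the helper name `source_span` is made up for illustration, and the token values are copied from the analyzer output in the question. It also shows that the injected SYNONYM tokens all carry the offsets of the whole matched phrase, while the original tokens keep their own.

```python
# Sentence from the question's _analyze request.
text = "I really love eating lots and lots of fried cheese"


def source_span(text, start_offset, end_offset):
    """Return the slice of the original text that a token's offsets point at."""
    # start_offset is inclusive, end_offset is exclusive, like a Python slice.
    return text[start_offset:end_offset]


# (token, start_offset, end_offset) copied from the analyzer output.
tokens = [
    ("realli", 2, 8),
    ("fri", 38, 43),    # original token, type <ALPHANUM>
    ("chees", 44, 50),  # original token, type <ALPHANUM>
    ("queso", 38, 50),  # injected token, type SYNONYM
    ("frito", 38, 50),  # injected token, type SYNONYM
]

for token, start, end in tokens:
    print(f"{token!r:10} -> {source_span(text, start, end)!r}")
```

So offsets always refer back to the input text (which is what highlighting needs), not to the possibly stemmed or substituted token.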

    There isn't much documentation on this. The closest thing I could find is from the Elastic docs:

    token.position (int, read-only)
            The position of the current token
    token.positionIncrement (int, read-only)
             The position increment of the current token
    token.positionLength (int, read-only)
             The position length of the current token
    token.startOffset (int, read-only)
             The start offset of the current token
    token.endOffset (int, read-only)
             The end offset of the current token
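
One way to read position and positionLength together is as edges in a graph: a token starts at node `position` and ends at node `position + positionLength`. The sketch below is plain Python for illustration only, with hypothetical helper names (`build_graph`, `paths`); it assumes positionLength is 1 wherever the analyzer output omits it. It walks every path through the synonym part of the output from the question.

```python
from collections import defaultdict

# (token, position, positionLength) copied from the synonym part of the output.
# Assumption: positionLength is 1 where the output does not show it.
tokens = [
    ("friedchees", 7, 4),
    ("fri", 7, 1),            # type SYNONYM
    ("mozzarellastick", 7, 4),
    ("queso", 7, 2),
    ("cheesecurd", 7, 4),
    ("friedmozzarella", 7, 4),
    ("fri", 7, 3),            # type <ALPHANUM>
    ("chees", 8, 3),          # type SYNONYM
    ("frito", 9, 2),
    ("chees", 10, 1),         # type <ALPHANUM>
]


def build_graph(tokens):
    """Map each start position to the (token, end position) edges leaving it."""
    edges = defaultdict(list)
    for term, position, length in tokens:
        edges[position].append((term, position + length))
    return edges


def paths(edges, start, end):
    """Enumerate every token sequence leading from node `start` to node `end`."""
    if start == end:
        return [[]]
    found = []
    for term, nxt in edges.get(start, []):
        for rest in paths(edges, nxt, end):
            found.append([term] + rest)
    return found


phrases = [" ".join(p) for p in paths(build_graph(tokens), 7, 11)]
for phrase in phrases:
    print(phrase)
```

Read this way, "queso" (position 7, positionLength 2) ends at node 9, which is exactly where "frito" (position 9, positionLength 2) starts, and "frito" ends at node 11, the same node where the single-token synonyms such as "cheesecurd" (position 7, positionLength 4) end. Each alternative is therefore a separate path across the same stretch of the sentence.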
    

    Some additional reading, if you're interested.

    [Comments]:
