【Question title】: Elasticsearch terms aggregation over a not_analyzed string returns buckets with very low doc counts
【Posted】: 2016-09-22 22:22:43
【Question】:

Using AWS Elasticsearch (2.3), I loaded the sample data from https://www.elastic.co/guide/en/kibana/3.0/snippets/shakespeare.json with the following mapping:

$ curl --url "https://my_es_id.us-east-1.es.amazonaws.com/shakespeare/_mapping"

{
    "shakespeare": {
        "mappings": {
            "act": {
                "properties": {
                    "line_id": {
                        "type": "integer"
                    },
                    "line_number": {
                        "type": "string"
                    },
                    "play_name": {
                        "fields": {
                            "raw": {
                                "index": "not_analyzed",
                                "type": "string"
                            }
                        },
                        "type": "string"
                    },
                    "speaker": {
                        "fields": {
                            "raw": {
                                "index": "not_analyzed",
                                "type": "string"
                            }
                        },
                        "type": "string"
                    },
                    "speech_number": {
                        "type": "integer"
                    },
                    "text_entry": {
                        "type": "string"
                    }
                }
            },
            "line": {
                "properties": {
                    "line_id": {
                        "type": "integer"
                    },
                    "line_number": {
                        "type": "string"
                    },
                    "play_name": {
                        "type": "string"
                    },
                    "speaker": {
                        "type": "string"
                    },
                    "speech_number": {
                        "type": "integer"
                    },
                    "text_entry": {
                        "type": "string"
                    }
                }
            },
            "scene": {
                "properties": {
                    "line_id": {
                        "type": "integer"
                    },
                    "line_number": {
                        "type": "string"
                    },
                    "play_name": {
                        "type": "string"
                    },
                    "speaker": {
                        "type": "string"
                    },
                    "speech_number": {
                        "type": "integer"
                    },
                    "text_entry": {
                        "type": "string"
                    }
                }
            }
        }
    }
}

Now, when I run a query to get the speaker counts over the entire data set, I get the following result.

$ curl -XPOST "https://my_es_id.us-east-1.es.amazonaws.com/shakespeare/_search" -d'
{
    "aggs" : {
        "speakers" : {
            "terms" : { "field" : "speaker.raw"}
        }
    }
}'

{
    "_shards": {
        "failed": 0,
        "successful": 5,
        "total": 5
    },
    "aggregations": {
        "speakers": {
            "buckets": [
                {
                    "doc_count": 4,
                    "key": "BASTARD"
                },
                {
                    "doc_count": 3,
                    "key": "HAMLET"
                },
                {
                    "doc_count": 3,
                    "key": "KING HENRY VIII"
                },
                {
                    "doc_count": 3,
                    "key": "OF SYRACUSE"
                },
                {
                    "doc_count": 3,
                    "key": "PROSPERO"
                },
                {
                    "doc_count": 3,
                    "key": "WARWICK"
                },
                {
                    "doc_count": 2,
                    "key": "ADRIANO DE ARMADO"
                },
                {
                    "doc_count": 2,
                    "key": "ARCHBISHOP OF YORK"
                },
                {
                    "doc_count": 2,
                    "key": "AUFIDIUS"
                },
                {
                    "doc_count": 2,
                    "key": "BENEDICK"
                }
            ],
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 153
        }
    },
    "hits": {
        "hits": [
            {
                "_id": "0",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 1,
                    "line_number": "",
                    "play_name": "Henry IV",
                    "speaker": "",
                    "speech_number": "",
                    "text_entry": "ACT I"
                },
                "_type": "act"
            },
            {
                "_id": "14",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 15,
                    "line_number": "1.1.12",
                    "play_name": "Henry IV",
                    "speaker": "KING HENRY IV",
                    "speech_number": 1,
                    "text_entry": "Did lately meet in the intestine shock"
                },
                "_type": "line"
            },
            {
                "_id": "19",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 20,
                    "line_number": "1.1.17",
                    "play_name": "Henry IV",
                    "speaker": "KING HENRY IV",
                    "speech_number": 1,
                    "text_entry": "The edge of war, like an ill-sheathed knife,"
                },
                "_type": "line"
            },
            {
                "_id": "22",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 23,
                    "line_number": "1.1.20",
                    "play_name": "Henry IV",
                    "speaker": "KING HENRY IV",
                    "speech_number": 1,
                    "text_entry": "Whose soldier now, under whose blessed cross"
                },
                "_type": "line"
            },
            {
                "_id": "24",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 25,
                    "line_number": "1.1.22",
                    "play_name": "Henry IV",
                    "speaker": "KING HENRY IV",
                    "speech_number": 1,
                    "text_entry": "Forthwith a power of English shall we levy;"
                },
                "_type": "line"
            },
            {
                "_id": "25",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 26,
                    "line_number": "1.1.23",
                    "play_name": "Henry IV",
                    "speaker": "KING HENRY IV",
                    "speech_number": 1,
                    "text_entry": "Whose arms were moulded in their mothers womb"
                },
                "_type": "line"
            },
            {
                "_id": "26",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 27,
                    "line_number": "1.1.24",
                    "play_name": "Henry IV",
                    "speaker": "KING HENRY IV",
                    "speech_number": 1,
                    "text_entry": "To chase these pagans in those holy fields"
                },
                "_type": "line"
            },
            {
                "_id": "29",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 30,
                    "line_number": "1.1.27",
                    "play_name": "Henry IV",
                    "speaker": "KING HENRY IV",
                    "speech_number": 1,
                    "text_entry": "For our advantage on the bitter cross."
                },
                "_type": "line"
            },
            {
                "_id": "40",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 41,
                    "line_number": "1.1.38",
                    "play_name": "Henry IV",
                    "speaker": "WESTMORELAND",
                    "speech_number": 2,
                    "text_entry": "Whose worst was, that the noble Mortimer,"
                },
                "_type": "line"
            },
            {
                "_id": "41",
                "_index": "shakespeare",
                "_score": 1.0,
                "_source": {
                    "line_id": 42,
                    "line_number": "1.1.39",
                    "play_name": "Henry IV",
                    "speaker": "WESTMORELAND",
                    "speech_number": 2,
                    "text_entry": "Leading the men of Herefordshire to fight"
                },
                "_type": "line"
            }
        ],
        "max_score": 1.0,
        "total": 111396
    },
    "timed_out": false,
    "took": 28
}

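The document counts in the aggregation buckets seem far too low. What I expected to see is the following speakers and document counts (I computed these myself by explicitly counting the speakers across the entire data set):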

GLOUCESTER 1920
HAMLET 1582
IAGO 1161
FALSTAFF 1117
KING HENRY V 1086
BRUTUS 1051
OTHELLO 928
MARK ANTONY 927
KING HENRY VI 917
DUKE VINCENTIO 909
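
A manual count like the one above could be produced along the following lines. This is a minimal sketch, assuming shakespeare.json is in Elasticsearch bulk format (an action line followed by a source line); the function name `count_speakers` and the sample lines, including the scene document that carries a speaker, are illustrative and not taken from the original post.

```python
import json
from collections import Counter


def count_speakers(bulk_lines, doc_type=None):
    """Count non-empty speakers in bulk-format lines.

    bulk_lines: iterable of JSON strings, alternating action and source lines.
    doc_type:   if given, only documents of this _type are counted.
    """
    counts = Counter()
    current_type = None
    for raw in bulk_lines:
        raw = raw.strip()
        if not raw:
            continue
        obj = json.loads(raw)
        # An action line looks like {"index": {"_type": ..., "_id": ...}}.
        if isinstance(obj.get("index"), dict):
            current_type = obj["index"].get("_type")
            continue
        if doc_type is not None and current_type != doc_type:
            continue
        speaker = obj.get("speaker")
        if speaker:
            counts[speaker] += 1
    return counts


# Illustrative sample only; the scene document is hypothetical.
SAMPLE = [
    '{"index": {"_index": "shakespeare", "_type": "act", "_id": 0}}',
    '{"line_id": 1, "speaker": "", "text_entry": "ACT I"}',
    '{"index": {"_index": "shakespeare", "_type": "line", "_id": 14}}',
    '{"line_id": 15, "speaker": "KING HENRY IV", "text_entry": "Did lately meet in the intestine shock"}',
    '{"index": {"_index": "shakespeare", "_type": "scene", "_id": 900}}',
    '{"line_id": 901, "speaker": "KING HENRY IV", "text_entry": "SCENE II"}',
    '{"index": {"_index": "shakespeare", "_type": "line", "_id": 40}}',
    '{"line_id": 41, "speaker": "WESTMORELAND", "text_entry": "Whose worst was, that the noble Mortimer,"}',
]

print(count_speakers(SAMPLE).most_common())
print(count_speakers(SAMPLE, doc_type="line").most_common())

# Against the real file (path is an assumption):
# with open("shakespeare.json", encoding="utf-8") as f:
#     print(count_speakers(f, doc_type="line").most_common(10))
```

Counting across all types and counting only `line` documents can give different totals for the same speaker, so it is worth deciding which one the comparison should use.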

I have spent hours searching online for the cause of this, but I can't figure it out. What am I doing wrong?

【Discussion】:

    Tags: elasticsearch amazon-elasticsearch


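    【Solution 1】:

    The root cause was a mistake in both the mapping and the way I searched the data. The mapping was set only for doc_type:'act' when it should have been set for doc_type:'line', and the search should have covered only doc_type:'line' rather than everything.

    Detailed answer:

    Following the example on this page: https://www.elastic.co/guide/en/elasticsearch/guide/current/aggregations-and-analysis.html I realized that the error was in the mapping.

    Before:

    • I did not realize that the original data set has multiple doc_types.
    • In the mapping, only doc_type:'act' had field:'speaker' with a not_analyzed version.
    • I did not set any doc_type when searching.
    • I expected the results to pull speakers from doc_type:line, when in fact documents of that doc_type had no 'speaker.raw' property at all.
    • Given this, the counts in the question were also wrong.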
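
The effect described in the list above can be illustrated with a toy model. This is a simplified sketch, not how Elasticsearch works internally: it assumes a terms aggregation on `speaker.raw` only sees documents whose type maps that sub-field. The function name `terms_agg` and the sample documents are hypothetical.

```python
from collections import Counter


def terms_agg(docs, types_with_raw, size=10):
    """Toy terms aggregation: only docs whose type has speaker.raw are counted."""
    counts = Counter(
        doc["speaker"] for doc in docs if doc["_type"] in types_with_raw
    )
    return counts.most_common(size)


# Hypothetical documents, for illustration only.
DOCS = [
    {"_type": "act", "speaker": "HAMLET"},
    {"_type": "line", "speaker": "HAMLET"},
    {"_type": "line", "speaker": "HAMLET"},
    {"_type": "line", "speaker": "IAGO"},
]

# Original mapping: only 'act' has speaker.raw, so 'line' docs are invisible.
print(terms_agg(DOCS, {"act"}))                   # [('HAMLET', 1)]
# Corrected mapping: every type has speaker.raw.
print(terms_agg(DOCS, {"act", "scene", "line"}))  # [('HAMLET', 3), ('IAGO', 1)]
```

Under this model the tiny bucket counts in the question are what you get when only the handful of 'act' documents contribute to the aggregation.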

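    After:

    • The new mapping adds a multi-field to field:'speaker' for each of the doc_types act/scene/line. This is speaker.raw, which is not_analyzed.
    • The new search correctly looks up the speakers of lines only, which was the original intent.
    • The Elasticsearch results now match the counts I obtained manually from the data set. The top 10 speakers in doc_type:line are now:

      GLOUCESTER 1907
      HAMLET 1572
      IAGO 1153
      FALSTAFF 1109
      KING HENRY V 1076
      BRUTUS 1043
      OTHELLO 928
      MARK ANTONY 915
      KING HENRY VI 909
      DUKE VINCENTIO 901

    Here is the correct mapping: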

    {
      "shakespeare" : {
        "mappings" : {
          "line" : {
            "properties" : {
              "line_id" : {
                "type" : "integer"
              },
              "line_number" : {
                "type" : "string"
              },
              "play_name" : {
                "type" : "string",
                "fields" : {
                  "raw" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                  }
                }
              },
              "speaker" : {
                "type" : "string",
                "fields" : {
                  "raw" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                  }
                }
              },
              "speech_number" : {
                "type" : "integer"
              },
              "text_entry" : {
                "type" : "string"
              }
            }
          },
          "act" : {
            "properties" : {
              "line_id" : {
                "type" : "integer"
              },
              "line_number" : {
                "type" : "string"
              },
              "play_name" : {
                "type" : "string",
                "fields" : {
                  "raw" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                  }
                }
              },
              "speaker" : {
                "type" : "string",
                "fields" : {
                  "raw" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                  }
                }
              },
              "speech_number" : {
                "type" : "integer"
              },
              "text_entry" : {
                "type" : "string"
              }
            }
          },
          "scene" : {
            "properties" : {
              "line_id" : {
                "type" : "integer"
              },
              "line_number" : {
                "type" : "string"
              },
              "play_name" : {
                "type" : "string",
                "fields" : {
                  "raw" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                  }
                }
              },
              "speaker" : {
                "type" : "string",
                "fields" : {
                  "raw" : {
                    "type" : "string",
                    "index" : "not_analyzed"
                  }
                }
              },
              "speech_number" : {
                "type" : "integer"
              },
              "text_entry" : {
                "type" : "string"
              }
            }
          }
        }
      }
    }
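
Because the three type mappings above are identical, the mapping body can be generated rather than written out by hand. The following is a sketch only: the helper names are hypothetical, and it assumes the index is created fresh with this body (for example via a PUT to the index URL) and the data reloaded afterwards, which the answer does not state.

```python
import json


def string_with_raw():
    """Analyzed string field with a not_analyzed 'raw' sub-field (ES 2.x syntax)."""
    return {
        "type": "string",
        "fields": {"raw": {"type": "string", "index": "not_analyzed"}},
    }


def type_mapping():
    """Properties shared by every doc_type in the shakespeare index."""
    return {
        "properties": {
            "line_id": {"type": "integer"},
            "line_number": {"type": "string"},
            "play_name": string_with_raw(),
            "speaker": string_with_raw(),
            "speech_number": {"type": "integer"},
            "text_entry": {"type": "string"},
        }
    }


def create_index_body(doc_types=("line", "act", "scene")):
    """Request body for creating the index with the same mapping on each type."""
    return {"mappings": {t: type_mapping() for t in doc_types}}


print(json.dumps(create_index_body(), indent=2))
```

Generating the body this way makes it harder to end up with one type that is missing the `raw` sub-field, which was the original mistake.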
    

    With the new mapping, the results look right:

    curl -XPOST "https://my_es_id/shakespeare/line/_search" -d'
    {
        "aggs" : {
            "speakers" : {
                "terms" : { "field" : "speaker.raw"}
            }
        }
    }'
    {
        "_shards": {
            "failed": 0,
            "successful": 5,
            "total": 5
        },
        "aggregations": {
            "speakers": {
                "buckets": [
                    {
                        "doc_count": 1907,
                        "key": "GLOUCESTER"
                    },
                    {
                        "doc_count": 1572,
                        "key": "HAMLET"
                    },
                    {
                        "doc_count": 1153,
                        "key": "IAGO"
                    },
                    {
                        "doc_count": 1109,
                        "key": "FALSTAFF"
                    },
                    {
                        "doc_count": 1076,
                        "key": "KING HENRY V"
                    },
                    {
                        "doc_count": 1043,
                        "key": "BRUTUS"
                    },
                    {
                        "doc_count": 928,
                        "key": "OTHELLO"
                    },
                    {
                        "doc_count": 915,
                        "key": "MARK ANTONY"
                    },
                    {
                        "doc_count": 909,
                        "key": "KING HENRY VI"
                    },
                    {
                        "doc_count": 901,
                        "key": "DUKE VINCENTIO"
                    }
                ],
                "doc_count_error_upper_bound": 461,
                "sum_other_doc_count": 94715
            }
        },
        "hits": {
            "hits": [
                {
                    "_id": "14",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 15,
                        "line_number": "1.1.12",
                        "play_name": "Henry IV",
                        "speaker": "KING HENRY IV",
                        "speech_number": 1,
                        "text_entry": "Did lately meet in the intestine shock"
                    },
                    "_type": "line"
                },
                {
                    "_id": "19",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 20,
                        "line_number": "1.1.17",
                        "play_name": "Henry IV",
                        "speaker": "KING HENRY IV",
                        "speech_number": 1,
                        "text_entry": "The edge of war, like an ill-sheathed knife,"
                    },
                    "_type": "line"
                },
                {
                    "_id": "22",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 23,
                        "line_number": "1.1.20",
                        "play_name": "Henry IV",
                        "speaker": "KING HENRY IV",
                        "speech_number": 1,
                        "text_entry": "Whose soldier now, under whose blessed cross"
                    },
                    "_type": "line"
                },
                {
                    "_id": "24",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 25,
                        "line_number": "1.1.22",
                        "play_name": "Henry IV",
                        "speaker": "KING HENRY IV",
                        "speech_number": 1,
                        "text_entry": "Forthwith a power of English shall we levy;"
                    },
                    "_type": "line"
                },
                {
                    "_id": "25",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 26,
                        "line_number": "1.1.23",
                        "play_name": "Henry IV",
                        "speaker": "KING HENRY IV",
                        "speech_number": 1,
                        "text_entry": "Whose arms were moulded in their mothers womb"
                    },
                    "_type": "line"
                },
                {
                    "_id": "26",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 27,
                        "line_number": "1.1.24",
                        "play_name": "Henry IV",
                        "speaker": "KING HENRY IV",
                        "speech_number": 1,
                        "text_entry": "To chase these pagans in those holy fields"
                    },
                    "_type": "line"
                },
                {
                    "_id": "29",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 30,
                        "line_number": "1.1.27",
                        "play_name": "Henry IV",
                        "speaker": "KING HENRY IV",
                        "speech_number": 1,
                        "text_entry": "For our advantage on the bitter cross."
                    },
                    "_type": "line"
                },
                {
                    "_id": "40",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 41,
                        "line_number": "1.1.38",
                        "play_name": "Henry IV",
                        "speaker": "WESTMORELAND",
                        "speech_number": 2,
                        "text_entry": "Whose worst was, that the noble Mortimer,"
                    },
                    "_type": "line"
                },
                {
                    "_id": "41",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 42,
                        "line_number": "1.1.39",
                        "play_name": "Henry IV",
                        "speaker": "WESTMORELAND",
                        "speech_number": 2,
                        "text_entry": "Leading the men of Herefordshire to fight"
                    },
                    "_type": "line"
                },
                {
                    "_id": "44",
                    "_index": "shakespeare",
                    "_score": 1.0,
                    "_source": {
                        "line_id": 45,
                        "line_number": "1.1.42",
                        "play_name": "Henry IV",
                        "speaker": "WESTMORELAND",
                        "speech_number": 2,
                        "text_entry": "A thousand of his people butchered;"
                    },
                    "_type": "line"
                }
            ],
            "max_score": 1.0,
            "total": 106228
        },
        "timed_out": false,
        "took": 48
    }
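
The same type-scoped request can also be built from Python's standard library. This is a hedged sketch: the host name is a placeholder, the function name is hypothetical, and the `"size": 0` setting is an addition not present in the original request, included here to leave the hits out of the response so that only the aggregation is returned.

```python
import json
import urllib.request


def build_speaker_agg_request(base_url, index="shakespeare", doc_type="line", size=10):
    """Build (but do not send) a POST request for the speaker terms aggregation."""
    body = {
        "size": 0,  # assumption: we only want the aggregation, not the hits
        "aggs": {
            "speakers": {"terms": {"field": "speaker.raw", "size": size}}
        },
    }
    # Restricting the URL to /index/type/_search limits the search to one doc_type.
    url = f"{base_url.rstrip('/')}/{index}/{doc_type}/_search"
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Placeholder host; replace with the real cluster endpoint.
req = build_speaker_agg_request("https://my_es_id.example.com")
print(req.get_method(), req.full_url)

# To actually send it:
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["aggregations"]["speakers"]["buckets"])
```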
    

    【Comments】:

    • Would you mind adding an answer that shows how you solved it? That will help others find your question in the future :)