How to paginate nested objects in Elasticsearch without offset?

【Question title】: How to paginate nested objects in Elasticsearch without offset?
【Posted】: 2020-06-30 21:31:05
【Question】:

I want to paginate a nested array in Elasticsearch 7.x. Using from and size is not an option in this case; search_after or the Scroll API would be preferred.

Consider the following (simplified) mapping, in which the field actions is a nested object:

{
  "protocol" : {
    "mappings" : {
      "properties" : {
        "actions" : {
          "type" : "nested",
          "properties" : {
            "data" : {},
            "timestamp" : {
              "type" : "date"
            },
            "type" : {
              "type" : "keyword"
            },
            "user" : {
              "type" : "keyword"
            }
          }
        }
      }
    }
  }
}

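Since the actions array may hold somewhere between 15,000 and 20,000 entries, I want to page through the entries instead of retrieving them all at once. I only ever need to look at one document at a time, so there is no need to combine entries across multiple documents.

I have already tried inner_hits, as well as bucketing with a date_histogram on log.timestamp and with a composite aggregation. However, I could not get the simple pagination I am looking for. Bucketing seems to be a dead end, because I would have to retrieve all items in a bucket rather than just an arbitrary number of top_hits.

Any pointers in the right direction would be much appreciated, as I have been racking my brain over this.

Here is the nested query I used together with inner_hits: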

POST protocol/_search
{
  "_source": "false", 
  "query": {
    "nested": {
      "path": "actions",
      "query": {
        "match": {
          "_id": "<document-id>"
        }
      },
      "inner_hits": {}
    }
  }
}

The query above produces the following result:

{
  "took" : 867,
  "timed_out" : false,
  "_shards" : { ... },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "protocol",
        "_type" : "_doc",
        "_id" : "<document-id>",
        "_score" : 1.0,
        "_source" : { },
        "inner_hits" : {
          "actions" : {
            "hits" : {
              "total" : {
                "value" : 30,
                "relation" : "eq"
              },
              "max_score" : 1.0,
              "hits" : [
                {
                  "_index" : "protocol",
                  "_type" : "_doc",
                  "_id" : "<document-id>",
                  "_nested" : {
                    "field" : "actions",
                    "offset" : 0
                  },
                  "_score" : 1.0,
                  "_source" : {
                    "actor" : "<user-id>",
                    "data" : {
                      ... // arbitrary non-indexed payload
                    },
                    "type" : "attach",
                    "timestamp" : "2020-06-24T06:34:00.665Z"
                  }
                },
                {
                  "_index" : "protocol",
                  "_type" : "_doc",
                  "_id" : "<document-id>",
                  "_nested" : {
                    "field" : "actions",
                    "offset" : 1
                  },
                  "_score" : 1.0,
                  "_source" : {
                    "actor" : "<user-id>",
                    "data" : {
                      ... // arbitrary non-indexed payload
                    },
                    "type" : "update",
                    "timestamp" : "2020-06-23T13:09:04.089Z"
                  }
                },
                {
                  "_index" : "protocol",
                  "_type" : "_doc",
                  "_id" : "<document-id>",
                  "_nested" : {
                    "field" : "actions",
                    "offset" : 2
                  },
                  "_score" : 1.0,
                  "_source" : {
                    "actor" : "<user-id>",
                    "data" : {
                      ... // arbitrary non-indexed payload
                    },
                    "type" : "update",
                    "timestamp" : "2020-06-23T13:08:58.695Z"
                  }
                }
              ]
            }
          }
        }
      }
    ]
  }
}

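【Comments】:

Could you add the inner_hits query you tried, so we can help further?
@Gibbs Sure, I just added the query from my earlier approach and a sample response payload to the original question. Thanks for taking a look!

Tags: elasticsearch pagination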

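【Solution 1】:

After several failed approaches, I finally managed to solve this. Although it is not the most elegant solution, the following composite aggregation creates a separate bucket for each individual item in the nested object. The buckets can then be retrieved page by page.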

POST protocol/_search
{
  "size": 0,
  "query": {
    "match": {
      "_id": "<document-id>"
    }
  },
  "aggs": {
    "actionByUserAndTimestamp": {
      "nested": {
        "path": "actions"
      },
      "aggs": {
        "log": {
          "composite": {
            "size": 10,
            "sources": [
              {
                "actionByActor": {
                  "terms": {
                    "field": "actions.actor"
                  }
                }
              },
              {
                "actionByTimestamp": {
                  "terms": {
                    "field": "actions.timestamp"
                  }
                }
              }
            ]
          },
          "aggs": {
            "items": {
              "top_hits": {
                "size": 1
              }
            }
          }
        }
      }
    }
  }
}
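
To fetch the next page, the composite aggregation accepts an after parameter, which takes the after_key returned with the previous page. Below is a minimal sketch of that loop in Python. The helper names build_page_request and iter_pages are made up for illustration, and search stands for any callable that sends the body to the protocol index and returns the parsed JSON response; none of these names come from the original post.

```python
from typing import Any, Callable, Dict, Iterator, List, Optional


def build_page_request(
    doc_id: str, page_size: int, after: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    """Build the search body shown above, optionally resuming after a composite key."""
    composite: Dict[str, Any] = {
        "size": page_size,
        "sources": [
            {"actionByActor": {"terms": {"field": "actions.actor"}}},
            {"actionByTimestamp": {"terms": {"field": "actions.timestamp"}}},
        ],
    }
    if after is not None:
        # Resume right after the last bucket of the previous page.
        composite["after"] = after
    return {
        "size": 0,
        "query": {"match": {"_id": doc_id}},
        "aggs": {
            "actionByUserAndTimestamp": {
                "nested": {"path": "actions"},
                "aggs": {
                    "log": {
                        "composite": composite,
                        "aggs": {"items": {"top_hits": {"size": 1}}},
                    }
                },
            }
        },
    }


def iter_pages(
    search: Callable[[Dict[str, Any]], Dict[str, Any]],
    doc_id: str,
    page_size: int = 10,
) -> Iterator[List[Dict[str, Any]]]:
    """Yield one list of composite buckets per page until no buckets are left."""
    after: Optional[Dict[str, Any]] = None
    while True:
        response = search(build_page_request(doc_id, page_size, after))
        log = response["aggregations"]["actionByUserAndTimestamp"]["log"]
        buckets = log["buckets"]
        if not buckets:
            return
        yield buckets
        after = log.get("after_key")
        if after is None:
            return
```

With the official Python client, search could be something like lambda body: es.search(index="protocol", body=body), assuming es is an already configured 7.x client. Each bucket then carries its single action under items.hits.hits.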

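This solution requires that the terms specified in the composite aggregation form a unique combination. Otherwise a bucket may contain more than one item, and the extra items are dropped because of the top_hits size limit.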
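
One way to notice when that uniqueness assumption breaks is to look at each bucket's doc_count, which under a nested aggregation counts nested documents. The sketch below flags buckets that hold more than one action; find_ambiguous_buckets is a hypothetical helper name, and it assumes buckets is the list taken from the log aggregation in the response.

```python
from typing import Any, Dict, List


def find_ambiguous_buckets(buckets: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the composite keys of buckets that hold more than one nested action."""
    # With top_hits size 1, only one action per bucket is returned,
    # so any bucket with doc_count > 1 is hiding items.
    return [b["key"] for b in buckets if b.get("doc_count", 0) > 1]
```

If this returns anything, the actor and timestamp pair is not unique for that document, and either another source has to be added to the composite aggregation or the top_hits size raised.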

【Discussion】: