Elasticsearch group by field: answers

【Question title】: Elasticsearch group by field
【Posted】: 2021-05-27 10:59:23
【Question description】:

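I want to group search results by a field. For example, my data has userIds that each map to several usernames, so in the search results I want every userId grouped together with its usernames.

With my current aggregation I can group by userId, but I cannot retrieve the list of usernames for each one. This is what I get: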

"aggregations" : {
"by_user_id" : {
  "after_key" : {
    "group_by_search" : 2335
  },
  "buckets" : [
    {
      "key" : {
        "group_by_search" : 2
      },
      "doc_count" : 2
    },
    {
      "key" : {
        "group_by_search" : 1000
      },
      "doc_count" : 4
    },
    {
      "key" : {
        "group_by_search" : 2335
      },
      "doc_count" : 2
    }
  ]
}

What I want is:

"aggregations" : {
"by_corp_id" : {
  "after_key" : {
    "group_by_search" : 2335
  },
  "buckets" : [
    {
      "key" : {
        "group_by_search" : 2,
        "usernames":[1111,222] // list of usernames having the same userId
      },
      "doc_count" : 2
    },
    {
      "key" : {
        "group_by_search" : 1000,
        "usernames":[11 ,0101,1199,222] // list of usernames having the same userId
      },
      "doc_count" : 4
    },
    {
      "key" : {
        "group_by_search" : 2335,
        "usernames":[1111,222] // list of usernames having the same userId
      },
      "doc_count" : 2
    }
  ]
}

Is there a way to achieve this directly with aggregations in Elasticsearch?

Update: I am using the following aggregation:

"aggregations": {
    "by_user_id": {
        "composite": {
            "size": 1000,
            "sources": [
                {
                    "group_by_search": {
                        "terms": {
                            "field": "user_id",
                            "missing_bucket": false,
                            "order": "asc"
                        }
                    }
                }
            ]
        }
    }
}

Thanks.

【Question comments】:

Could you also share the aggregation query you are running?
@Val I have updated the question with the aggregation.

Tags: java elasticsearch elasticsearch-aggregation

【Solution 1】:

All you need to do is add a terms sub-aggregation on the username field, so that each bucket returns the list of its unique usernames:

"aggregations": {
    "by_user_id": {
        "composite": {
            "size": 1000,
            "sources": [
                {
                    "group_by_search": {
                        "terms": {
                            "field": "user_id",
                            "missing_bucket": false,
                            "order": "asc"
                        }
                    }
                }
            ]
        },
        "aggs": {
            "username": {
                "terms": {
                    "field": "username",
                    "size": 1000
                }
            }
        }
    }
}
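
Reading the result back on the client side might look like the following Python sketch. The `response` dict is a hand-written sample shaped like what a composite aggregation with a terms sub-aggregation returns; its values are illustrative assumptions, not real output, and `usernames_by_user_id` is a hypothetical helper name.

```python
# Hand-written sample in the shape of a composite + terms sub-aggregation
# response. The values are illustrative assumptions, not real output.
response = {
    "aggregations": {
        "by_user_id": {
            "after_key": {"group_by_search": 1000},
            "buckets": [
                {
                    "key": {"group_by_search": 2},
                    "doc_count": 2,
                    "username": {
                        "buckets": [
                            {"key": 222, "doc_count": 1},
                            {"key": 1111, "doc_count": 1},
                        ]
                    },
                },
                {
                    "key": {"group_by_search": 1000},
                    "doc_count": 3,
                    "username": {
                        "buckets": [
                            {"key": 11, "doc_count": 1},
                            {"key": 222, "doc_count": 1},
                            {"key": 1199, "doc_count": 1},
                        ]
                    },
                },
            ],
        }
    }
}


def usernames_by_user_id(resp):
    """Map each user_id bucket to the usernames listed in its sub-aggregation."""
    result = {}
    for bucket in resp["aggregations"]["by_user_id"]["buckets"]:
        user_id = bucket["key"]["group_by_search"]
        # Each sub-bucket key is one distinct username for this user_id.
        result[user_id] = [b["key"] for b in bucket["username"]["buckets"]]
    return result


grouped = usernames_by_user_id(response)
print(grouped)
```

Because the terms sub-aggregation already deduplicates, each list holds distinct usernames only, capped at the `size` set on the sub-aggregation.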

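top_hits is also possible, but you would get a lot of duplicates, and you would need to return a large number of hits to be sure of covering every distinct username.

If your username field has high cardinality (>1000), it is better to move the terms aggregation on username into the composite sources array and iterate over all the buckets yourself, like this: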

"aggregations": {
    "by_user_id": {
        "composite": {
            "size": 1000,
            "sources": [
                {
                    "group_by_search": {
                        "terms": {
                            "field": "user_id",
                            "missing_bucket": false,
                            "order": "asc"
                        }
                    }
                },
                {
                    "group_by_username": {
                        "terms": {
                            "field": "username",
                            "missing_bucket": false,
                            "order": "asc"
                        }
                    }
                }
            ]
        }
    }
}
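
Iterating over all the buckets means re-running the query with the composite `after` parameter set to the previous page's `after_key` until no buckets come back. A minimal Python sketch of that loop follows. `run_query` is a hypothetical callable standing in for whatever client call executes the search and returns the `by_user_id` part of the response, and `fake_run_query` only replays canned pages with illustrative values.

```python
from collections import defaultdict


def collect_usernames(run_query):
    """Page through a two-source composite aggregation, grouping by user_id.

    `run_query(after)` is a hypothetical callable: it should execute the
    search with the composite "after" parameter set to `after` (omitted when
    `after` is None) and return the "by_user_id" aggregation result.
    """
    grouped = defaultdict(list)
    after = None
    while True:
        agg = run_query(after)
        for bucket in agg["buckets"]:
            key = bucket["key"]
            grouped[key["group_by_search"]].append(key["group_by_username"])
        after = agg.get("after_key")
        # An empty page or a missing after_key means every bucket was read.
        if after is None or not agg["buckets"]:
            break
    return dict(grouped)


def fake_run_query(after):
    """Stand-in for a real search call; returns canned pages (assumed data)."""
    if after is None:
        return {
            "after_key": {"group_by_search": 2, "group_by_username": 1111},
            "buckets": [
                {"key": {"group_by_search": 2, "group_by_username": 222}, "doc_count": 1},
                {"key": {"group_by_search": 2, "group_by_username": 1111}, "doc_count": 1},
            ],
        }
    if after == {"group_by_search": 2, "group_by_username": 1111}:
        return {
            "after_key": {"group_by_search": 1000, "group_by_username": 11},
            "buckets": [
                {"key": {"group_by_search": 1000, "group_by_username": 11}, "doc_count": 1},
            ],
        }
    return {"buckets": []}


pairs = collect_usernames(fake_run_query)
print(pairs)
```

Each bucket here is one (user_id, username) pair, so the number of usernames per user is no longer capped by a sub-aggregation `size`; the trade-off is more round trips.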

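【Comments】:

The maximum allowed size for top_hits is 100, but I can have as many as 1000 usernames for one userId.
@Manish That is why I suggest using terms.
Yes, your solution looks good. One more question: what if there are more than 1000 usernames? Should I use pagination or something else?
You can increase the size, but my question is: how can a given user ID have more than 1000 distinct usernames? What is the use case behind it?
username/userId is just one kind of data; we have other kinds as well, for example EmployeeId and CompanyId, where a single company can have 100,000 employees working for it.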

【Solution 2】:

You can use a top hits aggregation to get the list of all usernames that share the same id.

Here is a working example.

Index data:

{
  "usernames": 3,
  "user_id": 2
}
{
  "usernames": 1,
  "user_id": 1
}
{
  "usernames": 2,
  "user_id": 1
}

Search query:

{
  "size": 0,
  "aggregations": {
    "by_user_id": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "group_by_search": {
              "terms": {
                "field": "user_id",
                "missing_bucket": false,
                "order": "asc"
              }
            }
          }
        ]
      },
      "aggs": {
        "list_names": {
          "top_hits": {
            "_source": {
              "includes": [
                "usernames"
              ]
            }
          }
        }
      }
    }
  }
}

Search result:

"aggregations": {
    "by_user_id": {
      "after_key": {
        "group_by_search": 2      
      },
      "buckets": [
        {
          "key": {
            "group_by_search": 1        // note this
          },
          "doc_count": 2,
          "list_names": {
            "hits": {
              "total": {
                "value": 2,
                "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "66362501",
                  "_type": "_doc",
                  "_id": "1",
                  "_score": 1.0,
                  "_source": {
                    "usernames": 1             // note this
                  }
                },
                {
                  "_index": "66362501",
                  "_type": "_doc",
                  "_id": "2",
                  "_score": 1.0,
                  "_source": {
                    "usernames": 2           // note this
                  }
                }
              ]
            }
          }
        },
        {
          "key": {
            "group_by_search": 2
          },
          "doc_count": 1,
          "list_names": {
            "hits": {
              "total": {
                "value": 1,
                "relation": "eq"
              },
              "max_score": 1.0,
              "hits": [
                {
                  "_index": "66362501",
                  "_type": "_doc",
                  "_id": "3",
                  "_score": 1.0,
                  "_source": {
                    "usernames": 3       
                  }
                }
              ]
            }
          }
        }
      ]
    }
  }
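
Pulling the usernames out of the `list_names` hits could be sketched in Python as below. The `response` dict is trimmed by hand from the result above to the fields the code reads, and `usernames_from_top_hits` is a hypothetical helper name. Note that top_hits returns only 3 hits per bucket unless its `size` is set, so larger groups would need an explicit `size`.

```python
# Trimmed by hand from the search result above: only the fields read below.
response = {
    "aggregations": {
        "by_user_id": {
            "buckets": [
                {
                    "key": {"group_by_search": 1},
                    "doc_count": 2,
                    "list_names": {"hits": {"hits": [
                        {"_id": "1", "_source": {"usernames": 1}},
                        {"_id": "2", "_source": {"usernames": 2}},
                    ]}},
                },
                {
                    "key": {"group_by_search": 2},
                    "doc_count": 1,
                    "list_names": {"hits": {"hits": [
                        {"_id": "3", "_source": {"usernames": 3}},
                    ]}},
                },
            ]
        }
    }
}


def usernames_from_top_hits(resp):
    """Collect the distinct usernames found in each bucket's top_hits."""
    result = {}
    for bucket in resp["aggregations"]["by_user_id"]["buckets"]:
        names = []
        for hit in bucket["list_names"]["hits"]["hits"]:
            name = hit["_source"]["usernames"]
            # top_hits returns documents, so the same username can repeat.
            if name not in names:
                names.append(name)
        result[bucket["key"]["group_by_search"]] = names
    return result


names_by_id = usernames_from_top_hits(response)
print(names_by_id)
```

The deduplication step is needed because top_hits returns one entry per document, not one per distinct value.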

【Comments】:

terms would be better than top_hits in this case, because top_hits may return many duplicate usernames.