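I. Introduction to Aggregation Analysis
1. What is aggregation analysis in ES?
Aggregation analysis is a core database feature: it computes aggregate values over the data set returned by a query, such as the maximum, minimum, sum, or average of a field (or of a computed expression). As both a search engine and a data store, ES provides powerful aggregation capabilities as well.
Aggregations that compute a metric such as max, min, sum, or avg over a data set are called metric aggregations in ES.
Relational databases, besides aggregate functions, also let you group the queried rows with GROUP BY and then compute metrics per group. In ES, GROUP BY corresponds to bucketing, and such aggregations are called bucket aggregations.
ES also provides matrix aggregations and pipeline aggregations, but these are still being refined.
2. How to write an aggregation query in ES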
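Aggregations are defined in the search request body under the aggregations node, using the following syntax:
"aggregations" : {
    "<aggregation_name>" : {                                  <!-- name of the aggregation -->
        "<aggregation_type>" : {                              <!-- type of the aggregation -->
            <aggregation_body>                                <!-- body: which fields to aggregate on -->
        }
        [,"meta" : { [<meta_data_body>] } ]?                  <!-- metadata -->
        [,"aggregations" : { [<sub_aggregation>]+ } ]?        <!-- sub-aggregations defined inside this aggregation -->
    }
    [,"<aggregation_name_2>" : { ... } ]*                     <!-- name of a further aggregation -->
}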
Note:
aggregations can also be abbreviated as aggs.
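The syntax above can be sketched as a small request builder. This is only an illustration: the helper name agg and the choice of a terms aggregation with a nested max are assumptions made for the example, and the code only builds and prints the JSON body without contacting a cluster.

```python
import json


def agg(agg_type, body, sub_aggs=None, meta=None):
    """Build one aggregation node: {<type>: <body>, [meta], [aggs]}."""
    node = {agg_type: body}
    if meta:
        node["meta"] = meta
    if sub_aggs:
        # "aggs" is the short form of "aggregations"
        node["aggs"] = sub_aggs
    return node


# Hypothetical request: bucket by age, then take the max balance per bucket.
request_body = {
    "size": 0,  # return no hits, only aggregation results
    "aggs": {
        "age_terms": agg(
            "terms",
            {"field": "age"},
            sub_aggs={"max_balance": agg("max", {"field": "balance"})},
        )
    },
}

print(json.dumps(request_body, indent=2))
```

The printed body could be sent as the payload of POST /bank/_search.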
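3. Value sources for aggregations
The values fed into an aggregation can come from a field or from the result of a script.
II. Metric Aggregations
1. max, min, sum, avg
Example 1: find the maximum balance across all customers
POST /bank/_search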
{
"size": 0,
"aggs": {
"masssbalance": {
"max": {
"field": "balance"
}
}
}
}
Result 1:
{
"took": 2080,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"masssbalance": {
"value": 49989
}
}
}
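A minimal sketch of reading this result on the client side. The response dict below is a trimmed copy of Result 1, and the helper name metric_value is an assumption for illustration, not part of any client library.

```python
# Trimmed copy of Result 1; "size": 0 in the request is why "hits" is empty.
response = {
    "hits": {"total": 1000, "max_score": 0, "hits": []},
    "aggregations": {"masssbalance": {"value": 49989}},
}


def metric_value(resp, name):
    """Return the single value of a metric aggregation, looked up by its name."""
    return resp["aggregations"][name]["value"]


max_balance = metric_value(response, "masssbalance")
print(max_balance)
```

The key under "aggregations" is whatever name was chosen in the request, so the lookup has to use that same name.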
Example 2: find the maximum balance among customers aged 24
POST /bank/_search
{
"size": 2,
"query": {
"match": {
"age": 24
}
},
"sort": [
{
"balance": {
"order": "desc"
}
}
],
"aggs": {
"max_balance": {
"max": {
"field": "balance"
}
}
}
}
Result 2:
{
"took": 5,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 42,
"max_score": null,
"hits": [
{
"_index": "bank",
"_type": "_doc",
"_id": "697",
"_score": null,
"_source": {
"account_number": 697,
"balance": 48745,
"firstname": "Mallory",
"lastname": "Emerson",
"age": 24,
"gender": "F",
"address": "318 Dunne Court",
"employer": "Exoplode",
"email": "malloryemerson@exoplode.com",
"city": "Montura",
"state": "LA"
},
"sort": [
48745
]
},
{
"_index": "bank",
"_type": "_doc",
"_id": "917",
"_score": null,
"_source": {
"account_number": 917,
"balance": 47782,
"firstname": "Parks",
"lastname": "Hurst",
"age": 24,
"gender": "M",
"address": "933 Cozine Avenue",
"employer": "Pyramis",
"email": "parkshurst@pyramis.com",
"city": "Lindcove",
"state": "GA"
},
"sort": [
47782
]
}
]
},
"aggregations": {
"max_balance": {
"value": 48745
}
}
}
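Example 3: script as the value source. Compute the average age of all customers, and the average age plus 10 (the script adds 10 to each value before averaging)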
POST /bank/_search?size=0
{
"aggs": {
"avg_age": {
"avg": {
"script": {
"source": "doc.age.value"
}
}
},
"avg_age10": {
"avg": {
"script": {
"source": "doc.age.value + 10"
}
}
}
}
}
Result 3:
{
"took": 86,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"avg_age": {
"value": 30.171
},
"avg_age10": {
"value": 40.171
}
}
}
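Why avg_age10 is exactly avg_age + 10 can be checked with a local emulation. This is a sketch only: the three documents are hypothetical, and a Python lambda stands in for the script that ES evaluates per document.

```python
from statistics import mean

# Hypothetical documents, not taken from the bank index.
docs = [{"age": 24}, {"age": 31}, {"age": 38}]


def avg(documents, value_of):
    """Average the value produced for each document by value_of (field or 'script')."""
    return mean(value_of(d) for d in documents)


avg_age = avg(docs, lambda d: d["age"])          # like "doc.age.value"
avg_age10 = avg(docs, lambda d: d["age"] + 10)   # like "doc.age.value + 10"

print(avg_age, avg_age10)
```

Adding a constant to every value shifts the mean by that constant, which matches the 30.171 and 40.171 reported above.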
Example 4: specify a field and read its value inside the script through _value
POST /bank/_search?size=0
{
"aggs": {
"sum_balance": {
"sum": {
"field": "balance",
"script": {
"source": "_value * 1.03"
}
}
}
}
}
Result 4:
{
"took": 165,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"sum_balance": {
"value": 26486282.11
}
}
}
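As a sanity check on the _value script, the reported sum can be divided by the factor to recover the unscaled total. The unscaled total does not appear in the results above, so it is derived here rather than quoted. The balances list is hypothetical, and Decimal is used only to keep the arithmetic exact.

```python
from decimal import Decimal

reported = Decimal("26486282.11")  # sum_balance from Result 4
factor = Decimal("1.03")           # multiplier used in the script

# Derived value: what the plain sum of balance would have to be.
plain_sum = reported / factor

# Local emulation with hypothetical balances: the script runs once per value.
balances = [1000, 2500, 40000]
scripted_sum = sum(Decimal(b) * factor for b in balances)

print(plain_sum, scripted_sum)
```

Because the factor is applied to every value, the scripted sum equals the plain sum times 1.03.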
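Example 5: supply a default value, via missing, for documents that have no value in the field. If missing is not specified, documents lacking the field are ignored.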
POST /bank/_search?size=0
{
"aggs": {
"avg_age": {
"avg": {
"field": "age",
"missing": 18
}
}
}
}
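The effect of missing can be sketched locally. The documents and the helper avg_with_missing are hypothetical; the function only mimics the behavior described above and is not how ES implements it.

```python
def avg_with_missing(docs, field, missing=None):
    """Average a field; documents without it are skipped unless a default is given."""
    values = []
    for doc in docs:
        value = doc.get(field)
        if value is None:
            if missing is None:
                continue  # no default: the document is ignored
            value = missing  # default supplied: the document counts with this value
        values.append(value)
    return sum(values) / len(values) if values else None


# Hypothetical documents: the third one has no age.
docs = [{"age": 30}, {"age": 40}, {"name": "no age"}]

ignored = avg_with_missing(docs, "age")                 # averages 30 and 40
defaulted = avg_with_missing(docs, "age", missing=18)   # averages 30, 40 and 18

print(ignored, defaulted)
```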
2. Document count (count)
Example 1: count the documents in the bank index whose age is 24
POST /bank/_doc/_count
{
"query": {
"match": {
"age" : 24
}
}
}
Result 1:
{
"count": 42,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
}
}
3. Value count: counts the values found in a field (documents with no value in the field contribute nothing)
Example 1:
POST /bank/_search?size=0
{
"aggs": {
"age_count": {
"value_count": {
"field": "age"
}
}
}
}
Result 1:
{
"took": 2022,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"age_count": {
"value": 1000
}
}
}
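The difference between a document count and a value count can be sketched as follows. The documents are hypothetical and the two helpers are local stand-ins, not ES APIs. Note that value_count counts values, so a multi-valued field contributes once per value.

```python
# Hypothetical documents: one null age, one without the field, one multi-valued.
docs = [
    {"age": 24},
    {"age": None},
    {"city": "Montura"},
    {"age": [30, 31]},
]


def doc_count(documents):
    """Like _count: every matching document counts once."""
    return len(documents)


def value_count(documents, field):
    """Like value_count: count the values present in the field."""
    total = 0
    for doc in documents:
        value = doc.get(field)
        if value is None:
            continue  # no value, nothing to count
        total += len(value) if isinstance(value, list) else 1
    return total


print(doc_count(docs), value_count(docs, "age"))
```

In the bank index every document has exactly one age, which is why both counts come out as 1000 there.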
4. cardinality: count of distinct values
Example 1:
POST /bank/_search?size=0
{
"aggs": {
"age_count": {
"cardinality": {
"field": "age"
}
},
"state_count": {
"cardinality": {
"field": "state.keyword"
}
}
}
}
Note: for state, aggregate on its keyword sub-field (state.keyword).
Result 1:
{
"took": 2074,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"state_count": {
"value": 51
},
"age_count": {
"value": 21
}
}
}
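Conceptually, cardinality is a distinct count, which the sketch below computes exactly with a set. ES itself uses an approximate algorithm (HyperLogLog++), so on large data its result can differ slightly from an exact count. The documents are hypothetical, and the precision_threshold value in the request body is an arbitrary illustration.

```python
# Hypothetical documents, not taken from the bank index.
docs = [
    {"state": "LA", "age": 24},
    {"state": "GA", "age": 24},
    {"state": "LA", "age": 31},
]


def exact_cardinality(documents, field):
    """Exact number of distinct values of a field (ES returns an approximation)."""
    return len({d[field] for d in documents if d.get(field) is not None})


# Request body sketch; precision_threshold trades memory for accuracy.
request_body = {
    "size": 0,
    "aggs": {
        "state_count": {
            "cardinality": {"field": "state.keyword", "precision_threshold": 1000}
        }
    },
}

print(exact_cardinality(docs, "state"), exact_cardinality(docs, "age"))
```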
5. stats: returns five values (count, max, min, avg, sum)
Example 1:
POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"stats": {
"field": "age"
}
}
}
}
Result 1:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"age_stats": {
"count": 1000,
"min": 20,
"max": 40,
"avg": 30.171,
"sum": 30171
}
}
}
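6. Extended stats
Extended statistics. Compared with stats, it returns four more results: the sum of squares, the variance, the standard deviation, and the interval of the mean plus/minus two standard deviations.
Example 1: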
POST /bank/_search?size=0
{
"aggs": {
"age_stats": {
"extended_stats": {
"field": "age"
}
}
}
}
Result 1:
{
"took": 7,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"age_stats": {
"count": 1000,
"min": 20,
"max": 40,
"avg": 30.171,
"sum": 30171,
"sum_of_squares": 946393,
"variance": 36.10375899999996,
"std_deviation": 6.008640362012022,
"std_deviation_bounds": {
"upper": 42.18828072402404,
"lower": 18.153719275975956
}
}
}
}
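The extra fields can be reproduced from count, sum, and sum_of_squares in the result above. This is a worked check using the population-variance formula. The variable name sigma is chosen here to stand for the two-standard-deviation width mentioned in the text.

```python
import math

# Taken from the result above.
count = 1000
total = 30171
sum_of_squares = 946393

avg = total / count
# Population variance: mean of squares minus square of the mean.
variance = sum_of_squares / count - avg ** 2
std_deviation = math.sqrt(variance)

sigma = 2  # the bounds are avg +/- 2 standard deviations
bounds = {
    "upper": avg + sigma * std_deviation,
    "lower": avg - sigma * std_deviation,
}

print(avg, variance, std_deviation, bounds)
```

Up to floating-point rounding, these agree with variance, std_deviation, and std_deviation_bounds in the result.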
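7. Percentiles: the values found at given percentile points
Percentiles accumulate, in ascending order of the field (or script) value, the share of documents at each value (as a percentage of all matching documents), and return the value at which a given share is reached. By default the values at the [ 1, 5, 25, 50, 75, 95, 99 ] percentiles are returned.
Example 1: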
POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age"
}
}
}
}
Result 1:
{
"took": 87,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"age_percents": {
"values": {
"1.0": 20,
"5.0": 21,
"25.0": 25,
"50.0": 31,
"75.0": 35.00000000000001,
"95.0": 39,
"99.0": 40
}
}
}
}
Interpreting the result:
50% of the documents have an age <= 31; put the other way, documents with age <= 31 make up 50% of all matching documents.
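The cumulative-share idea can be sketched with a nearest-rank percentile over per-age document counts. The counts are copied from the 20-bucket terms result later in this document. The count for age 29 (35) is not listed there and is inferred from its sum_other_doc_count, so treat it as an assumption. ES computes percentiles with an approximate algorithm, so this exact method is a stand-in, not what ES runs.

```python
import math

# age -> doc_count, from the terms result; age 29 is inferred (assumption).
AGE_COUNTS = {
    20: 44, 21: 46, 22: 51, 23: 42, 24: 42, 25: 42, 26: 59,
    27: 39, 28: 51, 29: 35, 30: 47, 31: 61, 32: 52, 33: 50,
    34: 49, 35: 52, 36: 52, 37: 42, 38: 39, 39: 60, 40: 45,
}


def percentile(counts, percent):
    """Nearest-rank percentile: smallest value whose cumulative count reaches the rank."""
    total = sum(counts.values())
    rank = max(1, math.ceil(percent * total / 100))
    seen = 0
    for value in sorted(counts):
        seen += counts[value]
        if seen >= rank:
            return value
    raise ValueError("percent must be between 0 and 100")


result = {p: percentile(AGE_COUNTS, p) for p in (1, 5, 25, 50, 75, 95, 99)}
print(result)
```

For this data the nearest-rank values line up with the percentiles ES reported, apart from floating-point noise at the 75th percentile.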
Example 2: specify which percentiles to return
POST /bank/_search?size=0
{
"aggs": {
"age_percents": {
"percentiles": {
"field": "age",
"percents" : [95, 99, 99.9]
}
}
}
}
Result 2:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"age_percents": {
"values": {
"95.0": 39,
"99.0": 40,
"99.9": 40
}
}
}
}
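8. Percentile ranks: the percentage of documents whose value is less than or equal to a given value
Example 1: compute the share of documents whose age is at or below 25 and at or below 30, the inverse of item 7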
POST /bank/_search?size=0
{
"aggs": {
"gge_perc_rank": {
"percentile_ranks": {
"field": "age",
"values": [
25,
30
]
}
}
}
}
Result 1:
{
"took": 8,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"gge_perc_rank": {
"values": {
"25.0": 26.1,
"30.0": 49.2
}
}
}
}
Interpreting the result: about 26.1% of the documents have an age at or below 25, and about 49.2% have an age at or below 30.
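For comparison, exact shares can be computed from per-age document counts. The counts are copied from the 20-bucket terms result later in this document, with the count for age 29 (35) inferred from its sum_other_doc_count, which is an assumption. The reported 26.1% and 49.2% fall between the strict and inclusive shares below, which is consistent with percentile_ranks returning an approximation.

```python
# age -> doc_count, from the terms result; age 29 is inferred (assumption).
AGE_COUNTS = {
    20: 44, 21: 46, 22: 51, 23: 42, 24: 42, 25: 42, 26: 59,
    27: 39, 28: 51, 29: 35, 30: 47, 31: 61, 32: 52, 33: 50,
    34: 49, 35: 52, 36: 52, 37: 42, 38: 39, 39: 60, 40: 45,
}


def percent_matching(counts, predicate):
    """Percentage of documents whose value satisfies the predicate."""
    total = sum(counts.values())
    matched = sum(n for value, n in counts.items() if predicate(value))
    return 100 * matched / total


at_or_below = {v: percent_matching(AGE_COUNTS, lambda a, v=v: a <= v) for v in (25, 30)}
strictly_below = {v: percent_matching(AGE_COUNTS, lambda a, v=v: a < v) for v in (25, 30)}

print(at_or_below, strictly_below)
```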
9. Geo Bounds aggregation: computes the bounding range of the geo-point coordinates in the document set
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geobounds-aggregation.html
10. Geo Centroid aggregation: computes the centroid of the geo-point coordinates
See the official documentation:
https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-geocentroid-aggregation.html
III. Bucket Aggregations
1. Terms Aggregation: groups documents by the distinct values (terms) of a field
Example 1:
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age"
}
}
}
}
Result 1:
{
"took": 2000,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 463,
"buckets": [
{
"key": 31,
"doc_count": 61
},
{
"key": 39,
"doc_count": 60
},
{
"key": 26,
"doc_count": 59
},
{
"key": 32,
"doc_count": 52
},
{
"key": 35,
"doc_count": 52
},
{
"key": 36,
"doc_count": 52
},
{
"key": 22,
"doc_count": 51
},
{
"key": 28,
"doc_count": 51
},
{
"key": 33,
"doc_count": 50
},
{
"key": 34,
"doc_count": 49
}
]
}
}
}
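Interpreting the result:
"doc_count_error_upper_bound": 0: the upper bound on the error of the returned document counts
"sum_other_doc_count": 463: the number of documents that belong to terms not returned in the response
By default, the top 10 buckets are returned, ordered by document count from high to low: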
"buckets": [
{
"key": 31,
"doc_count": 61
},
{
"key": 39,
"doc_count": 60
},
.............
]
There are 61 documents with age 31 and 60 documents with age 39.
size sets how many buckets to return:
Example 2: return 20 buckets
POST /bank/_search?size=0
{
"aggs": {
"age_terms": {
"terms": {
"field": "age",
"size": 20
}
}
}
}
Result 2:
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1000,
"max_score": 0,
"hits": []
},
"aggregations": {
"age_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 35,
"buckets": [
{
"key": 31,
"doc_count": 61
},
{
"key": 39,
"doc_count": 60
},
{
"key": 26,
"doc_count": 59
},
{
"key": 32,
"doc_count": 52
},
{
"key": 35,
"doc_count": 52
},
{
"key": 36,
"doc_count": 52
},
{
"key": 22,
"doc_count": 51
},
{
"key": 28,
"doc_count": 51
},
{
"key": 33,
"doc_count": 50
},
{
"key": 34,
"doc_count": 49
},
{
"key": 30,
"doc_count": 47
},
{
"key": 21,
"doc_count": 46
},
{
"key": 40,
"doc_count": 45
},
{
"key": 20,
"doc_count": 44
},
{
"key": 23,
"doc_count": 42
},
{
"key": 24,
"doc_count": 42
},
{
"key": 25,
"doc_count": 42
},
{
"key": 37,
"doc_count": 42
},
{
"key": 27,
"doc_count": 39
},
{
"key": 38,
"doc_count": 39
}
]
}
}
}
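How size and sum_other_doc_count relate can be sketched with a local emulation over the per-age counts from Result 2. The count for age 29 (35) is inferred from sum_other_doc_count rather than listed, so it is an assumption. Breaking ties by ascending key is also an assumption, chosen because it matches the order shown above. The function terms is a local stand-in, not an ES API.

```python
# age -> doc_count, from Result 2; age 29 is inferred (assumption).
AGE_COUNTS = {
    20: 44, 21: 46, 22: 51, 23: 42, 24: 42, 25: 42, 26: 59,
    27: 39, 28: 51, 29: 35, 30: 47, 31: 61, 32: 52, 33: 50,
    34: 49, 35: 52, 36: 52, 37: 42, 38: 39, 39: 60, 40: 45,
}


def terms(counts, size=10):
    """Return the top `size` buckets by doc_count and the count left in the rest."""
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    buckets = [{"key": key, "doc_count": n} for key, n in ordered[:size]]
    other = sum(n for _, n in ordered[size:])
    return {"sum_other_doc_count": other, "buckets": buckets}


top10 = terms(AGE_COUNTS)            # default size, as in Example 1
top20 = terms(AGE_COUNTS, size=20)   # as in Example 2

print(top10["sum_other_doc_count"], top20["sum_other_doc_count"])
```

With the default size the remaining documents add up to 463, and with size 20 only the 35 documents of the one omitted age are left over.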