【Posted】: 2020-07-18 09:29:54
【Question】:
I'm stuck on jsonb indexing and need help. I have a table with a jsonb column:
+-------+----------+------------------------------------------------------------+-------+
|id     |measure_id|parameters                                                  |value  |
+-------+----------+------------------------------------------------------------+-------+
|564174 |19        |{"1": 12, "2": 59, "5": 79, "6": 249, "7": 248, "8": 412}   |42.461 |
|564176 |19        |{"1": 12, "2": 59, "5": 80, "6": 249, "7": 248, "8": 412}   |46.198 |
|568244 |19        |{"1": 12, "2": 316, "5": 129, "6": 249, "7": 248, "8": 412} |19.482 |
|568246 |19        |{"1": 12, "2": 316, "5": 130, "6": 249, "7": 248, "8": 412} |20.051 |
|572313 |19        |{"1": 12, "2": 331, "5": 113, "6": 249, "7": 248, "8": 412} |7.098  |
|596434 |19        |{"1": 193, "2": 297, "5": 124, "6": 249, "7": 248, "8": 412}|103.253|
|682354 |22        |{"1": 427, "2": 25, "5": 121, "6": 426, "9": 441, "11": 428}|0.132  |
|686423 |22        |{"1": 427, "2": 60, "5": 72, "6": 426, "9": 443, "11": 428} |0.000  |
|1682439|44        |{"1": 193, "2": 518, "5": 91, "6": 426, "9": 429, "11": 431}|8.321  |
|1686787|44        |{"1": 193, "2": 515, "5": 96, "6": 426, "9": 429, "11": 431}|23.062 |
+-------+----------+------------------------------------------------------------+-------+
These are statistics: each row holds a measure and a set of parameter values. The number of parameters differs per measure, so I store them in a jsonb column. What I need to do:
- Select all distinct measures and parameters:
SELECT DISTINCT
measure_id,
jsonb_object_keys(parameters) AS parameter_id,
parameters -> jsonb_object_keys(parameters) AS parameter_value_id
FROM data;
- Select data from this table:
SELECT d.id, d.measure_id, CAST(d.attributes as TEXT) as attributes, CAST(d.parameters as TEXT) as parameters, d.value
FROM data d
WHERE d.measure_id=19
AND (jsonb_extract_path(d.parameters, '1')::bigint in (12))
AND (jsonb_extract_path(d.parameters, '2')::bigint in (2,59))
AND (jsonb_extract_path(d.parameters, '5')::bigint in (79, 80, 129, 130, 113))
AND (jsonb_extract_path(d.parameters, '6')::bigint in (249))
AND (jsonb_extract_path(d.parameters, '7')::bigint in (248))
AND (jsonb_extract_path(d.parameters, '8')::bigint in (412))
ORDER BY d.id;
Both queries run slowly. My indexes:
CREATE INDEX idx_data_measure ON data USING btree (measure_id);
CREATE INDEX idx_data_parameters
ON data USING btree (((parameters ->> '1'::text)::bigint), ((parameters ->> '2'::text)::bigint),
((parameters ->> '5'::text)::bigint), ((parameters ->> '6'::text)::bigint),
((parameters ->> '7'::text)::bigint), ((parameters ->> '8'::text)::bigint),
((parameters ->> '9'::text)::bigint), ((parameters ->> '10'::text)::bigint),
((parameters ->> '11'::text)::bigint), ((parameters ->> '458'::text)::bigint),
((parameters ->> '717'::text)::bigint), ((parameters ->> '718'::text)::bigint),
((parameters ->> '719'::text)::bigint), ((parameters ->> '720'::text)::bigint));
I also tried creating a composite index that leads with measure_id:
CREATE INDEX idx_data_parameters ON data USING btree (measure_id, ((parameters ->> '1'::text)::bigint),...
but that didn't help.
I tried EXPLAIN ANALYZE, but honestly I don't understand the output :(
EXPLAIN ANALYZE
SELECT DISTINCT
measure_id,
jsonb_object_keys(parameters) AS parameter_id,
parameters -> jsonb_object_keys(parameters) AS parameter_value_id
FROM data;
QUERY PLAN
Unique (cost=2212571.28..2222400.17 rows=982889 width=72) (actual time=79346.142..84316.123 rows=5050 loops=1)
  -> Sort (cost=2212571.28..2215028.50 rows=982889 width=72) (actual time=79346.141..82358.141 rows=5586011 loops=1)
       Sort Key: measure_id, (jsonb_object_keys(parameters)), ((parameters -> (jsonb_object_keys(parameters))))
       Sort Method: external merge Disk: 202816kB
       -> Gather (cost=1000.00..2034108.05 rows=982889 width=72) (actual time=2467.949..63448.545 rows=5586011 loops=1)
            Workers Planned: 2
            Workers Launched: 2
            -> Result (cost=0.00..1934819.15 rows=40953700 width=72) (actual time=2432.167..63305.298 rows=1862004 loops=3)
                 -> ProjectSet (cost=0.00..1218129.40 rows=40953700 width=156) (actual time=2432.151..62251.992 rows=1862004 loops=3)
                      -> Parallel Seq Scan on data (cost=0.00..1010289.37 rows=409537 width=124) (actual time=2432.118..61448.821 rows=327630 loops=3)
Planning Time: 0.417 ms
Execution Time: 84406.575 ms
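One possible direction for the DISTINCT query, sketched under the assumption that the schema can be changed: the plan above is dominated by a sequential scan that expands every parameters object (about 5.6 million expanded rows are sorted down to 5,050 distinct ones), and an ordinary index on data is unlikely to avoid that expansion. Keeping one row per parameter in a related table lets the same question be answered without touching the jsonb. The table data_parameter, its column types, and the index name below are hypothetical and not part of the original schema; the casts assume every key and value is an integer, as in the sample rows.

```sql
-- Hypothetical side table: one row per (data row, parameter).
CREATE TABLE data_parameter (
    data_id            bigint  NOT NULL,
    measure_id         integer NOT NULL,
    parameter_id       integer NOT NULL,
    parameter_value_id bigint  NOT NULL,
    PRIMARY KEY (data_id, parameter_id)
);

-- One-off backfill: expand each jsonb object into key/value rows.
-- Assumes all keys and values are integers, as in the sample data.
INSERT INTO data_parameter (data_id, measure_id, parameter_id, parameter_value_id)
SELECT d.id, d.measure_id, p.key::integer, p.value::bigint
FROM data d
CROSS JOIN LATERAL jsonb_each_text(d.parameters) AS p(key, value);

-- Hypothetical index covering the three columns the DISTINCT query reads.
CREATE INDEX idx_data_parameter_lookup
    ON data_parameter (measure_id, parameter_id, parameter_value_id);

-- Same result shape as the original DISTINCT query, without jsonb expansion.
SELECT DISTINCT measure_id, parameter_id, parameter_value_id
FROM data_parameter;
```

The side table has to be kept in sync with data (for example from the application or a trigger), which is the cost of this approach.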
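I suspect my indexes are wrong, but I can't work out how to create the right one. As I understand it, GIN is not a good fit because I need IN clauses on the parameter values, so I went with BTREE. Please help.
EDIT 1: PG version: PostgreSQL 11.8. I also updated the queries to match the sample data.
EDIT 2: Query plan for the data-selection query (SELECT ... WHERE ...):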
Sort (cost=1030.03..1030.04 rows=1 width=83) (actual time=63.659..63.661 rows=5 loops=1)
  Sort Key: id
  Sort Method: quicksort Memory: 26kB
  Buffers: shared hit=4881
  -> Index Scan using idx_data_measure on data d (cost=0.55..1030.02 rows=1 width=83) (actual time=0.044..63.635 rows=5 loops=1)
       Index Cond: (measure_id = 19)
       Filter: (((jsonb_extract_path(parameters, VARIADIC '{2}'::text[]))::bigint = ANY ('{2,59}'::bigint[])) AND ((jsonb_extract_path(parameters, VARIADIC '{1}'::text[]))::bigint = 12) AND ((jsonb_extract_path(parameters, VARIADIC '{6}'::text[]))::bigint = 249) AND ((jsonb_extract_path(parameters, VARIADIC '{7}'::text[]))::bigint = 248) AND ((jsonb_extract_path(parameters, VARIADIC '{8}'::text[]))::bigint = 412) AND ((jsonb_extract_path(parameters, VARIADIC '{5}'::text[]))::bigint = ANY ('{79,80,129,130,113}'::bigint[])))
       Rows Removed by Filter: 28733
       Buffers: shared hit=4881
Planning Time: 0.451 ms
Execution Time: 64.973 ms
I can see that idx_data_measure is being used, but nothing else...
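One likely reason the expression index is ignored: the query filters on jsonb_extract_path(parameters, '1')::bigint, while idx_data_parameters is defined on (parameters ->> '1')::bigint. PostgreSQL considers an expression index only when the expression in the query matches the indexed expression, and these two are different expressions. A minimal sketch of a matching pair follows; the index name idx_data_measure_params and the choice of which parameters to index are assumptions, and the attributes column is left out because it does not appear in the sample data.

```sql
-- Hypothetical index: measure_id first (always an equality test),
-- followed by a few of the parameter expressions.
CREATE INDEX idx_data_measure_params
    ON data USING btree (
        measure_id,
        ((parameters ->> '1')::bigint),
        ((parameters ->> '2')::bigint),
        ((parameters ->> '5')::bigint)
    );

-- The predicates are written with the same expressions as the index
-- definition, so the planner is able to match them to index columns.
SELECT d.id, d.measure_id, CAST(d.parameters AS text) AS parameters, d.value
FROM data d
WHERE d.measure_id = 19
  AND (d.parameters ->> '1')::bigint IN (12)
  AND (d.parameters ->> '2')::bigint IN (2, 59)
  AND (d.parameters ->> '5')::bigint IN (79, 80, 129, 130, 113)
  AND (d.parameters ->> '6')::bigint IN (249)
  AND (d.parameters ->> '7')::bigint IN (248)
  AND (d.parameters ->> '8')::bigint IN (412)
ORDER BY d.id;
```

Whether the planner actually chooses this index depends on the data distribution, so it is worth re-running EXPLAIN (ANALYZE, BUFFERS) and checking that the parameter conditions move from Filter to Index Cond.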
【Comments】:
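- Your select distinct query has to open and expand the parameters object for every row. The indexes you have created on this table cost far more work (and space) than a wide, sparse table or a pair of related tables would.
- I added the pg version to the post and edited the queries to match the sample data. @a_horse_with_no_name That works for a single value. But how do I handle multiple values? Write it as where parameters @> '...' OR parameters @> '...'? That would be a very long query, because users can select any set of parameters.
- It looks like the condition on measure_id is already selective enough to use the index on that column, so you might not need any additional indexes.
- @a_horse_with_no_name Well, I have 3M records on my dev machine and over 200M on the server; this index is not enough there.
- Then you should add the explain (analyze) output from the production server.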
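The containment approach raised in the comments could look roughly like the following sketch. It assumes a GIN index with the jsonb_path_ops operator class (which supports @>); the index name idx_data_parameters_gin is hypothetical. Parameters with a single allowed value collapse into one containment test, and only the parameters with several allowed values need an OR group, which keeps the generated query shorter than one OR branch per full combination.

```sql
-- Hypothetical GIN index; jsonb_path_ops supports the @> operator.
CREATE INDEX idx_data_parameters_gin
    ON data USING gin (parameters jsonb_path_ops);

SELECT d.id, d.measure_id, CAST(d.parameters AS text) AS parameters, d.value
FROM data d
WHERE d.measure_id = 19
  -- Single-valued parameters: one containment test covers all of them.
  AND d.parameters @> '{"1": 12, "6": 249, "7": 248, "8": 412}'
  -- Multi-valued parameters: one containment test per allowed value.
  AND (d.parameters @> '{"2": 2}' OR d.parameters @> '{"2": 59}')
  AND (   d.parameters @> '{"5": 79}'
       OR d.parameters @> '{"5": 80}'
       OR d.parameters @> '{"5": 129}'
       OR d.parameters @> '{"5": 130}'
       OR d.parameters @> '{"5": 113}')
ORDER BY d.id;
```

This is a sketch rather than a measured recommendation: whether the planner prefers the GIN index over idx_data_measure depends on how selective each condition is on the production data.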
Tags: postgresql indexing jsonb