Index not used on filtered aggregate query

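【Question title】: Index not used on filtered aggregate query
【Posted】: 2021-03-23 15:19:06
【Question description】:

I need to improve the performance of the following query, which filters on the column status_classification and aggregates over classification -> 'flags' (a jsonb field of the form '{"flags": ["NO_CLASS_FOUND"]}'::jsonb):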

SELECT SUM(CASE WHEN ("result_materials"."classification" -> 'flags') @> '["NO_CLASS_FOUND"]' THEN 1 ELSE 0 END) AS "no_class_found",
    SUM(CASE WHEN ("result_materials"."classification" -> 'flags') @> '["RULE"]' THEN 1 ELSE 0 END) AS "rule",
    SUM(CASE WHEN ("result_materials"."classification" -> 'flags') @> '["NO_MAPPING"]' THEN 1 ELSE 0 END) AS "no_mapping"
FROM "result_materials"
WHERE "result_materials"."status_classification" = 'PROCESSED';

To improve performance I created an index on status_classification, but the query plan shows that the index is never hit and a Seq Scan is performed instead:

 Aggregate  (cost=1010.15..1010.16 rows=1 width=24) (actual time=19.942..19.946 rows=1 loops=1)
   ->  Seq Scan on result_materials  (cost=0.00..869.95 rows=6231 width=202) (actual time=0.024..4.660 rows=6231 loops=1)
         Filter: ((status_classification)::text = 'PROCESSED'::text)
         Rows Removed by Filter: 5
 Planning Time: 1.212 ms
 Execution Time: 20.187 ms

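I have tried the following (all SQL is at the end of the question):

Adding an index on status_classification
Adding a GIN index on classification -> 'flags'
Adding a multi-field GIN index on classification -> 'flags' and status_classification (see here)

The indexes are still not hit, and performance will suffer as the table grows. The cardinality of the status_classification field is low, but classification -> 'flags' contains very few entries, so I would expect an index to be really useful here.

Why isn't the index used? What am I doing wrong?

SQL to recreate my database: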

create table result_materials (
  uuid int,
  status_classification varchar(30),
  classification jsonb
);

insert into result_materials(uuid, classification, status_classification)
select seq
  , case(random() *2)::int
    when 0 then '{"flags": ["NO_CLASS_FOUND"]}'::jsonb
    when 1 then '{"flags": ["RULE"]}'::jsonb
    when 2 then '{"flags": ["NO_MAPPING"]}'::jsonb end
        as dummy
  , case(random() *2)::int
    when 0 then 'NOT_PROCESSABLE'
    when 1 then 'PROCESSABLE' end
        as sta
from generate_series(1, 150000) seq;

Indexes I tried:

-- status_classification
create index other_testes on result_materials (status_classification);

-- classification -> 'flags'
CREATE INDEX idx_testes ON result_materials USING gin ((classification -> 'flags'));

-- multi field gin
-- REQUIRES you to run: CREATE EXTENSION btree_gin;
CREATE INDEX idx_testes ON result_materials USING gin ((classification -> 'flags'), status_classification);

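【Comments】:

No index will speed this up. Indexes are used to reduce the number of rows a query needs to process. But your WHERE condition removes only 5 of 6236 rows, so your query essentially goes through every row of the table. No index will help with that.
By the way: the case() expressions can also be written as: count(*) filter (where classification @> '{"flags": ["NO_MAPPING"]}')
@a_horse_with_no_name Thanks for the feedback, much appreciated. I will try the filter approach. It is reassuring to know I'm not going crazy on the index side.
Note that filter() won't speed things up, but I find it easier to read. I think the only way to speed this up is a parallel seq scan.
But I do find 20 ms for 6236 rows quite slow. In the sample script you provided, however, the index is used, because the WHERE condition filters out half of the rows. With your script I get a runtime of 20 ms for 150000 rows. With only 6k rows it is less than 1 ms.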
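
For readers who want to see the filter() suggestion from the comments applied to the whole query, here is a minimal sketch. It assumes that classification -> 'flags' is always a JSON array, so that the containment test on the whole document matches the same rows as the original test on the 'flags' key. It is meant as a readability rewrite only, not as a performance fix.

```sql
-- Sketch: the question's query rewritten with FILTER instead of SUM(CASE ...).
-- Assumes "flags" is always a JSON array inside classification.
SELECT
    count(*) FILTER (WHERE classification @> '{"flags": ["NO_CLASS_FOUND"]}') AS no_class_found,
    count(*) FILTER (WHERE classification @> '{"flags": ["RULE"]}')           AS rule,
    count(*) FILTER (WHERE classification @> '{"flags": ["NO_MAPPING"]}')     AS no_mapping
FROM result_materials
WHERE status_classification = 'PROCESSED';
```

One behavioural difference to keep in mind: when no row passes the WHERE clause, count(*) returns 0, whereas the original SUM(CASE ...) returns NULL.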

Tags: postgresql indexing

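【Solution 1】:

The query takes 20 ms and the filter removes only 5 of roughly 6k rows, so yes, a sequential scan is a good choice here. Try adding more rows to the table, and check the cardinality of the column used in the WHERE clause.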
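
One way to check the cardinality mentioned above is sketched here. It uses the table and column names from the question; the choice of EXPLAIN options is an assumption about what is useful to look at, not something the answer prescribes.

```sql
-- How many rows does each status value cover?
-- If 'PROCESSED' covers almost the whole table, a Seq Scan is the expected plan.
SELECT status_classification, count(*) AS row_count
FROM result_materials
GROUP BY status_classification
ORDER BY row_count DESC;

-- Refresh planner statistics after loading more rows, then inspect the plan again.
ANALYZE result_materials;

EXPLAIN (ANALYZE, BUFFERS)
SELECT SUM(CASE WHEN (classification -> 'flags') @> '["RULE"]' THEN 1 ELSE 0 END) AS rule
FROM result_materials
WHERE status_classification = 'PROCESSED';
```

If the first query shows that the filtered value matches only a small fraction of the rows, the planner has a reason to consider the index on status_classification; if it matches nearly all rows, reading the whole table is cheaper.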
