SQL (Postgres): Bad Performance with JOIN in EXISTS

【Question title】: SQL (Postgres): Bad Performance with JOIN in EXISTS
【Posted】: 2021-04-21 04:55:52
【Question】:

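I have three tables: person, task, and a junction table person_task (a many-to-many relationship). I need to select 1000 people who have a task that is not closed. My SQL statement is as follows: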

select p.name
from person p
where exists (
    select 1
    from person_task pt
    join task t on pt.task_id = t.id and t.state <> 'closed'
    where pt.person_id = p.id
)
limit 1000

Why is this statement so slow (> 3 min)? EXPLAIN (ANALYZE) produces the following result:

Limit  (cost=1131131.27..3469646.25 rows=1000 width=8) (actual time=11.798..190565.952 rows=1000 loops=1)
  ->  Merge Semi Join  (cost=1131131.27..59201135.16 rows=24832 width=8) (actual time=11.796..190565.168 rows=1000 loops=1)
        Merge Cond: (p.id = pt.person_id)
        ->  Index Scan using personxpk on person p (cost=0.43..1384719.70 rows=2136286 width=16) (actual time=0.005..199.899 rows=1123 loops=1)
        ->  Gather Merge  (cost=1001.03..57305184.95 rows=40517669 width=8) (actual time=8.523..189657.455 rows=10451338 loops=1)
              Workers Planned: 2
              Workers Launched: 2
              ->  Nested Loop  (cost=1.00..52627440.58 rows=16882362 width=8) (actual time=0.588..72881.617 rows=3484269 loops=3)
                    ->  Parallel Index Scan using person_taskx1 on person_task pt (cost=0.56..25821867.88 rows=16882362 width=16) (actual time=0.028..12726.867 rows=3484269 loops=3)
                    ->  Index Scan using taskxpk on task t (cost=0.44..1.59 rows=1 width=8) (actual time=0.017..0.017 rows=1 loops=10452808)
                          Index Cond: (id = pt.task_id)
                          Filter: (state <> 'closed')
Planning Time: 0.627 ms
Execution Time: 190566.989 ms

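【Comments】:

You could convert the WHERE clause with the EXISTS condition into a join and then try re-running the query.
Run VACUUM ANALYZE on all the tables involved. Does that change anything?
Is there no "Rows Removed by Filter:" line in the EXPLAIN output? Which version of PostgreSQL are you using?
I am using version 11.9. There is no "Rows Removed by Filter" line.
VACUUM ANALYZE did not help.. :/

Tags: sql postgresql performance join select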

【Solution 1】:

For this query:

select p.name
from person p
where exists (select 1
              from person_task pt join
                   task t
                   on pt.task_id = t.id and t.state <> 'closed'
              where pt.person_id = p.id
             )
limit 1000;

You want the following indexes:

person_task(person_id, task_id)
task(id, state)

Actually, the second index is not needed if task(id) is the primary key, which seems likely.
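As a rough sketch, the two indexes listed above could be created as follows. The index names are made up for illustration, and IF NOT EXISTS assumes PostgreSQL 9.5 or later (the asker reports 11.9).

```sql
-- Composite index so the EXISTS subquery can look up a person's tasks directly.
-- The index name is hypothetical.
CREATE INDEX IF NOT EXISTS person_task_person_id_task_id_idx
    ON person_task (person_id, task_id);

-- Only relevant if task(id) is NOT already the primary key.
-- The index name is hypothetical.
CREATE INDEX IF NOT EXISTS task_id_state_idx
    ON task (id, state);
```

On a busy table, CREATE INDEX CONCURRENTLY is an option that avoids blocking writes, at the cost of a slower build.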

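Without the first index, the subquery has to scan a lot of data to find a particular person. With your existing indexes, you might find this formulation faster: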

select distinct on (p.id) p.*
from person p join
     person_task pt
     on pt.person_id = p.id join
     task t
     on t.id = pt.task_id
where t.state <> 'closed'
order by p.id;
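One way to check whether this formulation actually helps is to run it under EXPLAIN and compare the plan with the one in the question. The sketch below makes two assumptions that are not part of the query above: it selects only p.name and adds LIMIT 1000, both to match the original statement.

```sql
-- BUFFERS adds shared-buffer hit/read counts to the timing output.
-- Note that EXPLAIN ANALYZE really executes the query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (p.id) p.name
FROM person p
JOIN person_task pt ON pt.person_id = p.id
JOIN task t ON t.id = pt.task_id
WHERE t.state <> 'closed'
ORDER BY p.id
LIMIT 1000;  -- assumption: added to mirror the question's limit
```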

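【Comments】:

I have an index on person_task(task_id, person_id), and one on person_task(person_id). The latter is the one used for the parallel index scan.
I tried creating an additional index on person_task(person_id, task_id), but that did not help.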