[Posted]: 2016-04-04 20:59:57
[Problem description]:
I have a simple query joining two tables that is very slow. The query plan does a seq scan on the large table email_activities (~10m rows), while I believe a nested loop using the index would actually be faster.
I rewrote the query with a subquery to try to force index use, and found something interesting. Looking at the two query plans below: when I limit the subquery's result set to 43k rows, the plan does use the index on email_activities, but raising the limit in the subquery to 44k causes the plan to switch to a seq scan on email_activities. One is clearly more efficient than the other, but Postgres doesn't seem to care.
What causes this? Is there a configuration setting that forces a hash join once one of the sets exceeds a certain size?
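For what it's worth, one way to confirm whether the nested-loop plan really is cheaper at the higher limit is to discourage the seq scan for a single transaction and compare timings. This is a sketch using standard PostgreSQL planner toggles; the query is the one from the question:

```sql
-- Temporarily discourage the seq-scan plan, for this transaction only,
-- so the planner's estimate can be compared against the actual runtime
-- of the nested-loop/index plan at the higher LIMIT.
BEGIN;
SET LOCAL enable_seqscan = off;

EXPLAIN ANALYZE
SELECT COUNT(DISTINCT email_activities.email_recipient_id)
FROM email_activities
WHERE email_recipient_id IN (
    SELECT id
    FROM email_recipients
    WHERE email_campaign_id = 1607
    LIMIT 50000);

ROLLBACK;  -- SET LOCAL reverts automatically at transaction end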
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 43000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
-> Nested Loop (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
-> HashAggregate (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
-> Limit (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
-> Index Scan using index_email_recipients_on_email_campaign_id on email_recipients (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
Index Cond: (email_campaign_id = 1607)
-> Index Only Scan using index_email_activities_on_email_recipient_id on email_activities (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
Index Cond: (email_recipient_id = email_recipients.id)
Heap Fetches: 40789
Total runtime: 224.675 ms
And:
explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id") FROM "email_activities" where email_recipient_id in (select "email_recipients"."id" from email_recipients WHERE "email_recipients"."email_campaign_id" = 1607 limit 50000);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Aggregate (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
-> Hash Semi Join (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
-> Seq Scan on email_activities (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
-> Hash (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
Buckets: 8192 Batches: 1 Memory Usage: 1758kB
-> Limit (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
-> Index Scan using index_email_recipients_on_email_campaign_id on email_recipients (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
Index Cond: (email_campaign_id = 1607)
Total runtime: 3050.660 ms
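The crossover between the two plans sits where the planner's estimated cost of ~50k index probes overtakes its estimate for one seq scan over ~10m rows, and that estimate is driven by the cost parameters. A sketch of settings worth experimenting with per session (the values below are illustrative assumptions, not recommendations; tune them to your hardware):

```sql
-- On a machine where the index and table are mostly cached, lowering
-- random_page_cost makes index probes look cheaper to the planner and
-- shifts the index-scan/seq-scan crossover point upward.
SET random_page_cost = 1.1;       -- default 4.0; lower suits SSD/cached data
SET effective_cache_size = '8GB'; -- hint at available OS + shared cache
-- Then re-run EXPLAIN ANALYZE on the LIMIT 50000 variant to see if the plan flips.
```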
- Version: PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
- email_activities: ~10 million rows
- email_recipients: ~11 million rows
[Discussion]:
- The HashAggregate operation may need too much memory to hold 50k rows. Try increasing work_mem?
- Basic information is missing. Please consider the instructions in the tag info for [postgresql-performance]. Also, your second query uses LIMIT 50000, not 44k as stated above, which widens the difference.
- @ErwinBrandstetter, sorry for the confusion. I only meant that raising the limit from 43k to 44k is what changed the plan to a seq scan. (I walked it down from 50k to 44k...) Thanks for the tag info; this is my first Postgres-related post.
- Does anyone know a clean way to paste \d+ output into a question?
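Following the first comment's work_mem suggestion, the setting can be raised per session before re-testing (the value below is illustrative; hash and aggregate node sizing feeds into the planner's choice as well as execution):

```sql
-- Raise work_mem for this session only, then re-run the EXPLAIN ANALYZE.
-- The hash above used 1758kB, well under typical defaults, but memory
-- estimates for HashAggregate can still influence which plan is chosen.
SET work_mem = '64MB';  -- illustrative; the default is typically 4MB
SHOW work_mem;
```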
标签: sql postgresql postgresql-performance database-indexes