【Question title】: How to optimize filter for big data volume? PostgreSQL
【Posted】: 2020-08-24 05:56:15
【Question description】:

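A few weeks ago our team ran into trouble with a SQL query because the data volume had grown a lot. We would appreciate any advice on how to update the schema or optimize the query while keeping the status filtering logic the same.

In a nutshell: we have two tables, a and b. b has a foreign key to a (many-to-one).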

a

id | processed
1  | TRUE
2  | TRUE

b

a_id | status | type_id | l_id
1    | '1'    | 5       | 105
1    | '3'    | 6       | 105
2    | '2'    | 7       | 105

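For each unique combination of (l_id, type_id, a_id) we can have only one status.

We need to count the rows of a, filtered by the statuses found in b, grouped by a_id. Table a has 5,300,000 rows. Table b has 750,000,000 rows.

So we need to compute the status of each a row by the following rules. For a given a_id there are x rows in b:

1) If at least one of the x statuses equals '3', the status of that a_id is '3'.

2) If all of the x statuses equal '1', the status is '1'.

And so on.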
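The two rules above can be sketched with boolean aggregates instead of building an array per group. This is only an illustration under assumptions: the alias derived_status is hypothetical, the rules hidden behind "and so on" are not given in the question and are left as NULL, and the filter values are copied from the query shown in the question.

```sql
-- Sketch: derive one status per a_id from its rows in b.
-- Only rules 1) and 2) are known; every other case falls through to NULL.
SELECT bt.a_id,
       CASE
           WHEN bool_or(bt.status = '3')  THEN '3'  -- rule 1: at least one '3'
           WHEN bool_and(bt.status = '1') THEN '1'  -- rule 2: all rows are '1'
       END AS derived_status                        -- hypothetical column alias
FROM b AS bt
WHERE bt.l_id = 105
  AND bt.type_id IN (2, 10, 18, 1, 4, 5, 6)
GROUP BY bt.a_id;
```

Unlike the LEFT JOIN in the question, this form only returns a_id values that have at least one matching row in b.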

In the current approach we use the array_agg() function and filter on it in a sub-select. So our query looks like:

SELECT COUNT(*)
FROM (
       SELECT
       FROM (
              SELECT at.id                         as id,
                     BOOL_AND(bt.processed)        AS not_pending,
                     ARRAY_AGG(DISTINCT bt.status) AS status
              FROM a AS at
                     LEFT OUTER JOIN b AS bt
                                     ON (at.id = bt.a_id AND bt.l_id = 105 AND
                                         bt.type_id IN (2,10,18,1,4,5,6))
              WHERE at.processed = True
              GROUP BY at.id) sub
       WHERE not_pending = True
         AND status <@ ARRAY ['1']::"char"[]
     ) counter
;

Our plan looks like this:

Aggregate  (cost=14665999.33..14665999.34 rows=1 width=8) (actual time=1875987.846..1875987.846 rows=1 loops=1)
  ->  GroupAggregate  (cost=14166691.70..14599096.58 rows=5352220 width=37) (actual time=1875987.844..1875987.844 rows=0 loops=1)
        Group Key: at.id
        Filter: (bool_and(bt.processed) AND (array_agg(DISTINCT bt.status) <@ '{1}'::"char"[]))
        Rows Removed by Filter: 5353930
        ->  Sort  (cost=14166691.70..14258067.23 rows=36550213 width=6) (actual time=1860315.593..1864175.762 rows=37430745 loops=1)
              Sort Key: at.id
              Sort Method: external merge  Disk: 586000kB
              ->  Hash Right Join  (cost=1135654.48..8076230.39 rows=36550213 width=6) (actual time=55665.584..1846965.271 rows=37430745 loops=1)
                    Hash Cond: (bt.a_id = at.id)
                    ->  Bitmap Heap Scan on b bt  (cost=882095.79..7418660.65 rows=36704370 width=6) (actual time=51871.658..1826058.186 rows=37430378 loops=1)
                          Recheck Cond: ((l_id = 105) AND (type_id = ANY ('{2,10,18,1,4,5,6}'::integer[])))
                          Rows Removed by Index Recheck: 574462752
                          Heap Blocks: exact=28898 lossy=5726508
                          ->  Bitmap Index Scan on db_page_index_atableobjects  (cost=0.00..872919.69 rows=36704370 width=0) (actual time=51861.815..51861.815 rows=37586483 loops=1)
                                Index Cond: ((l_id = 105) AND (type_id = ANY ('{2,10,18,1,4,5,6}'::integer[])))
                    ->  Hash  (cost=165747.94..165747.94 rows=5352220 width=4) (actual time=3791.710..3791.710 rows=5353930 loops=1)
                          Buckets: 131072  Batches: 128  Memory Usage: 2507kB
                          ->  Seq Scan on a at  (cost=0.00..165747.94 rows=5352220 width=4) (actual time=0.528..2958.004 rows=5353930 loops=1)
                                Filter: processed
                                Rows Removed by Filter: 18659
Planning time: 0.328 ms
Execution time: 1876066.242 ms

As you can see, the query takes a very long time to execute, and we would like it to at least

Plan with track_io_timing enabled:

Aggregate  (cost=14665999.33..14665999.34 rows=1 width=8) (actual time=2820945.285..2820945.285 rows=1 loops=1)
  Buffers: shared hit=23 read=5998844, temp read=414465 written=414880
  I/O Timings: read=2655805.505
  ->  GroupAggregate  (cost=14166691.70..14599096.58 rows=5352220 width=930) (actual time=2820945.283..2820945.283 rows=0 loops=1)
        Group Key: at.id
        Filter: (bool_and(bt.processed) AND (array_agg(DISTINCT bt.status) <@ '{1}'::"char"[]))
        Rows Removed by Filter: 5353930
        Buffers: shared hit=23 read=5998844, temp read=414465 written=414880
        I/O Timings: read=2655805.505
        ->  Sort  (cost=14166691.70..14258067.23 rows=36550213 width=6) (actual time=2804900.123..2808826.358 rows=37430745 loops=1)
              Sort Key: at.id
              Sort Method: external merge  Disk: 586000kB
              Buffers: shared hit=18 read=5998840, temp read=414465 written=414880
              I/O Timings: read=2655805.491
              ->  Hash Right Join  (cost=1135654.48..8076230.39 rows=36550213 width=6) (actual time=55370.788..2791441.542 rows=37430745 loops=1)
                    Hash Cond: (bt.a_id = at.id)
                    Buffers: shared hit=15 read=5998840, temp read=142879 written=142625
                    I/O Timings: read=2655805.491
                    ->  Bitmap Heap Scan on b bt  (cost=882095.79..7418660.65 rows=36704370 width=6) (actual time=51059.047..2769127.810 rows=37430378 loops=1)
                          Recheck Cond: ((l_id = 105) AND (type_id = ANY ('{2,10,18,1,4,5,6}'::integer[])))
                          Rows Removed by Index Recheck: 574462752
                          Heap Blocks: exact=28898 lossy=5726508
                          Buffers: shared hit=13 read=5886842
                          I/O Timings: read=2653254.939
                          ->  Bitmap Index Scan on db_page_index_atableobjects  (cost=0.00..872919.69 rows=36704370 width=0) (actual time=51049.365..51049.365 rows=37586483 loops=1)
                                Index Cond: ((l_id = 105) AND (type_id = ANY ('{2,10,18,1,4,5,6}'::integer[])))
                                Buffers: shared hit=12 read=131437
                                I/O Timings: read=49031.671
                    ->  Hash  (cost=165747.94..165747.94 rows=5352220 width=4) (actual time=4309.761..4309.761 rows=5353930 loops=1)
                          Buckets: 131072  Batches: 128  Memory Usage: 2507kB
                          Buffers: shared hit=2 read=111998, temp written=15500
                          I/O Timings: read=2550.551
                          ->  Seq Scan on a at  (cost=0.00..165747.94 rows=5352220 width=4) (actual time=0.515..3457.040 rows=5353930 loops=1)
                                Filter: processed
                                Rows Removed by Filter: 18659
                                Buffers: shared hit=2 read=111998
                                I/O Timings: read=2550.551
Planning time: 0.347 ms
Execution time: 2821022.622 ms

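【Question comments】:

What is the current value of work_mem? You could try increasing it a lot, but only in the current session, to reduce the recheck condition step.
How much has the data volume grown between when performance was acceptable and now? 2-fold? 10,000-fold? Do you have a plan for the query on the old data?
@pifor, at the moment we are thinking more about optimization than about the possibility of scaling.
@jjanes Hello! Sorry for the late feedback. 1) At the moment this is not the real data volume from production. We decided to generate data to test how our current infrastructure and application would cope. Currently we use a db.r5.xlarge AWS RDS instance with 2 cores, 32GB RAM and 4 vCPUs.
@jjanes The plan with track_io_timing enabled is attached in the updated question body. Thanks!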
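The work_mem suggestion from the first comment can be tried as below. In the plans, "Heap Blocks: exact=28898 lossy=5726508" shows that the bitmap did not fit in work_mem and fell back to page-level (lossy) entries, which is what forces the recheck that discarded 574,462,752 rows. The value '512MB' is an illustrative assumption, not a recommendation from the thread.

```sql
-- Sketch: raise work_mem for the current session only, then re-run the query.
SHOW work_mem;              -- check the current value first
SET work_mem = '512MB';     -- assumed value; applies to this session only

-- Re-run the query from the question under EXPLAIN (ANALYZE, BUFFERS) here
-- and compare the "Heap Blocks: exact=... lossy=..." line with the old plan.

RESET work_mem;             -- return to the configured default
```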

Tags: sql postgresql relational-database postgresql-11

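【Solution 1】:

In the current plan, almost all of the time goes to reading table pages for the bitmap heap scan. You must already have an index on something like (l_id, type_id). If you change it (create a new one, then optionally drop the old one) to be on (l_id, type_id, processed, a_id, status), or perhaps on (l_id, type_id, a_id, status) where processed, then the plan could probably switch to an index-only scan, which avoids reading the table because all the data it needs is present in the index. You will need to make sure the table is well vacuumed for this strategy to be effective. I would just vacuum the table manually once before building the index; then, if it works, you can worry about how to keep it well vacuumed.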
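A minimal sketch of that suggestion, assuming the column names shown in the question. The index names are hypothetical, and CONCURRENTLY is an added choice so that writes to b are not blocked during the build.

```sql
-- Sketch: vacuum once, then build a covering index for an index-only scan.
VACUUM (ANALYZE) b;

-- Hypothetical index name; columns are the ones listed in the answer.
CREATE INDEX CONCURRENTLY b_l_type_processed_a_status_idx
    ON b (l_id, type_id, processed, a_id, status);

-- Partial variant mentioned in the answer (hypothetical name).
-- It only contains rows where processed is true, so it can only be used
-- by a query that itself filters on bt.processed.
CREATE INDEX CONCURRENTLY b_l_type_a_status_processed_idx
    ON b (l_id, type_id, a_id, status)
    WHERE processed;
```

Note that the query in the question evaluates BOOL_AND(bt.processed) over all matching rows without filtering on it, so the partial variant would need the query to be rewritten before the planner could use it.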

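Another option is to increase effective_io_concurrency (I would just set it to 20; if that works, you can play with it more to find the best setting) so that multiple IO read requests on the table can be outstanding at once. How effective this will be depends on your IO system, and I don't know the answer for db.r5.xlarge. The index-only scan is better, because it uses fewer resources, while this method just uses the same resources faster. (That matters if you have multiple similar queries running at the same time. Also, if you pay per IO, you want fewer of them, not the same number done faster.)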
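Trying this for a single session could look like the sketch below. The value 20 is the starting point named in the answer. Whether the setting can be changed this way on a managed instance is an assumption to verify.

```sql
-- Sketch: allow more concurrent read requests during bitmap heap scans.
SHOW effective_io_concurrency;      -- current value
SET effective_io_concurrency = 20;  -- session-level, starting point from the answer

-- Re-run the query from the question under EXPLAIN (ANALYZE, BUFFERS) here
-- and compare the Bitmap Heap Scan timing with the earlier plan.

RESET effective_io_concurrency;
```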

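Another option is to try to change the shape of the plan completely, by doing a nested loop from a to b. For that you need an index on b that has a_id and l_id as its leading columns (in either order). If you already have such an index and the planner doesn't naturally choose such a plan, you may be able to force it with set enable_hashjoin=off. My gut feeling is that a nested loop which needs to kick the other side 5,353,930 times will not be better than what you currently have, even if that other side has an efficient index.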
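An experiment along those lines might be set up as follows. The index name is hypothetical, and SET LOCAL inside a transaction is an added choice so that the planner setting cannot leak into other work on the same connection. The planner may still pick a different join method, so this is a sketch to test with, not a guaranteed way to get the nested loop.

```sql
-- Sketch: index with a_id and l_id leading, then discourage hash joins for one test run.
CREATE INDEX CONCURRENTLY b_a_id_l_id_idx ON b (a_id, l_id);  -- hypothetical name

BEGIN;
SET LOCAL enable_hashjoin = off;   -- only lasts until the end of this transaction

EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*)
FROM (
       SELECT
       FROM (
              SELECT at.id                         AS id,
                     BOOL_AND(bt.processed)        AS not_pending,
                     ARRAY_AGG(DISTINCT bt.status) AS status
              FROM a AS at
                     LEFT OUTER JOIN b AS bt
                                     ON (at.id = bt.a_id AND bt.l_id = 105 AND
                                         bt.type_id IN (2,10,18,1,4,5,6))
              WHERE at.processed = True
              GROUP BY at.id) sub
       WHERE not_pending = True
         AND status <@ ARRAY ['1']::"char"[]
     ) counter;

ROLLBACK;
```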

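【Comments】:

Thank you so much! You really helped us as much as possible xD We ran a vacuum and rebuilt the index, and now it works fine. I don't know why, but I hadn't even considered running VACUUM ANALYZE before building the index. I believe the problem was that we had generated new data in the table, but PostgreSQL had not re-analyzed those changes and built the plan without the new information. Now our plan uses an Index-Only Scan, and it takes about 10 seconds with 1 type_id in the IN clause. The only remaining problem is that it takes longer as more type_id values appear in the condition.
I'm not sure we can optimize that somehow; maybe adding CPU or memory is the only possible solution, we will check. About keeping the database well vacuumed: do you have any first-hand experience with that, or any advice for us? As I understand it, we could configure AUTOVACUUM to be more aggressive for b, and that should help, but do you have any suggestions for its configuration?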
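The thread does not answer the autovacuum question. As a rough sketch of what "more aggressive for b" could mean, per-table storage parameters can lower the thresholds at which autovacuum and autoanalyze trigger. The values below are illustrative assumptions, not tuned recommendations.

```sql
-- Sketch: make autovacuum and autoanalyze trigger earlier on table b only.
ALTER TABLE b SET (
    autovacuum_vacuum_scale_factor  = 0.01,  -- assumed value
    autovacuum_analyze_scale_factor = 0.01   -- assumed value
);

-- Check when the table was last vacuumed and analyzed.
SELECT relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'b';

-- Index-only scans depend on the visibility map: compare the number of
-- all-visible pages with the total number of pages.
SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'b';
```

If relallvisible stays far below relpages after bulk loads, a manual VACUUM of b after each load is the simpler fallback.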

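【Solution 2】:

You can filter and group table B before joining it with A. Also order both tables by ID, because that speeds up the table scan while the join is processed. Please check this code: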

with at as (
    -- processed rows of a, ordered by id
    select distinct at.id, at.processed
    from a AS at
    WHERE at.processed = True
    order by at.id
),

bt as (
    -- filter and group b before the join
    select bt.a_id, bt.l_id, bt.type_id, --BOOL_AND(bt.processed) AS not_pending,
           ARRAY_AGG(DISTINCT bt.status) as status
    from b AS bt
    group by bt.a_id, bt.l_id, bt.type_id
    having bt.l_id = 105 AND bt.type_id IN (2,10,18,1,4,5,6)
    order by bt.a_id
),

counter as (
    select at.id,
           case
               when '1' = all(status) then '1'
               when '3' = any(status) then '3'
               -- fallback rendered as text so that every CASE branch has the same type
               else array_to_string(status, ',') end as status
    from at inner join bt on at.id=bt.a_id
)

select count (*) from counter where status='1'

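【Comments】:

Hi! I'm not sure this suits us, because sorting such a large amount of data would take too much time.
Hi. The sort is performed after the data has been filtered and grouped. The original script joins the tables before any grouping, which is the reason for the long execution time.
@Vad1m Couldn't grouping first lead to a different (possibly wrong) answer, since the join might remove rows and therefore change the counts?
@jjanes Yes, it could lead to a different answer if there were aggregate functions that count rows (count, avg, etc.). But there are none. About the join clause: when joining tables, the query does a table scan (if there are no indexes on the tables). That means it compares the value of each key in one table with the keys in the other. That is close to 4 * 10^15 operations (5.3 million records * 750 million records). This number should be reduced.
@Vad1m Yes, there are a lot of hash comparisons, but I don't see how the number could possibly be as high as you list. The 750 million rows are filtered down to 37 million before the hash is probed. It doesn't compare every value on one side with every value on the other; that is the whole point of a hash join. If the hash join itself were slow, EXPLAIN (ANALYZE) would show that, but it doesn't. 98.8% of the hash join time is spent waiting for rows from its problematic child node.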
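The two figures in this exchange can be rechecked with plain arithmetic. The first expression is the all-pairs product behind the "close to 4 * 10^15" estimate. The second divides the Bitmap Heap Scan's actual time by the Hash Right Join's actual time from the first plan in the question, which appears to be where the roughly 98.8% share comes from. The column aliases are hypothetical.

```sql
-- Sketch: recompute the numbers quoted in the comments.
SELECT 5300000::numeric * 750000000 AS all_pairs_comparisons,    -- rows in a times rows in b
       1826058.186 / 1846965.271    AS child_share_of_join_time; -- child node time / join time
```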