Why does the order of columns in an index matter for a group by in Postgresql? - Answers

[Question title]: Why does the order of columns in an index matter for a group by in Postgresql?
[Posted]: 2016-12-23 16:17:29
[Question]:

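I have a relatively large table (about a million records) with the following columns:

account: character varying(36) not null
group: character varying(255) not null
classification: character varying(255) not null
size: integer not null

The account is actually a UUID, but I don't think that matters here. If I run the following simple query, it takes about 16 seconds on my machine: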

-- "group" is a reserved word in PostgreSQL, so it has to be quoted
select account, "group", classification, max(size)
from mytable
group by account, "group", classification

So far so good. Now suppose I add an index:

CREATE INDEX CONCURRENTLY ON mytable (account, "group", classification);

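If I run the same query again, it now returns in less than half a second. Explaining the query also clearly shows that the index is used.

However, if I rewrite the query as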

select account, "group", classification, max(size)
from mytable
group by account, classification, "group"

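it takes 16 seconds again and the index is no longer used. As far as I can tell the order of the grouping criteria should not matter, but I'm no expert. Any idea why PostgreSQL can't (or doesn't) optimize the latter query? I tried this on PostgreSQL 9.4.

Edit: as requested, here is the output of explain. For the call that uses the index: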

Group  (cost=0.55..133878.11 rows=95152 width=76) (actual time=0.090..660.739 rows=807 loops=1)
  Group Key: group_id, classification_id, account_id
  ->  Index Only Scan using mytable_group_id_classification_id_account_id_idx on mytable  (cost=0.55..126741.72 rows=951518 width=76) (actual time=0.088..534.645 rows=951518 loops=1)
        Heap Fetches: 951518
Planning time: 0.106 ms
Execution time: 660.852 ms

For the call with the order of the group by columns changed:

Group  (cost=162327.31..171842.49 rows=95152 width=76) (actual time=11114.130..13938.487 rows=807 loops=1)
  Group Key: group_id, account_id, classification_id
  ->  Sort  (cost=162327.31..164706.10 rows=951518 width=76) (actual time=11114.127..13775.235 rows=951518 loops=1)
        Sort Key: group_id, account_id, classification_id
        Sort Method: external merge  Disk: 81136kB
        ->  Seq Scan on mytable  (cost=0.00..25562.18 rows=951518 width=76) (actual time=0.009..192.259 rows=951518 loops=1)
Planning time: 0.111 ms
Execution time: 13948.380 ms

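[Question comments]:

Please edit your question and add the output of explain (analyze) for both cases. Formatted text please, no screen shots. Also: have you tried a more recent Postgres version? 9.5 and 9.6 brought some enhancements, especially for aggregation.
Well, I think the order does matter, because the grouping drills down the list: it finds all values of the 1st grouped column, then for each of those all values of the 2nd grouped column, and so on. An index is organized the same way. So if the index covers all the columns in the same order, the planner can use it directly. But if you reorder the columns in the GROUP BY, the planner cannot use an index with a different column order.
@a_horse_with_no_name I haven't tried PostgreSQL 9.6 yet to see whether it behaves differently.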

Tags: postgresql indexing group-by

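[Answer 1]:

You're right: the result is the same no matter what order the columns appear in within the GROUP BY clause, and the same execution plan could be used.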
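To see why the same plan would work, here is a minimal Python sketch of a streaming group aggregate. It is only a model, not PostgreSQL code: it assumes an index on (account, group, classification) hands over rows sorted in that order, and the name `stream_group_max` and the sample rows are made up for illustration. Rows that agree on all three columns sit next to each other in index order, so the aggregate finds the same groups whichever way the grouping key is permuted.

```python
from itertools import groupby

# Hypothetical sample rows: (account, group, classification, size).
rows = [
    ("a1", "g1", "c1", 10),
    ("a1", "g2", "c1", 7),
    ("a1", "g1", "c2", 3),
    ("a2", "g1", "c1", 5),
    ("a1", "g1", "c1", 12),
    ("a2", "g1", "c1", 8),
]

# Model of an index scan on (account, group, classification):
# rows arrive sorted by those three columns, in that order.
index_order = sorted(rows, key=lambda r: (r[0], r[1], r[2]))


def stream_group_max(sorted_rows, key_columns):
    """Take max(size) over runs of adjacent rows with equal keys.

    Nothing is sorted here; the input order is used as given.
    key_columns lists the column positions in GROUP BY order.
    """
    def key(r):
        return tuple(r[i] for i in key_columns)

    result = []
    for _, run in groupby(sorted_rows, key=key):
        run = list(run)
        account, grp, classification = run[0][:3]
        result.append((account, grp, classification, max(r[3] for r in run)))
    return result


# GROUP BY account, group, classification
by_index_order = stream_group_max(index_order, (0, 1, 2))
# GROUP BY account, classification, group
by_permuted = stream_group_max(index_order, (0, 2, 1))

print(by_index_order == by_permuted)  # prints True
```

In this model the permuted key needs no extra sort, which is the sense in which one plan could serve both queries.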

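The PostgreSQL optimizer simply doesn't consider reordering the GROUP BY expressions to see whether a different order would match an existing index.

This is a limitation, and you could ask on the pgsql-hackers list whether an enhancement is wanted here. You could back that up with a patch that implements the desired functionality.

However, I'm not sure such an enhancement would be accepted. Its downside is that the optimizer would have to do more work, which would affect the planning time of every query that uses a GROUP BY clause. Besides, the limitation is easy to work around: just rewrite your query and change the order of the GROUP BY expressions. So I'd say things should stay the way they are now.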

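[Comments]:

[Answer 2]:

Actually, the order of the columns in the GROUP BY clause does affect the result. In practice the rows come back ordered by the GROUP BY columns when the plan groups by sorting, as in the demo that follows (SQL itself guarantees no order without an ORDER BY). If you set your own ORDER BY, the result and the index usage are the same.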
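The point can be sketched in a few lines of Python that model a sort-based aggregate: sort on the GROUP BY key, then aggregate adjacent rows. This is an illustration rather than PostgreSQL's implementation, and `sort_group_max` and the sample rows are hypothetical. The two key orders return the same rows in a different order, and an explicit sort of the output makes them identical.

```python
from itertools import groupby

# Hypothetical sample rows: (mass, volume, loveliness).
coconuts = [
    (0, 1, 9100), (1, 0, 9500), (0, 0, 9900),
    (1, 1, 9300), (0, 1, 9700), (1, 0, 9200),
]


def sort_group_max(rows, key_columns):
    """Sort on the grouping key, then take max(loveliness) per run."""
    def key(r):
        return tuple(r[i] for i in key_columns)

    out = []
    for _, run in groupby(sorted(rows, key=key), key=key):
        run = list(run)
        mass, volume = run[0][:2]
        out.append((mass, volume, max(r[2] for r in run)))
    return out


by_mass_volume = sort_group_max(coconuts, (0, 1))  # GROUP BY mass, volume
by_volume_mass = sort_group_max(coconuts, (1, 0))  # GROUP BY volume, mass

print(by_mass_volume)
print(by_volume_mass)
# Same rows once an explicit ordering is applied to both outputs.
print(sorted(by_mass_volume) == sorted(by_volume_mass))
```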

Demo:

CREATE TABLE coconuts (
  mass int,
  volume int,
  loveliness int
);

INSERT INTO coconuts (mass, volume, loveliness)
  SELECT (random() * 5)::int
       , (random() * 5)::int
       , (random() * 1000 + 9000)::int
  FROM GENERATE_SERIES(1,10000000);

Notice how the order of the columns in the GROUP BY affects the ordering:

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY mass, volume;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    0 |      1 | 10000
    0 |      2 | 10000
...

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    1 |      0 | 10000
    2 |      0 | 10000
...

And how it affects the query plan:

CREATE INDEX ON coconuts (mass, volume);
SET enable_seqscan=false; --To force the index if possible

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (mass, volume);
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)


EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (volume, mass);
                                            QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10001658532.83..10001758932.83 rows=40000 width=12)
   Group Key: volume, mass
   ->  Sort  (cost=10001658532.83..10001683532.83 rows=10000000 width=12)
         Sort Key: volume, mass
         ->  Seq Scan on coconuts  (cost=10000000000.00..10000154055.00 rows=10000000 width=12)
(5 rows)

However, if you add an ORDER BY that matches the original GROUP BY, the original query plan comes back, at least in Postgres 11.5.

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY volume, mass
  ORDER BY mass, volume;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)

[Comments]: