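【Posted】: 2017-03-24 17:06:51
【Question】:
I want to query a table and sum one column over all rows whose date is the last day of a month.
Take the following table as an example: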
CREATE TABLE example(dt date, value int, other1 int, other2 int, other3 int);
CREATE INDEX ON example (dt);
My query looks like this:
SELECT dt, SUM(value)
FROM example
WHERE dt in (select date_trunc('month', d) + interval '1 month - 1 day'
from generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') dates(d))
GROUP BY dt
If I look at the query plan, I see that it performs a sequential scan on the table:
EXPLAIN ANALYSE SELECT dt, SUM(value)
FROM example
WHERE dt in (select date_trunc('month', d) + interval '1 month - 1 day'
from generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') dates(d))
GROUP BY dt
GroupAggregate  (cost=825385.12..871490.30 rows=1536 width=12) (actual time=4323.887..6141.401 rows=56 loops=1)
  Group Key: example.dt
  ->  Merge Join  (cost=825385.12..863846.28 rows=1525732 width=12) (actual time=4323.811..6118.514 rows=101102 loops=1)
        Merge Cond: (example.dt = ((date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval)))
        ->  Sort  (cost=825312.64..832941.30 rows=3051464 width=12) (actual time=4323.585..5303.902 rows=3051464 loops=1)
              Sort Key: example.dt
              Sort Method: external merge  Disk: 77512kB
              ->  Seq Scan on example  (cost=0.00..392353.64 rows=3051464 width=12) (actual time=10.385..1748.592 rows=3051464 loops=1)
        ->  Sort  (cost=72.48..72.98 rows=200 width=8) (actual time=0.168..18.248 rows=101105 loops=1)
              Sort Key: ((date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval))
              Sort Method: quicksort  Memory: 27kB
              ->  Unique  (cost=59.84..64.84 rows=200 width=8) (actual time=0.108..0.143 rows=59 loops=1)
                    ->  Sort  (cost=59.84..62.34 rows=1000 width=8) (actual time=0.106..0.112 rows=59 loops=1)
                          Sort Key: ((date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval))
                          Sort Method: quicksort  Memory: 27kB
                          ->  Function Scan on generate_series dates  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.042..0.097 rows=59 loops=1)
However, if I add extra SUMs to the query, the planner decides to use the index on dt:
EXPLAIN ANALYSE SELECT dt, SUM(value), SUM(other1), SUM(other2), SUM(other3)
FROM example
WHERE dt in (select date_trunc('month', d) + interval '1 month - 1 day'
from generate_series('2012-01-01'::date, '2016-11-10'::date, interval '1 month') dates(d))
GROUP BY dt
HashAggregate  (cost=1005765.17..1005780.53 rows=1536 width=61) (actual time=225.249..225.276 rows=56 loops=1)
  Group Key: l.as_of
  ->  Nested Loop  (cost=60.27..975250.53 rows=1525732 width=61) (actual time=0.141..173.853 rows=101102 loops=1)
        ->  Unique  (cost=59.84..64.84 rows=200 width=8) (actual time=0.100..0.192 rows=59 loops=1)
              ->  Sort  (cost=59.84..62.34 rows=1000 width=8) (actual time=0.099..0.125 rows=59 loops=1)
                    Sort Key: ((date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval))
                    Sort Method: quicksort  Memory: 27kB
                    ->  Function Scan on generate_series dates  (cost=0.01..10.01 rows=1000 width=8) (actual time=0.031..0.080 rows=59 loops=1)
        ->  Index Scan using dashboard_loanhistory_95daa586 on dashboard_loanhistory l  (cost=0.43..4856.06 rows=1987 width=61) (actual time=0.025..1.579 rows=1714 loops=59)
              Index Cond: (as_of = (date_trunc('month'::text, dates.d) + '1 mon -1 days'::interval))
Planning time: 0.228 ms
Execution time: 225.379 ms
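What is going on here? I would like the original query to run using the index on dt, and I don't want to add unnecessary extra aggregates to the query just to get that plan.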
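Both plans above estimate rows=1000 for the Function Scan on generate_series, while it actually returns 59 rows. One way to check whether that misestimate is what drives the plan choice might be to materialize the month-end dates into a table the planner has statistics for. This is only a sketch: the table name month_ends and the column name month_end are made up for illustration, the series bounds are written as timestamp rather than date, and there is no guarantee the planner will then pick the index.

```sql
-- Hypothetical helper table; month_ends / month_end are illustrative names only.
CREATE TEMP TABLE month_ends AS
SELECT (date_trunc('month', d) + interval '1 month - 1 day')::date AS month_end
FROM generate_series('2012-01-01'::timestamp,
                     '2016-11-10'::timestamp,
                     interval '1 month') AS dates(d);

-- Temp tables are not analyzed automatically, so gather statistics explicitly.
ANALYZE month_ends;

-- Same aggregate as the original query, now filtered through the analyzed table.
EXPLAIN ANALYSE
SELECT dt, SUM(value)
FROM example
WHERE dt IN (SELECT month_end FROM month_ends)
GROUP BY dt;
```

Comparing this plan with the two above would show whether the row estimate for the date list matters here.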
【Comments】:
-
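@a_horse_with_no_name, done (there may be minor differences between my illustrative example and the query plans I provided, but I believe they are similar enough for this purpose). To be clear, the same query with nothing changed except the extra SUMs on the other columns does use the index.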
-
@a_horse_with_no_name My mistake, fixed.
-
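The real problem is the use of generate_series(). The planner has no estimate of how many rows it returns, so it always assumes 1000. Maybe you should find a better way to identify the last day of a month, for example: WHERE date_trunc('month', dt) <> date_trunc('month', dt + '1 day'::interval)
-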
@joop It still does a sequential scan with the WHERE condition you suggested.
-
Would joining against generate_series() work better?
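For reference, the join suggested in the last comment could be sketched roughly as follows. The aliases e and m and the column name month_end are assumptions for illustration. The explicit ::date cast is also an addition, so that the comparison is date to date. Whether this form makes the planner use the index on dt is not confirmed anywhere in this thread.

```sql
-- Sketch only: join example against the generated month-end dates
-- instead of filtering with IN (subquery).
SELECT e.dt, SUM(e.value)
FROM example AS e
JOIN (
    -- One row per month: first day of the month plus one month minus one day.
    SELECT (date_trunc('month', d) + interval '1 month - 1 day')::date AS month_end
    FROM generate_series('2012-01-01'::timestamp,
                         '2016-11-10'::timestamp,
                         interval '1 month') AS dates(d)
) AS m ON e.dt = m.month_end
GROUP BY e.dt;
```

Each generated month yields exactly one month-end date, so the join should not duplicate rows from example.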
Tags: performance postgresql indexing aggregate-functions