【Posted】: 2021-05-26 10:38:48
【Question】:
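Table structure
I have a Postgres table T with 3 columns (w1, w2, occurrences). All three are integers, and w1 and w2 reference another table. So basically, there are two foreign keys and one value.
w1 and w2 can each take about 15 million unique values (indices). Currently, T contains about 1.8 billion rows, which are more or less permutations of the indices, with one restriction: if you think of a symmetric matrix, only one of the two mirrored value pairs may exist. For example, there may be (252, 13, x), but then there is no (13, 252, x). However, the pairs are not sorted by value, so (5, 900, x) may also be in T (the ordering is decided during insertion and depends on values in the referenced table). The tuples (w1, w2) are unique.
Currently, there are 3 different indexes on the table: UNIQUE INDEX tdc_idx_1 on T (w1, w2), INDEX tdc_idx_w1 on T (w1), and INDEX tdc_idx_w2 on T (w2).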
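For reference, the structure described above can be sketched as DDL. This is only an illustration: the referenced table is not named in the question, so `tokens (id)` is a hypothetical stand-in, and the NOT NULL constraints are an assumption as well. The index names are the ones listed above.

```sql
-- Hypothetical referenced table; its real name and columns are not given.
CREATE TABLE tokens (
    id integer PRIMARY KEY
);

-- T as described: two foreign keys and one value, all integers.
-- NOT NULL is an assumption made for this sketch.
CREATE TABLE T (
    w1          integer NOT NULL REFERENCES tokens (id),
    w2          integer NOT NULL REFERENCES tokens (id),
    occurrences integer NOT NULL
);

-- The three indexes named in the question.
CREATE UNIQUE INDEX tdc_idx_1 ON T (w1, w2);
CREATE INDEX tdc_idx_w1 ON T (w1);
CREATE INDEX tdc_idx_w2 ON T (w2);
```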
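Problem statement
Basically, I want to run two different queries. The difficulty for me is either to figure out where the poor performance comes from, or to accept the runtime given the "complexity" of the queries and the size of the table (which I guess is not the case...). In short, I probably need help with the query structure and possibly with the table indexes (or maybe with the table design in general).
The most straightforward queries for the results I want are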
-- A (the OR-query)
SELECT w1, w2, occurrences FROM public.T
WHERE (w1 in (123, 555, 999) OR w2 in (123, 555, 999)) AND occurrences > 1;
and
-- B (the AND-query)
SELECT w1, w2, occurrences FROM public.T
WHERE (w1 in (123, 555, 999) AND w2 in (123, 555, 999)) AND occurrences > 1;
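respectively (note the OR/AND).
Below are some performance measurements, along with some alternative queries that I came up with using EXPLAIN ANALYZE.
Performance
Query A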
-- A (OR)
EXPLAIN ANALYSE
SELECT w1, w2, occurrences
FROM public.T
WHERE (w1 in (123, 555, 999) OR w2 in (123, 555, 999)) AND occurrences > 1
Successfully run. Total query runtime: 29 min 19 secs.
173071 rows affected. (actual result without EXPLAIN ANALYZE)
Gather (cost=3378.13..13984524.32 rows=50262 width=12) (actual time=144.749..1751127.665 rows=173071 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Parallel Bitmap Heap Scan on T (cost=2378.13..13978498.12 rows=20942 width=12) (actual time=154.140..1750873.360 rows=57690 loops=3)
Recheck Cond: ((w1 = ANY ('{123,555,999}'::integer[])) OR (w2 = ANY ('{123,555,999}'::integer[])))
Rows Removed by Index Recheck: 20069348
Filter: (occurences > 1)
Rows Removed by Filter: 110878
Heap Blocks: exact=902 lossy=107950
-> BitmapOr (cost=2378.13..2378.13 rows=214069 width=0) (actual time=114.320..114.335 rows=0 loops=1)
-> Bitmap Index Scan on tdc_idx_w1 (cost=0.00..1631.03 rows=148439 width=0) (actual time=17.100..17.103 rows=166301 loops=1)
Index Cond: (w1 = ANY ('{123,555,999}'::integer[]))
-> Bitmap Index Scan on tdc_idx_w2 (cost=0.00..721.96 rows=65630 width=0) (actual time=97.208..97.211 rows=339406 loops=1)
Index Cond: (w2 = ANY ('{123,555,999}'::integer[]))
Planning Time: 0.280 ms
JIT:
Functions: 12
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 2.689 ms, Inlining 117.210 ms, Optimization 36.646 ms, Emission 28.227 ms, Total 184.772 ms
Execution Time: 1751571.992 ms
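The plan above reports `Heap Blocks: exact=902 lossy=107950` and about 20 million rows removed by the index recheck. PostgreSQL switches a bitmap to lossy, page-level entries when the bitmap does not fit into `work_mem`, and every row on a lossy page then has to be rechecked. Whether this explains the runtime here is an assumption. A minimal sketch for testing it, with an arbitrary illustrative value of 64MB:

```sql
-- Session-level only; 64MB is an arbitrary illustrative value, not a recommendation.
SET work_mem = '64MB';

-- BUFFERS additionally reports how many pages came from cache versus disk.
EXPLAIN (ANALYZE, BUFFERS)
SELECT w1, w2, occurrences
FROM public.T
WHERE (w1 IN (123, 555, 999) OR w2 IN (123, 555, 999))
  AND occurrences > 1;

-- Restore the configured default afterwards.
RESET work_mem;
```

If the lossy count in `Heap Blocks` drops to zero and `Rows Removed by Index Recheck` disappears, the bitmap fit into memory for that run.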
Query B
-- B (AND)
EXPLAIN ANALYSE
SELECT w1, w2, occurrences
FROM public.T
WHERE (w1 in (123, 555, 999) AND w2 in (123, 555, 999)) AND occurrences > 1
Successfully run. Total query runtime: 1 secs 716 msec.
3 rows affected. (actual result without EXPLAIN ANALYZE)
Index Scan using tdc_idx_1 on T (cost=0.58..61.13 rows=1 width=12) (actual time=88.895..239.394 rows=3 loops=1)
Index Cond: ((w1 = ANY ('{123,555,999}'::integer[])) AND (w2 = ANY ('{123,555,999}'::integer[])))
Filter: (occurences > 1)
Planning Time: 51.256 ms
Execution Time: 239.518 ms
Query A Alternative 1
-- A1 (OR)
EXPLAIN ANALYSE
SELECT w1, w2, occurences
FROM public.token_doc_cooccurrences
INNER JOIN (
VALUES (123), (555), (999)
) vals(v)
ON (w1 = v)
WHERE occurences > 1
UNION
SELECT w1, w2, occurences
FROM public.token_doc_cooccurrences
INNER JOIN (
VALUES (123), (555), (999)
) vals2(w)
ON (w2 = w)
WHERE occurences > 1;
Successfully run. Total query runtime: 29 min 28 secs.
173071 rows affected. (actual result without EXPLAIN ANALYZE)
HashAggregate (cost=841679.67..842197.01 rows=51734 width=12) (actual time=1755066.279..1755315.668 rows=173071 loops=1)
Group Key: T.w1, T.w2, T.occurences
Batches: 5 Memory Usage: 4145kB Disk Usage: 6920kB
-> Append (cost=561.79..841291.66 rows=51734 width=12) (actual time=1088.188..1753999.477 rows=173074 loops=1)
-> Nested Loop (cost=561.79..579176.31 rows=35898 width=12) (actual time=1088.181..75981.142 rows=55822 loops=1)
-> Values Scan on *VALUES* (cost=0.00..0.04 rows=3 width=4) (actual time=0.006..0.032 rows=3 loops=1)
-> Bitmap Heap Scan on T (cost=561.79..192939.10 rows=11966 width=12) (actual time=407.806..25231.892 rows=18607 loops=3)
Recheck Cond: (w1 = *VALUES*.column1)
Filter: (occurences > 1)
Rows Removed by Filter: 36826
Heap Blocks: exact=10630
-> Bitmap Index Scan on tdc_idx_w1 (cost=0.00..558.80 rows=50963 width=0) (actual time=74.175..74.175 rows=55434 loops=3)
Index Cond: (w1 = *VALUES*.column1)
-> Nested Loop (cost=250.51..261339.35 rows=15836 width=12) (actual time=217.234..1677133.291 rows=117252 loops=1)
-> Values Scan on *VALUES*_1 (cost=0.00..0.04 rows=3 width=4) (actual time=0.020..0.042 rows=3 loops=1)
-> Bitmap Heap Scan on T T_1 (cost=250.51..87060.31 rows=5279 width=12) (actual time=143.454..558815.229 rows=39084 loops=3)
Recheck Cond: (w2 = *VALUES*_1.column1)
Rows Removed by Index Recheck: 19099252
Filter: (occurences > 1)
Rows Removed by Filter: 74051
Heap Blocks: exact=18304 lossy=311451
-> Bitmap Index Scan on tdc_idx_w2 (cost=0.00..249.19 rows=22482 width=0) (actual time=122.725..122.725 rows=113135 loops=3)
Index Cond: (w2 = *VALUES*_1.column1)
Planning Time: 68.676 ms
JIT:
Functions: 21
Options: Inlining true, Optimization true, Expressions true, Deforming true
Timing: Generation 5.749 ms, Inlining 219.374 ms, Optimization 370.011 ms, Emission 360.728 ms, Total 955.862 ms
Execution Time: 1756917.366 ms
Query B Alternative 1
-- B1 (AND)
EXPLAIN ANALYSE
SELECT w1, w2, occurences
FROM public.token_doc_cooccurrences
INNER JOIN (
VALUES (123), (555), (999)
) vals(v)
ON (w1 = v)
INNER JOIN (
VALUES (123), (555), (999)
) vals2(w)
ON (w2 = w)
WHERE occurences > 1;
Successfully run. Total query runtime: 1 secs 5 msec.
3 rows affected. (actual result without EXPLAIN ANALYZE)
Nested Loop (cost=0.58..77.73 rows=1 width=12) (actual time=130.157..295.939 rows=3 loops=1)
-> Values Scan on "*VALUES*_1" (cost=0.00..0.04 rows=3 width=4) (actual time=0.005..0.019 rows=3 loops=1)
-> Nested Loop (cost=0.58..25.87 rows=3 width=12) (actual time=54.355..98.622 rows=1 loops=3)
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=4) (actual time=0.003..0.017 rows=3 loops=3)
-> Index Scan using tdc_idx_1 on T (cost=0.58..8.60 rows=1 width=12) (actual time=32.853..32.855 rows=0 loops=9)
Index Cond: ((w1 = "*VALUES*".column1) AND (w2 = "*VALUES*_1".column1))
Filter: (occurences > 1)
Planning Time: 59.447 ms
Execution Time: 296.042 ms
Info
Between all measurements, I did this to clear cached queries:
systemctl stop postgresql
sync
echo 3 > /proc/sys/vm/drop_caches
systemctl start postgresql
== EDIT ==
As suggested by @jjanes, I added the additional indexes (w1, occurrences, w2) and (w2, occurrences, w1) and ran VACUUM.
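In SQL, that step might look roughly like the following. The exact statements used are not shown in the question, so this is a reconstruction under assumptions: the index names `w1_occ_w2` and `w2_occ_w1`, the table name `token_doc_cooccurrences`, and the column spelling `occurences` are taken from the EXPLAIN output in this post.

```sql
-- Covering indexes: each one contains every column the query reads,
-- so the planner can answer it from the index alone.
CREATE INDEX w1_occ_w2
    ON public.token_doc_cooccurrences (w1, occurences, w2);
CREATE INDEX w2_occ_w1
    ON public.token_doc_cooccurrences (w2, occurences, w1);

-- VACUUM updates the visibility map, which index-only scans rely on
-- to avoid visiting the heap.
VACUUM public.token_doc_cooccurrences;
```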
Additionally, I set work_mem = 16MB. After that, I cleared the cache again and ran Query A Alternative 1 again. The result is as follows:
HashAggregate (cost=3535.61..4052.95 rows=51734 width=12) (actual time=3240.753..3471.160 rows=173071 loops=1)
Group Key: token_doc_cooccurrences.w1, token_doc_cooccurrences.w2, token_doc_cooccurrences.occurences
Batches: 1 Memory Usage: 14353kB
Buffers: shared hit=152549 read=993
-> Append (cost=0.58..3147.60 rows=51734 width=12) (actual time=196.966..2927.069 rows=173074 loops=1)
Buffers: shared hit=152549 read=993
-> Nested Loop (cost=0.58..1642.71 rows=35898 width=12) (actual time=196.960..1948.256 rows=55822 loops=1)
Buffers: shared hit=46960 read=533
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=3 width=4) (actual time=0.007..0.024 rows=3 loops=1)
-> Index Only Scan using w1_occ_w2 on token_doc_cooccurrences (cost=0.58..427.90 rows=11966 width=12) (actual time=85.026..603.453 rows=18607 loops=3)
Index Cond: ((w1 = "*VALUES*".column1) AND (occurences > 1))
Heap Fetches: 0
Buffers: shared hit=46960 read=533
-> Nested Loop (cost=0.58..728.88 rows=15836 width=12) (actual time=40.764..573.493 rows=117252 loops=1)
Buffers: shared hit=105589 read=460
-> Values Scan on "*VALUES*_1" (cost=0.00..0.04 rows=3 width=4) (actual time=0.008..0.027 rows=3 loops=1)
-> Index Only Scan using w2_occ_w1 on token_doc_cooccurrences token_doc_cooccurrences_1 (cost=0.58..190.16 rows=5279 width=12) (actual time=30.809..99.922 rows=39084 loops=3)
Index Cond: ((w2 = "*VALUES*_1".column1) AND (occurences > 1))
Heap Fetches: 0
Buffers: shared hit=105589 read=460
Planning:
Buffers: shared hit=121 read=13
Planning Time: 165.009 ms
Execution Time: 3672.053 ms
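Personally, I don't know how much of the effect comes from the work_mem setting and how much from the index-only scans, but the result is great!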
Successfully run. Total query runtime: 6 secs 122 msec.
173071 rows affected.
~6 seconds instead of nearly 30 minutes.
【Discussion】:
Tags: postgresql indexing