PostgreSQL using pg_trgm slower than full scan

【Question title】: PostgreSQL using pg_trgm slower than full scan
【Posted】: 2019-08-28 00:56:43
【Question description】:

I am playing with the pg_trgm extension and I am a bit confused. Here is the session:

postgres=# create table t(i int, x text);
CREATE TABLE
postgres=# insert into t select i, random()::text from generate_series(1,50000000) as i;
INSERT 0 50000000
postgres=# explain analyze select * from t where x ilike '%666666%';
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..531870.29 rows=12954 width=36) (actual time=131.436..11408.176 rows=432 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..529574.89 rows=5398 width=36) (actual time=108.771..11304.946 rows=144 loops=3)
         Filter: (x ~~* '%666666%'::text)
         Rows Removed by Filter: 16666523
 Planning Time: 0.121 ms
 Execution Time: 11408.279 ms
(8 rows)

postgres=# explain analyze select * from t where x ilike '%666666%';
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..580654.94 rows=5000 width=21) (actual time=124.986..11070.983 rows=432 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on t  (cost=0.00..579154.94 rows=2083 width=21) (actual time=72.207..11010.876 rows=144 loops=3)
         Filter: (x ~~* '%666666%'::text)
         Rows Removed by Filter: 16666523
 Planning Time: 0.283 ms
 Execution Time: 11071.065 ms
(8 rows)

postgres=# create index i on t using gin (x gin_trgm_ops);
CREATE INDEX
postgres=# analyze t;
ANALYZE
postgres=# explain analyze select * from t where x ilike '%666666%';
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=54.75..18107.93 rows=5000 width=21) (actual time=116.114..26995.773 rows=432 loops=1)
   Recheck Cond: (x ~~* '%666666%'::text)
   Rows Removed by Index Recheck: 36257910
   Heap Blocks: exact=39064 lossy=230594
   ->  Bitmap Index Scan on i  (cost=0.00..53.50 rows=5000 width=0) (actual time=75.363..75.363 rows=592216 loops=1)
         Index Cond: (x ~~* '%666666%'::text)
 Planning Time: 0.389 ms
 Execution Time: 26996.429 ms
(8 rows)

postgres=# explain analyze select * from t where x ilike '%666666%';
                                                     QUERY PLAN                                                      
---------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=54.75..18107.93 rows=5000 width=21) (actual time=128.859..29231.765 rows=432 loops=1)
   Recheck Cond: (x ~~* '%666666%'::text)
   Rows Removed by Index Recheck: 36257910
   Heap Blocks: exact=39064 lossy=230594
   ->  Bitmap Index Scan on i  (cost=0.00..53.50 rows=5000 width=0) (actual time=79.147..79.147 rows=592216 loops=1)
         Index Cond: (x ~~* '%666666%'::text)
 Planning Time: 0.252 ms
 Execution Time: 29231.945 ms
(8 rows)

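As you can see, the query without the index is more than twice as fast as the one using the index. For now, PostgreSQL is running with its default settings (shared buffers, work memory, etc.).

What am I missing?

PS: PostgreSQL 11.5 (Ubuntu 11.5-1.pgdg18.04+1) on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 7.4.0-1ubuntu1~18.04.1) 7.4.0, 64-bit

PPS: Using a gist index is even slower.

【Question discussion】:

Tags: postgresql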

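【Solution 1】:

tldr: trigrams are probably poor at searching for patterns made of a single character repeated N times (e.g. 666666), because such a pattern has only 1 non-terminal trigram, and that trigram may occur very frequently in the search space.

When the gin index is used, the bitmap of rows is too large to fit in memory, so it stores references to pages instead, and the database has to perform a further recheck scan over those pages. If the number of rechecked pages is small, using the index is still beneficial, but with a large number of rechecked pages the index performs poorly. The following lines of the explain output highlight this: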

   Recheck Cond: (x ~~* '%666666%'::text)
   Rows Removed by Index Recheck: 36257910
   Heap Blocks: exact=39064 lossy=230594

The problem is specific to your search string, 666666, in combination with the test data.

If you run select show_trgm('666666'), you will see:

        show_trgm        
-------------------------
 {"  6"," 66","66 ",666}
(1 row)

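The first 3 trigrams are not even generated in a LIKE context (correction suggested by user jjanes). Searching the index therefore yields every page that contains 666. You can verify this by running the explain analyze query with ... ilike '%666%' and observing the same Heap Blocks output as above.

If you search with the pattern 123456 instead, you will see that it performs much better, because it generates a larger set of trigrams to search for: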

              show_trgm              
-------------------------------------
 {"  1"," 12",123,234,345,456,"56 "}
(1 row)
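
Only trigrams that lie fully inside the pattern can help an unanchored search such as '%666666%'. A rough way to list them, without show_trgm's space padding, is plain substring arithmetic. This is a sketch that approximates what pg_trgm can extract from such a pattern, not a call into the extension itself; the aliases pattern and interior_trigrams are made up for illustration.

```sql
-- Approximation only: enumerate every 3-character window of each pattern
-- and keep the distinct ones. It does not call pg_trgm.
select p as pattern,
       array_agg(distinct substr(p, i, 3) order by substr(p, i, 3)) as interior_trigrams
from (values ('666666'), ('123456')) as v(p)
     cross join lateral generate_series(1, length(p) - 2) as i
group by p;
```

For 666666 every 3-character window is identical, so the list collapses to the single trigram 666, whereas 123456 yields 123, 234, 345 and 456.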

On my machine, I get the following:

|------------------------------------|
| pattern | pages rechecked          |
|         | exact | lossy  | total   |
|------------------------------------|
| 123456  |   600 |        |    600  |
| 666666  | 39454 | 230592 | 270046* |
|    666  | 39454 | 230592 | 270046* |
|------------------------------------|
*this is roughly 85% of the total # of pages used for the table 't'
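
To put the rechecked page counts in perspective, the total number of pages in the table can be looked up as sketched below. Both queries assume the table is named t as in the question and that it is the only relation with that name on the search path.

```sql
-- Planner's page count for the table; an estimate that is
-- refreshed by VACUUM and ANALYZE.
select relpages from pg_class where relname = 't';

-- Page count derived from the current on-disk size of the main fork.
select pg_relation_size('t') / current_setting('block_size')::int as pages;
```

Dividing exact + lossy heap blocks from the explain output by this number gives the fraction of the table that the bitmap heap scan had to visit.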

Here is the explain output:

postgres=> explain analyze select * from t where x ~ '123456';
                                                        QUERY PLAN                                                        
--------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=90.75..18143.92 rows=5000 width=22) (actual time=110.962..113.509 rows=518 loops=1)
   Recheck Cond: (x ~ '123456'::text)
   Rows Removed by Index Recheck: 83
   Heap Blocks: exact=600
   ->  Bitmap Index Scan on t_x_idx  (cost=0.00..89.50 rows=5000 width=0) (actual time=110.868..110.868 rows=601 loops=1)
         Index Cond: (x ~ '123456'::text)
 Planning time: 0.703 ms
 Execution time: 113.564 ms
(8 rows)

postgres=> explain analyze select * from t where x ~ '666666';
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=54.75..18107.92 rows=5000 width=22) (actual time=137.143..18111.609 rows=462 loops=1)
   Recheck Cond: (x ~ '666666'::text)
   Rows Removed by Index Recheck: 36258389
   Heap Blocks: exact=39454 lossy=230592
   ->  Bitmap Index Scan on t_x_idx  (cost=0.00..53.50 rows=5000 width=0) (actual time=105.962..105.962 rows=593708 loops=1)
         Index Cond: (x ~ '666666'::text)
 Planning time: 0.420 ms
 Execution time: 18111.739 ms
(8 rows)

postgres=> explain analyze select * from t where x ~ '666';
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on t  (cost=54.75..18107.92 rows=5000 width=22) (actual time=102.813..17285.086 rows=593708 loops=1)
   Recheck Cond: (x ~ '666'::text)
   Rows Removed by Index Recheck: 35665143
   Heap Blocks: exact=39454 lossy=230592
   ->  Bitmap Index Scan on t_x_idx  (cost=0.00..53.50 rows=5000 width=0) (actual time=96.100..96.100 rows=593708 loops=1)
         Index Cond: (x ~ '666'::text)
 Planning time: 0.500 ms
 Execution time: 17300.440 ms
(8 rows)

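【Discussion】:

"We know that no rows will match the first 3 trigrams, because there are no spaces in your test data." Even if he did have spaces in the test data, they would not be searched for. In an ILIKE context, trigrams containing spaces are not generated. (show_trgm is not aware of the context, so it does not show this fact.)
@HaleemurAli Thank you for the answer, the problem became obvious after it. However, you answered the question "Why?" but not "How?" It seems that the PostgreSQL planner makes the wrong decision about whether to use the index. Also @LaurenzAlbe
@Abelisto That is buried in the guts of patternsel_common, and I can hardly guess. But since the estimate is exactly 0.01%, it looks like some rule of thumb is at work. Does the estimate vary with different patterns?

【Solution 2】:

You already have a good answer that explains why '%666666%' is nearly the worst case for pg_trgm with your sample data.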

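It is hard to say whether this worst case is a "fair" test. Sometimes the worst case is unavoidable and performance-sensitive. If that is your situation, then perhaps it is a fair test. On the other hand, worrying about the performance of pathological queries rather than realistic ones is often a waste of time.

But there are some things you can do to improve the worst-case performance.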

Heap Blocks: exact=39064 lossy=230594

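The lossy blocks here are terrible for performance. If you increase "work_mem" until they go away, that will probably close most of the gap between the index scan and the seq scan, and might even reverse it. It does not take a large setting either; in my hands 20MB was enough. On a modern server, that is a fairly conservative setting.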
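
A minimal sketch of that experiment, scoped to the current session so nothing is changed server-wide. The 20MB value is the one quoted above; the value needed on another machine or data set may differ.

```sql
show work_mem;              -- 4MB unless the default was changed
set work_mem = '20MB';      -- affects this session only
explain analyze select * from t where x ilike '%666666%';
reset work_mem;             -- back to the configured value
```

The thing to look for is whether the "Heap Blocks" line still reports a lossy count; if it does, work_mem is still too small for the bitmap.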

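If your table is larger than what can be cached in RAM, you will spend a lot of time reading data from disk. If that is the case, increasing "effective_io_concurrency" may also help close the gap between the worst-case use of the index and the seq scan.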
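
A sketch of how that could be tried per session. The value 200 is purely illustrative (an assumption, not a figure from this thread); a sensible value depends on the storage hardware.

```sql
show effective_io_concurrency;        -- 1 by default on PostgreSQL 11
set effective_io_concurrency = 200;   -- illustrative value, tune for your storage
explain (analyze, buffers)
  select * from t where x ilike '%666666%';
reset effective_io_concurrency;
```

The buffers option adds shared hit/read counts to the plan, which helps tell whether the scan was actually reading from disk.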

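Another thing to be aware of is that the seq scan uses 2 parallel workers. So while it finishes about twice as fast, it probably uses about 3 times the resources to do so. (I do not understand why the index scan is not using a parallel bitmap scan; I think it is eligible.)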
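
For a like-for-like comparison, the seq scan can be timed without parallel workers. This is only a sketch: it disables parallelism and bitmap scans inside one transaction so the planner falls back to a single-process seq scan even though the index exists.

```sql
begin;
set local max_parallel_workers_per_gather = 0;  -- no parallel workers
set local enable_bitmapscan = off;              -- discourage the gin bitmap scan
explain analyze select * from t where x ilike '%666666%';
rollback;                                       -- set local ends with the transaction
```

Comparing this timing with the bitmap heap scan timing shows how much of the seq scan's advantage comes from the extra workers.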

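If you can make the worst-case use of the index roughly as fast as the seq scan, and the average case much better, then you are already way ahead.

【Discussion】: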