Pattern-matching for JSON values: slow EXISTS subquery on materialized view

【Question title】: Pattern-matching for JSON values: slow EXISTS subquery on materialized view
【Posted】: 2021-09-12 11:25:44
【Question】:

I'm running a local Docker instance of Postgres 12.5 (with 4MB work_mem). I'm implementing this pattern to search arbitrary fields in JSON. The goal is to search the JSON column profile quickly:

CREATE TABLE end_user (
    id varchar NOT NULL,
    environment_id varchar NOT NULL,
    profile jsonb NOT NULL DEFAULT '{}'::jsonb,
    CONSTRAINT end_user_pkey PRIMARY KEY (environment_id, id)
);

CREATE INDEX end_user_environment_id_idx ON private.end_user USING btree (environment_id);
CREATE INDEX end_user_id_idx ON private.end_user USING btree (id);
CREATE INDEX end_user_profile_idx ON private.end_user USING gin (profile);

CREATE MATERIALIZED VIEW user_profiles AS
SELECT u.environment_id, u.id, j.key, j.value 
FROM  end_user u, jsonb_each_text(u.profile) j(key, value);

CREATE UNIQUE INDEX on user_profiles (environment_id, id, key); 
CREATE INDEX user_profile_trgm_idx ON user_profiles using gin (value gin_trgm_ops);

I have this query, which is indexed correctly, so it executes in a few milliseconds over a million rows.

select * from user_profiles 
where value ilike '%auckland%' and key = 'timezone' and environment_id = 'test';

Execution time: 42 ms

Bitmap Heap Scan on user_profiles  (cost=28935.65..62591.44 rows=9659 width=65)                  
  Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))                    
  Filter: ((environment_id)::text = 'test'::text)                                              
  ->  BitmapAnd  (cost=28935.65..28935.65 rows=9659 width=0)                                     
        ->  Bitmap Index Scan on user_profile_trgm_idx  (cost=0.00..2923.95 rows=320526 width=0) 
              Index Cond: (value ~~* '%auckland%'::text)                                         
        ->  Bitmap Index Scan on user_profiles_key_idx  (cost=0.00..26006.62 rows=994408 width=0)
              Index Cond: (key = 'timezone'::text)

However, if I use it inside an exists subquery to build up a condition like this:

select * from users u
where 
   environment_id = 'test'
and exists (
    select 1 from user_profiles p
    where 
       value ilike '%auckland%' 
       and key = 'timezone'
       and p.id = u.id
       and environment_id = 'test'   
)

It executes very slowly.

Execution time: 17.44 s

Nested Loop  (cost=62616.01..124606.45 rows=9658 width=1459) (actual time=19206.818..28444.491 rows=332572 loops=1)                                            
  Buffers: shared hit=952734 read=624101                                                                                                                       
  ->  HashAggregate  (cost=62615.59..62707.52 rows=9193 width=15) (actual time=19205.238..19292.998 rows=332572 loops=1)                                       
        Group Key: (p.id)::text                                                                                                                                
        Buffers: shared hit=373 read=246174                                                                                                                    
        ->  Bitmap Heap Scan on user_profiles p  (cost=28935.65..62591.44 rows=9659 width=15) (actual time=278.211..18942.629 rows=332572 loops=1)             
              Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))                                                                      
              Rows Removed by Index Recheck: 17781109                                                                                                          
              Filter: ((environment_id)::text = 'test'::text)                                                                                                
              Heap Blocks: exact=43928 lossy=197955                                                                                                            
              Buffers: shared hit=373 read=246174                                                                                                              
              ->  BitmapAnd  (cost=28935.65..28935.65 rows=9659 width=0) (actual time=272.626..272.629 rows=0 loops=1)                                         
                    Buffers: shared hit=373 read=4291                                                                                                          
                    ->  Bitmap Index Scan on user_profile_trgm_idx  (cost=0.00..2923.95 rows=320526 width=0) (actual time=177.577..177.577 rows=332572 loops=1)
                          Index Cond: (value ~~* '%auckland%'::text)                                                                                           
                          Buffers: shared hit=373 read=455                                                                                                     
                    ->  Bitmap Index Scan on user_profiles_key_idx  (cost=0.00..26006.62 rows=994408 width=0) (actual time=92.586..92.589 rows=1000000 loops=1)
                          Index Cond: (key = 'timezone'::text)                                                                                                 
                          Buffers: shared read=3836                                                                                                            
  ->  Index Scan using end_user_id_idx on end_user u  (cost=0.42..6.79 rows=1 width=1459) (actual time=0.027..0.027 rows=1 loops=332572)                       
        Index Cond: ((id)::text = (p.id)::text)                                                                                                                
        Filter: ((environment_id)::text = 'test'::text)                                                                                                      
        Buffers: shared hit=952361 read=377927                                                                                                                 
Planning Time: 19.002 ms                                                                                                                                       
Execution Time: 28497.570 ms

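That's a shame, because exists would be convenient if it were fast: I could add more conditions dynamically in my application code, with each extra condition expressed as an extra exists clause.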
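For illustration, the pattern described above might look like the following sketch. It uses the end_user table from the DDL above; the second key ('language') and its search term are hypothetical and only show how an extra condition would be appended.

```sql
-- Sketch: one EXISTS clause per search condition, generated by application code.
-- The 'language' key and the 'en' search term are hypothetical examples.
SELECT u.*
FROM   end_user u
WHERE  u.environment_id = 'test'
AND    EXISTS (
   SELECT 1
   FROM   user_profiles p
   WHERE  p.environment_id = u.environment_id
   AND    p.id = u.id
   AND    p.key = 'timezone'
   AND    p.value ILIKE '%auckland%'
   )
AND    EXISTS (
   SELECT 1
   FROM   user_profiles p
   WHERE  p.environment_id = u.environment_id
   AND    p.id = u.id
   AND    p.key = 'language'
   AND    p.value ILIKE '%en%'
   );
```

Correlating on both environment_id and id matches the unique index on user_profiles (environment_id, id, key).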

By the way, a lateral join does speed things up, but I don't understand why there is such a big difference:

select * from users u,
lateral (
    select id from user_profiles p
    where 
        value ilike '%auckland%' 
        and key = 'timezone' 
        and environment_id = u.environment_id 
        and p.id = u.id
   ) ss
where u.environment_id = 'test';

Execution time: 304 ms

Gather  (cost=29936.07..91577.38 rows=9658 width=1474) (actual time=1100.824..15430.620 rows=332572 loops=1)                                                     
  Workers Planned: 2                                                                                                                                             
  Workers Launched: 2                                                                                                                                            
  Buffers: shared hit=1140551 read=436286                                                                                                                        
  ->  Nested Loop  (cost=28936.07..89611.58 rows=4024 width=1474) (actual time=602.490..14805.285 rows=110857 loops=3)                                           
        Buffers: shared hit=1140551 read=436286                                                                                                                  
        ->  Parallel Bitmap Heap Scan on user_profiles p  (cost=28935.65..62492.84 rows=4025 width=22) (actual time=602.078..12247.891 rows=110857 loops=3)      
              Recheck Cond: ((value ~~* '%auckland%'::text) AND (key = 'timezone'::text))                                                                        
              Rows Removed by Index Recheck: 5927036                                                                                                             
              Filter: ((environment_id)::text = 'test'::text)                                                                                                  
              Heap Blocks: exact=14659 lossy=65588                                                                                                               
              Buffers: shared hit=373 read=246174                                                                                                                
              ->  BitmapAnd  (cost=28935.65..28935.65 rows=9659 width=0) (actual time=1087.258..1087.259 rows=0 loops=1)                                         
                    Buffers: shared hit=373 read=4291                                                                                                            
                    ->  Bitmap Index Scan on user_profile_trgm_idx  (cost=0.00..2923.95 rows=320526 width=0) (actual time=853.075..853.076 rows=332572 loops=1)  
                          Index Cond: (value ~~* '%auckland%'::text)                                                                                             
                          Buffers: shared hit=373 read=455                                                                                                       
                    ->  Bitmap Index Scan on user_profiles_key_idx  (cost=0.00..26006.62 rows=994408 width=0) (actual time=231.295..231.295 rows=1000000 loops=1)
                          Index Cond: (key = 'timezone'::text)                                                                                                   
                          Buffers: shared read=3836                                                                                                              
        ->  Index Scan using end_user_id_idx on end_user u  (cost=0.42..6.74 rows=1 width=1459) (actual time=0.022..0.022 rows=1 loops=332572)                   
              Index Cond: ((id)::text = (p.id)::text)                                                                                                            
              Filter: ((environment_id)::text = 'test'::text)                                                                                                  
              Buffers: shared hit=1140178 read=190112                                                                                                            
Planning Time: 16.877 ms                                                                                                                                         
Execution Time: 15461.571 ms

I'm keen to hear any ideas on why the exists subquery is so slow, and any other options I could look at here.

Distinct counts as requested by Erwin. Note that this is dummy data for load testing, but it is fairly close to production ratios:

select count(distinct environment_id)  -- => 4
     , count(distinct key)             -- => 33
     , count(distinct value)           -- => 15M
from private.user_profiles;

Update after increasing work_mem to 16MB as suggested by Erwin:

ALTER SYSTEM SET work_mem to '16MB'; SELECT pg_reload_conf();

The exists query now executes in 500 ms, so things look much better. Here is the explain output now:

Gather  (cost=3926.79..400754.43 rows=9658 width=1459) (actual time=312.213..9396.610 rows=332572 loops=1)
  Workers Planned: 2
  Workers Launched: 2
  Buffers: shared hit=1141083 read=431918
  ->  Nested Loop  (cost=2926.79..398788.63 rows=4024 width=1459) (actual time=155.271..8987.721 rows=110857 loops=3)
        Buffers: shared hit=1141083 read=431918
        ->  Parallel Bitmap Heap Scan on user_profiles p  (cost=2926.36..371669.88 rows=4025 width=15) (actual time=150.989..2962.870 rows=110857 loops=3)
              Recheck Cond: (value ~~* '%auckland%'::text)
              Filter: (((environment_id)::text = 'test'::text) AND (key = 'timezone'::text))
              Heap Blocks: exact=82556
              Buffers: shared hit=981 read=241730
              ->  Bitmap Index Scan on user_profile_trgm_idx  (cost=0.00..2923.95 rows=320526 width=0) (actual time=243.604..243.605 rows=332572 loops=1)
                    Index Cond: (value ~~* '%auckland%'::text)
                    Buffers: shared hit=828
        ->  Index Scan using end_user_id_idx on end_user u  (cost=0.42..6.74 rows=1 width=1459) (actual time=0.054..0.054 rows=1 loops=332572)
              Index Cond: ((id)::text = (p.id)::text)
              Filter: ((environment_id)::text = 'test'::text)
              Buffers: shared hit=1140102 read=190188
Planning Time: 9.932 ms
Execution Time: 9427.067 ms

【Comments】:

Try disabling JIT (set jit = off;). Also, the output of explain (analyze, buffers) would be more helpful than a "simple" explain.
@a_horse_with_no_name Here you go: updated with explain (analyze, buffers) and set jit off. I need to read up on that, as I'm not sure what it currently does.

Tags: postgresql database-design exists jsonpath postgresql-performance

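【Solution 1】:

Server configuration

The first problem becomes apparent in this line of the EXPLAIN output:

Heap Blocks: exact=14659 lossy=65588

lossy means you don't have enough work_mem: the bitmap no longer fits in memory, so Postgres tracks whole pages instead of individual rows and has to recheck every row on those pages. Your setting is obviously very low. (The default setting of 4 MB is too low for a database with tables of millions of rows.)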
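A minimal way to test this without touching the server-wide configuration is to raise work_mem for the current session only and re-run the plan. The 16MB value is the one tried in the question, not a general recommendation.

```sql
-- Raise work_mem for this session only; other connections are unaffected.
SET work_mem = '16MB';
SHOW work_mem;

EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   user_profiles
WHERE  value ILIKE '%auckland%'
AND    key = 'timezone'
AND    environment_id = 'test';
-- In the output, check the "Heap Blocks:" line of the Bitmap Heap Scan:
-- a "lossy=..." entry means the bitmap still did not fit into work_mem.

-- Return to the configured default for this session.
RESET work_mem;
```

If the lossy entry disappears and "Rows Removed by Index Recheck" drops, the setting was the bottleneck for this query.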

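You probably need to do more in the server configuration department, and you seem to be limited by RAM in general. I see high "read" counts, which indicates a cold cache and/or a lack or misconfiguration of cache memory.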
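As a rough starting point, the current cache settings and the cumulative cache hit ratio of the database can be inspected as sketched below. The ratio is only an indicator: it covers the whole database since the last statistics reset, and it does not see the operating system's file cache.

```sql
-- Current memory-related settings.
SHOW shared_buffers;
SHOW effective_cache_size;

-- Share of block requests served from shared buffers for the current database.
SELECT datname,
       blks_hit,
       blks_read,
       round(100.0 * blks_hit / NULLIF(blks_hit + blks_read, 0), 2) AS hit_pct
FROM   pg_stat_database
WHERE  datname = current_database();
```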

This Postgres Wiki page can help you get started.

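SQL/JSON in Postgres 12 or later

My answer you have been working off is outdated. Back then, in July 2015, the current Postgres version was 9.4!

In Postgres 12 (which, as you stated later, is what you are running), the whole design can be radically simpler with regular expressions in SQL/JSON. The manual:

SQL/JSON path expressions allow matching text to a regular expression with the like_regex filter.

There is index support, too. Scrap the materialized view. All we need is your original table and an index like this: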

CREATE INDEX end_user_path_ops_idx ON end_user USING GIN (profile jsonb_path_ops);

This query is equivalent to your original and can use the index:

SELECT *
FROM   end_user u
WHERE  environment_id = 'test'
AND    profile @? '$.timezone ? (@ like_regex "auck" flag "i")';

db<>fiddle here
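Since the question wants to add conditions dynamically, here is a sketch of how several conditions could be combined in a single path expression instead of several EXISTS clauses. The 'language' key and its pattern are hypothetical. Wrapping the query in EXPLAIN (ANALYZE, BUFFERS) shows whether the planner actually uses end_user_path_ops_idx for a given expression; that is worth checking, because the manual describes GIN support for @? in terms of clauses of the form accessors_chain = constant extracted from the path.

```sql
-- Sketch: two conditions on the same JSON document, combined with && in jsonpath.
-- The 'language' key and the "^en" pattern are hypothetical examples.
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM   end_user u
WHERE  environment_id = 'test'
AND    profile @? '$ ? (@.timezone like_regex "auck" flag "i" && @.language like_regex "^en" flag "i")';
```

Application code would then only have to assemble the filter inside the parentheses, one predicate per condition.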

One downside is that you need to get used to the SQL/JSON path language.

【Comments】:

@Tim: After you provided the requested information, I had another look. You had already accepted, but I think the answer got better. :)
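@Tim: No index on (profile) is applicable to the expression eu.profile->>'timezone' ilike '%auckland%'. A trigram expression index would be: CREATE INDEX ON end_user USING GIN ((profile->>'timezone') gin_trgm_ops). See: stackoverflow.com/a/36489851/939860, stackoverflow.com/a/13452528/939860. That is faster for this particular expression. But the suggested approach with the JSON path language and a jsonb_path_ops GIN index covers the complete JSON column, not just a single key.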
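Spelled out, the single-key alternative from this comment could look like the sketch below. The index name is made up for illustration, and pg_trgm must be installed (the question already uses gin_trgm_ops, so it presumably is).

```sql
-- Trigram support; a no-op if the extension is already installed.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Expression index on one key only. The index name is hypothetical.
CREATE INDEX end_user_timezone_trgm_idx
ON     end_user USING gin ((profile->>'timezone') gin_trgm_ops);

-- The query must repeat the indexed expression exactly to be able to use the index.
SELECT *
FROM   end_user
WHERE  environment_id = 'test'
AND    profile->>'timezone' ILIKE '%auckland%';
```

The trade-off is the one named in the comment: one such index is needed per searchable key.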
I have seen two approaches: jsonb_path_exists(profile, '$.** ? (@ like_regex "auck" flag "i")') and profile @? '$.** ? (@ like_regex "auck" flag "i")'. Do you know whether both give the same result?
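One way to check this on your own data is to evaluate both forms side by side and count the rows where they disagree, as in the sketch below. Two things differ by design, according to the manual: the @? operator suppresses certain errors (such as a missing object field), whereas jsonb_path_exists() only does so when its optional silent argument is true; and index support is tied to the operator, so only the @? form is a candidate for a GIN index.

```sql
-- Sketch: compare operator and function form row by row.
-- The third argument '{}' is the (empty) vars object, the fourth enables silent mode.
SELECT count(*) AS total_rows,
       count(*) FILTER (WHERE via_operator IS DISTINCT FROM via_function) AS mismatches
FROM  (
   SELECT profile @? '$.** ? (@ like_regex "auck" flag "i")' AS via_operator,
          jsonb_path_exists(profile, '$.** ? (@ like_regex "auck" flag "i")', '{}', true) AS via_function
   FROM   end_user
   ) sub;
```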