Why are Postgres lookups on jsonb columns so slow? Answers

[Question title]: Why are Postgres lookups on jsonb columns so slow?
[Posted]: 2017-04-22 16:50:06
[Question description]:

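I have a table targeting with a column marital_status of type text[] and another column data of type jsonb. The two columns hold the same content, just in different formats (for demonstration purposes only). Sample data: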

 id |      marital_status      |                        data                       
----+--------------------------+---------------------------------------------------
  1 | null                     | {}
  2 | {widowed}                | {"marital_status": ["widowed"]}
  3 | {never_married,divorced} | {"marital_status": ["never_married", "divorced"]}
...

The table contains more than 690K records with random combinations.

Lookup on the text[] column

EXPLAIN ANALYZE SELECT marital_status
FROM targeting
WHERE marital_status @> '{widowed}'::text[]

Without an index

Typically takes:

Seq Scan on targeting  (cost=0.00..172981.38 rows=159061 width=28) (actual time=0.017..840.084 rows=158877 loops=1)
  Filter: (marital_status @> '{widowed}'::text[])
  Rows Removed by Filter: 452033
Planning time: 0.150 ms
Execution time: 845.731 ms

With an index

Using the following index:

CREATE INDEX targeting_marital_status_idx ON targeting ("marital_status");

Result:

Index Only Scan using targeting_marital_status_idx on targeting  (cost=0.42..23931.35 rows=159061 width=28) (actual time=3.528..143.848 rows=158877 loops=1)
  Filter: (marital_status @> '{widowed}'::text[])
  Rows Removed by Filter: 452033
  Heap Fetches: 0
Planning time: 0.217 ms
Execution time: 148.506 ms

Lookup on the jsonb column

EXPLAIN ANALYZE SELECT data
FROM targeting
WHERE (data -> 'marital_status') @> '["widowed"]'::jsonb

Without an index

Typically takes:

Seq Scan on targeting  (cost=0.00..174508.65 rows=611 width=403) (actual time=0.095..5399.112 rows=158877 loops=1)
  Filter: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
  Rows Removed by Filter: 452033
Planning time: 0.172 ms
Execution time: 5408.326 ms

With an index

Using the following index:

CREATE INDEX targeting_data_marital_status_idx ON targeting USING GIN ((data->'marital_status'));

Result:

Bitmap Heap Scan on targeting  (cost=144.73..2482.75 rows=611 width=403) (actual time=85.966..3694.834 rows=158877 loops=1)
  Recheck Cond: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
  Rows Removed by Index Recheck: 201080
  Heap Blocks: exact=33723 lossy=53028
  ->  Bitmap Index Scan on targeting_data_marital_status_idx  (cost=0.00..144.58 rows=611 width=0) (actual time=78.851..78.851 rows=158877 loops=1)
        Index Cond: ((data -> 'marital_status'::text) @> '["widowed"]'::jsonb)
Planning time: 0.257 ms
Execution time: 3703.492 ms

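Questions

Why does the text[] column perform so well, even without an index?
Why does adding an index to the jsonb column improve performance by only 35%?
Is there a more efficient way to do lookups on a jsonb column?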

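[Comments on the question]:

One difference is the data being returned: you select text in one query and jsonb in the other. How about running both with SELECT 1 FROM ... so the output is exactly the same? Does that make any difference?
GIN indexes are generally less efficient than a b-tree, so that is to be expected. What surprises me is how slow it is without an index. Is all the time being spent on CPU?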
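
The comparison suggested in the first comment can be sketched as follows. This is only a restatement of the two predicates from the question with a constant select list; it assumes the targeting table as described there, and no timings are claimed for it. The plans in the question already report width=28 for the text[] query and width=403 for the jsonb query, so returning a constant removes that difference from the measurement.

```sql
-- Same predicates as in the question, but both queries return a constant,
-- so the output rows are identical and only the filtering cost differs.
EXPLAIN ANALYZE SELECT 1
FROM targeting
WHERE marital_status @> '{widowed}'::text[];

EXPLAIN ANALYZE SELECT 1
FROM targeting
WHERE (data -> 'marital_status') @> '["widowed"]'::jsonb;
```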

Tags: postgresql indexing postgresql-9.4 jsonb indices

[Answer 1]:

This seems like a simple question. Essentially you are asking how come

CREATE TABLE foo ( id int, key1 text );

is faster than

CREATE TABLE bar ( id int, foo jsonb );

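@Craig answered this in the comments:

GIN indexes are generally less efficient than a b-tree, so that is to be expected.

The null value in that schema should also be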

SELECT jsonb_build_object('marital_status',ARRAY[null]);
     jsonb_build_object     
----------------------------
 {"marital_status": [null]}
(1 row)

rather than {}. PostgreSQL takes many shortcuts to make updating jsonb objects fast and to save index space.

If none of this makes sense, look at this pseudo-table.

CREATE TABLE foo ( id int, x text, y text, z text );
CREATE INDEX ON foo(x);
CREATE INDEX ON foo(y);
CREATE INDEX ON foo(z);

Here we have three btree indexes on the table. Let's look at a similar table.

CREATE TABLE bar ( id int, junk jsonb );
CREATE INDEX ON bar USING gin (junk);
INSERT INTO bar (id,junk) VALUES (1,$${"x": 10, "y": 42}$$);

To make bar perform like foo, we would need two btree indexes, each of which would individually be larger than the single GIN index we have. And if you then ran

INSERT INTO bar (id,junk) VALUES (1,$${"x": 10, "y": 42, "z":3}$$);

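we would have to build another btree index on z, which would again be huge. You can see where I am going with this. jsonb is great, but its indexing and schema-modeling capabilities are not on par with those of a properly modeled database. You cannot just reduce a database to a single jsonb column, issue a CREATE INDEX, and expect the same performance.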
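
As a rough illustration of the "one btree per key" point, the per-key indexes on bar could look like the sketch below. It assumes the bar table defined above; the index names are hypothetical, and the ->> operator is used so that each key is indexed as text, mirroring the text columns of foo.

```sql
-- Expression btree indexes: one per key that should be searchable.
-- Index names are made up for this illustration.
CREATE INDEX bar_junk_x_idx ON bar ((junk ->> 'x'));
CREATE INDEX bar_junk_y_idx ON bar ((junk ->> 'y'));

-- A query has to repeat the indexed expression for the planner to
-- consider the index; ->> yields text, hence the quoted '10'.
SELECT id FROM bar WHERE junk ->> 'x' = '10';
```

Every new key that needs fast equality lookups means another index of this kind, which is the maintenance cost the answer is pointing at.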

[Comments]:

[Answer 2]:

This may be a consequence of using jsonb_ops (the default GIN operator class) instead of jsonb_path_ops.

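According to the documentation: https://www.postgresql.org/docs/9.6/static/datatype-json.html

Although the jsonb_path_ops operator class supports only queries with the @> operator, it has notable performance advantages over the default operator class jsonb_ops. A jsonb_path_ops index is usually much smaller than a jsonb_ops index over the same data, and the specificity of searches is better, particularly when queries contain keys that appear frequently in the data. Therefore search operations typically perform better than with the default operator class.

The technical difference between a jsonb_ops and a jsonb_path_ops GIN index is that the former creates independent index items for each key and value in the data, while the latter creates index items only for each value in the data. [1] Basically, each jsonb_path_ops index item is a hash of the value and the key(s) leading to it; for example, to index {"foo": {"bar": "baz"}}, a single index item would be created incorporating all three of foo, bar, and baz into the hash value. Thus a containment query looking for this structure would result in an extremely specific index search; but there is no way at all to find out whether foo appears as a key. On the other hand, a jsonb_ops index would create three index items representing foo, bar, and baz separately; then to do a containment query, it would look for rows containing all three of these items. While GIN indexes can perform such an AND search fairly efficiently, it will still be less specific and slower than the equivalent jsonb_path_ops search, especially if there are a very large number of rows containing any single one of the three index items.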
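
A minimal sketch of what trying jsonb_path_ops on the table from the question might look like. The index name is hypothetical, and the query is rewritten as a containment test on the whole data column so that it matches the indexed expression. Whether this is actually faster on that data set is not something the source shows; the predicate matches 158877 of the 610910 rows in the posted plans, so heap access may still dominate.

```sql
-- GIN index on the whole jsonb column using the jsonb_path_ops
-- operator class (supports only the @> containment operator).
-- The index name is made up for this illustration.
CREATE INDEX targeting_data_path_ops_idx
    ON targeting USING GIN (data jsonb_path_ops);

-- Containment query against the indexed column itself, so the
-- planner can consider the index above.
EXPLAIN ANALYZE SELECT data
FROM targeting
WHERE data @> '{"marital_status": ["widowed"]}'::jsonb;
```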

[Comments]: