【Question Title】: Subquery performance improvement
【Posted】: 2021-08-14 06:29:31
【Question】:

I have the following query, and I am running into performance problems as the offset grows.

SELECT
c.id,
c.first_name "firstName",
c.last_name "lastName",
c.email "email",
(
    SELECT
      di.income_day
    FROM
      daily_income di
    INNER JOIN person p2 on di.person_id = p2.id
    WHERE
      p2.id = c.id
    ORDER BY di.income_day DESC
    LIMIT 1
) "lastDay"
FROM person c
INNER JOIN person_calorie ca
    ON c.id = ca.person_id
WHERE
c.record_status = true
AND
    c.role = 'patient'
ORDER BY c.number ASC, c.first_name ASC
OFFSET 0
LIMIT 10;

Here I am trying to get a list of people together with the last day each of them registered in the daily_income table. To achieve this I created a subquery using the parent id: it basically fetches a second list, sorts it, and takes the top row with LIMIT 1.

The query as a whole works, but once I start fetching with OFFSET 100+ it takes noticeably longer. Fetching now takes around 3 seconds, and I will use this query in production against 1000+ rows, so I am worried it will be too slow.

Could you help me fix the problem, or suggest how to improve it?

Update

Offset = 0

Limit  (cost=54.24..88681.26 rows=10 width=86) (actual time=27.335..242.011 rows=10 loops=1)
  Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status, ca.name, ((SubPlan 1))
  Buffers: shared hit=79240
  ->  Result  (cost=54.24..1258557.99 rows=142 width=86) (actual time=27.333..242.003 rows=10 loops=1)
        Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status, ca.name, (SubPlan 1)
        Buffers: shared hit=79240
        ->  Sort  (cost=54.24..54.59 rows=142 width=82) (actual time=0.867..0.879 rows=10 loops=1)
              Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status, ca.name
              Sort Key: c.number, c.first_name
              Sort Method: top-N heapsort  Memory: 27kB
              Buffers: shared hit=30
              ->  Hash Join  (cost=30.60..51.17 rows=142 width=82) (actual time=0.325..0.747 rows=136 loops=1)
                    Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status, ca.name
                    Inner Unique: true
                    Hash Cond: (ca.person_id = c.id)
                    Buffers: shared hit=30
                    ->  Seq Scan on public.person_calorie ca  (cost=0.00..18.57 rows=757 width=9) (actual time=0.010..0.149 rows=761 loops=1)
                          Output: ca.id, ca.name, ca.vegetable, ca.fruit, ca.cereal, ca.milk, ca.breakfast, ca.lunch, ca.dinner, ca.oil, ca.seed, ca.comments, ca.created_at, ca.updated_at, ca.person_id
                          Buffers: shared hit=11
                    ->  Hash  (cost=28.76..28.76 rows=147 width=77) (actual time=0.288..0.289 rows=136 loops=1)
                          Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status
                          Buckets: 1024  Batches: 1  Memory Usage: 24kB
                          Buffers: shared hit=19
                          ->  Seq Scan on public.person c  (cost=0.00..28.76 rows=147 width=77) (actual time=0.010..0.220 rows=136 loops=1)
                                Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status
                                Filter: (c.record_status AND ((c.role)::text = 'patient'::text))
                                Rows Removed by Filter: 648
                                Buffers: shared hit=19
        SubPlan 1
          ->  Limit  (cost=8862.69..8862.69 rows=1 width=4) (actual time=24.103..24.104 rows=1 loops=10)
                Output: di.income_day
                Buffers: shared hit=79210
                ->  Sort  (cost=8862.69..8862.95 rows=105 width=4) (actual time=24.099..24.099 rows=1 loops=10)
                      Output: di.income_day
                      Sort Key: di.income_day DESC
                      Sort Method: top-N heapsort  Memory: 25kB
                      Buffers: shared hit=79210
                      ->  Nested Loop  (cost=0.00..8862.16 rows=105 width=4) (actual time=1.141..23.986 rows=403 loops=10)
                            Output: di.income_day
                            Buffers: shared hit=79210
                            ->  Seq Scan on public.person p2  (cost=0.00..28.76 rows=1 width=4) (actual time=0.056..0.109 rows=1 loops=10)
                                  Output: p2.id, p2.number, p2.first_name, p2.last_name, p2.cellphone, p2.email, p2.gender, p2.birthday, p2.week, p2.program_know, p2.tuppers, p2.zone, p2.role, p2.other_food, p2.record_status, p2.doctor_id, p2.created_by_id, p2.updated_by_id, p2.deleted_by_id, p2.branch_id, p2.deleted_at, p2.created_at, p2.updated_at
                                  Filter: (p2.id = c.id)
                                  Rows Removed by Filter: 783
                                  Buffers: shared hit=190
                            ->  Seq Scan on public.daily_income di  (cost=0.00..8832.35 rows=105 width=8) (actual time=1.074..23.791 rows=403 loops=10)
                                  Output: di.id, di.income_day, di.amount, di.type, di.has_menu, di.authorized, di.menu, di.record_status, di.person_id, di.sale_id, di.payment_id, di.product_id, di.created_by_id, di.updated_by_id, di.deleted_by_id, di.branch_id, di.deleted_at, di.created_at, di.updated_at
                                  Filter: (di.person_id = c.id)
                                  Rows Removed by Filter: 73192
                                  Buffers: shared hit=79020
Planning time: 0.405 ms
Execution time: 242.111 ms

Offset = 120

Limit  (cost=1063580.54..1152207.57 rows=10 width=86) (actual time=3003.628..3211.188 rows=10 loops=1)
  Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status, ca.name, ((SubPlan 1))
  Buffers: shared hit=1029763
  ->  Result  (cost=56.24..1258560.00 rows=142 width=86) (actual time=38.376..3211.153 rows=130 loops=1)
        Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status, ca.name, (SubPlan 1)
        Buffers: shared hit=1029763
        ->  Sort  (cost=56.24..56.60 rows=142 width=82) (actual time=1.528..1.679 rows=130 loops=1)
              Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status, ca.name
              Sort Key: c.number, c.first_name
              Sort Method: quicksort  Memory: 44kB
              Buffers: shared hit=33
              ->  Hash Join  (cost=30.60..51.17 rows=142 width=82) (actual time=0.643..1.305 rows=136 loops=1)
                    Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status, ca.name
                    Inner Unique: true
                    Hash Cond: (ca.person_id = c.id)
                    Buffers: shared hit=30
                    ->  Seq Scan on public.person_calorie ca  (cost=0.00..18.57 rows=757 width=9) (actual time=0.015..0.224 rows=761 loops=1)
                          Output: ca.id, ca.name, ca.vegetable, ca.fruit, ca.cereal, ca.milk, ca.breakfast, ca.lunch, ca.dinner, ca.oil, ca.seed, ca.comments, ca.created_at, ca.updated_at, ca.person_id
                          Buffers: shared hit=11
                    ->  Hash  (cost=28.76..28.76 rows=147 width=77) (actual time=0.582..0.583 rows=136 loops=1)
                          Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status
                          Buckets: 1024  Batches: 1  Memory Usage: 24kB
                          Buffers: shared hit=19
                          ->  Seq Scan on public.person c  (cost=0.00..28.76 rows=147 width=77) (actual time=0.015..0.466 rows=136 loops=1)
                                Output: c.id, c.first_name, c.last_name, c.email, c.role, c.cellphone, c.number, c.gender, c.record_status
                                Filter: (c.record_status AND ((c.role)::text = 'patient'::text))
                                Rows Removed by Filter: 648
                                Buffers: shared hit=19
        SubPlan 1
          ->  Limit  (cost=8862.69..8862.69 rows=1 width=4) (actual time=24.678..24.679 rows=1 loops=130)
                Output: di.income_day
                Buffers: shared hit=1029730
                ->  Sort  (cost=8862.69..8862.95 rows=105 width=4) (actual time=24.673..24.673 rows=1 loops=130)
                      Output: di.income_day
                      Sort Key: di.income_day DESC
                      Sort Method: top-N heapsort  Memory: 25kB
                      Buffers: shared hit=1029730
                      ->  Nested Loop  (cost=0.00..8862.16 rows=105 width=4) (actual time=6.189..24.595 rows=225 loops=130)
                            Output: di.income_day
                            Buffers: shared hit=1029730
                            ->  Seq Scan on public.person p2  (cost=0.00..28.76 rows=1 width=4) (actual time=0.083..0.118 rows=1 loops=130)
                                  Output: p2.id, p2.number, p2.first_name, p2.last_name, p2.cellphone, p2.email, p2.gender, p2.birthday, p2.week, p2.program_know, p2.tuppers, p2.zone, p2.role, p2.other_food, p2.record_status, p2.doctor_id, p2.created_by_id, p2.updated_by_id, p2.deleted_by_id, p2.branch_id, p2.deleted_at, p2.created_at, p2.updated_at
                                  Filter: (p2.id = c.id)
                                  Rows Removed by Filter: 783
                                  Buffers: shared hit=2470
                            ->  Seq Scan on public.daily_income di  (cost=0.00..8832.35 rows=105 width=8) (actual time=6.093..24.419 rows=225 loops=130)
                                  Output: di.id, di.income_day, di.amount, di.type, di.has_menu, di.authorized, di.menu, di.record_status, di.person_id, di.sale_id, di.payment_id, di.product_id, di.created_by_id, di.updated_by_id, di.deleted_by_id, di.branch_id, di.deleted_at, di.created_at, di.updated_at
                                  Filter: (di.person_id = c.id)
                                  Rows Removed by Filter: 73370
                                  Buffers: shared hit=1027260
Planning time: 1.422 ms
Execution time: 3211.318 ms

Update 2

With the new query and offset = 0

Limit  (cost=1254485.43..1254485.46 rows=10 width=57) (actual time=3266.295..3266.301 rows=10 loops=1)
  Output: c.id, c.first_name, c.last_name, c.email, di.income_day, c.number
  Buffers: shared hit=1074838
  ->  Sort  (cost=1254485.43..1254485.79 rows=142 width=57) (actual time=3266.294..3266.298 rows=10 loops=1)
        Output: c.id, c.first_name, c.last_name, c.email, di.income_day, c.number
        Sort Key: c.number, c.first_name
        Sort Method: top-N heapsort  Memory: 27kB
        Buffers: shared hit=1074838
        ->  Nested Loop Left Join  (cost=8864.60..1254482.36 rows=142 width=57) (actual time=24.591..3265.901 rows=136 loops=1)
              Output: c.id, c.first_name, c.last_name, c.email, di.income_day, c.number
              Buffers: shared hit=1074838
              ->  Hash Join  (cost=30.60..51.17 rows=142 width=53) (actual time=0.335..1.366 rows=136 loops=1)
                    Output: c.id, c.first_name, c.last_name, c.email, c.number
                    Inner Unique: true
                    Hash Cond: (ca.person_id = c.id)
                    Buffers: shared hit=30
                    ->  Seq Scan on public.person_calorie ca  (cost=0.00..18.57 rows=757 width=4) (actual time=0.014..0.221 rows=761 loops=1)
                          Output: ca.id, ca.name, ca.vegetable, ca.fruit, ca.cereal, ca.milk, ca.breakfast, ca.lunch, ca.dinner, ca.oil, ca.seed, ca.comments, ca.created_at, ca.updated_at, ca.person_id
                          Buffers: shared hit=11
                    ->  Hash  (cost=28.76..28.76 rows=147 width=53) (actual time=0.301..0.302 rows=136 loops=1)
                          Output: c.id, c.first_name, c.last_name, c.email, c.number
                          Buckets: 1024  Batches: 1  Memory Usage: 20kB
                          Buffers: shared hit=19
                          ->  Seq Scan on public.person c  (cost=0.00..28.76 rows=147 width=53) (actual time=0.013..0.239 rows=136 loops=1)
                                Output: c.id, c.first_name, c.last_name, c.email, c.number
                                Filter: (c.record_status AND ((c.role)::text = 'patient'::text))
                                Rows Removed by Filter: 648
                                Buffers: shared hit=19
              ->  Limit  (cost=8834.00..8834.00 rows=1 width=4) (actual time=23.997..23.997 rows=1 loops=136)
                    Output: di.income_day
                    Buffers: shared hit=1074808
                    ->  Sort  (cost=8834.00..8834.26 rows=105 width=4) (actual time=23.993..23.993 rows=1 loops=136)
                          Output: di.income_day
                          Sort Key: di.income_day DESC
                          Sort Method: top-N heapsort  Memory: 25kB
                          Buffers: shared hit=1074808
                          ->  Seq Scan on public.daily_income di  (cost=0.00..8833.48 rows=105 width=4) (actual time=0.579..23.910 rows=221 loops=136)
                                Output: di.income_day
                                Filter: (di.person_id = c.id)
                                Rows Removed by Filter: 73374
                                Buffers: shared hit=1074808
Planning time: 0.334 ms
Execution time: 3266.392 ms

With the new query and offset = 120

Limit  (cost=1254487.74..1254487.76 rows=10 width=57) (actual time=3301.720..3301.726 rows=10 loops=1)
  Output: c.id, c.first_name, c.last_name, c.email, di.income_day, c.number
  Buffers: shared hit=1074838
  ->  Sort  (cost=1254487.44..1254487.79 rows=142 width=57) (actual time=3301.691..3301.715 rows=130 loops=1)
        Output: c.id, c.first_name, c.last_name, c.email, di.income_day, c.number
        Sort Key: c.number, c.first_name
        Sort Method: quicksort  Memory: 44kB
        Buffers: shared hit=1074838
        ->  Nested Loop Left Join  (cost=8864.60..1254482.36 rows=142 width=57) (actual time=27.048..3301.323 rows=136 loops=1)
              Output: c.id, c.first_name, c.last_name, c.email, di.income_day, c.number
              Buffers: shared hit=1074838
              ->  Hash Join  (cost=30.60..51.17 rows=142 width=53) (actual time=0.275..1.303 rows=136 loops=1)
                    Output: c.id, c.first_name, c.last_name, c.email, c.number
                    Inner Unique: true
                    Hash Cond: (ca.person_id = c.id)
                    Buffers: shared hit=30
                    ->  Seq Scan on public.person_calorie ca  (cost=0.00..18.57 rows=757 width=4) (actual time=0.010..0.216 rows=761 loops=1)
                          Output: ca.id, ca.name, ca.vegetable, ca.fruit, ca.cereal, ca.milk, ca.breakfast, ca.lunch, ca.dinner, ca.oil, ca.seed, ca.comments, ca.created_at, ca.updated_at, ca.person_id
                          Buffers: shared hit=11
                    ->  Hash  (cost=28.76..28.76 rows=147 width=53) (actual time=0.249..0.250 rows=136 loops=1)
                          Output: c.id, c.first_name, c.last_name, c.email, c.number
                          Buckets: 1024  Batches: 1  Memory Usage: 20kB
                          Buffers: shared hit=19
                          ->  Seq Scan on public.person c  (cost=0.00..28.76 rows=147 width=53) (actual time=0.009..0.207 rows=136 loops=1)
                                Output: c.id, c.first_name, c.last_name, c.email, c.number
                                Filter: (c.record_status AND ((c.role)::text = 'patient'::text))
                                Rows Removed by Filter: 648
                                Buffers: shared hit=19
              ->  Limit  (cost=8834.00..8834.00 rows=1 width=4) (actual time=24.258..24.259 rows=1 loops=136)
                    Output: di.income_day
                    Buffers: shared hit=1074808
                    ->  Sort  (cost=8834.00..8834.26 rows=105 width=4) (actual time=24.254..24.254 rows=1 loops=136)
                          Output: di.income_day
                          Sort Key: di.income_day DESC
                          Sort Method: top-N heapsort  Memory: 25kB
                          Buffers: shared hit=1074808
                          ->  Seq Scan on public.daily_income di  (cost=0.00..8833.48 rows=105 width=4) (actual time=0.589..24.171 rows=221 loops=136)
                                Output: di.income_day
                                Filter: (di.person_id = c.id)
                                Rows Removed by Filter: 73374
                                Buffers: shared hit=1074808
Planning time: 0.336 ms
Execution time: 3301.786 ms

【Question Comments】:

  • Please share the execution plans for both the fast and the slow case, using: EXPLAIN (ANALYZE, COSTS, VERBOSE, BUFFERS)

Tags: sql postgresql query-optimization


【Solution 1】:

When you change the offset to 120, the query ends up reading 1,027,260 blocks from the daily_income table.

Try moving the subquery into the join clause and let me know whether it helps; I also removed the extra join to the person table:

SELECT
c.id,
c.first_name "firstName",
c.last_name "lastName",
c.email "email",
di.income_day "lastDay"
FROM person c
INNER JOIN person_calorie ca
    ON c.id = ca.person_id
left join lateral (
  SELECT  di.income_day 
    FROM daily_income di
    where di.person_id = c.id
    ORDER BY di.income_day DESC
    LIMIT 1
) di on true 
WHERE c.record_status = true
AND c.role = 'patient'
ORDER BY c.number ASC, c.first_name ASC
OFFSET 120
LIMIT 10;

If you don't have an index on daily_income, add this one:

create index ix_daily_income on daily_income (person_id, income_day);

Indexes on these columns would also help:

person_calorie.person_id, person.record_status and person.role
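As a sketch, the DDL for those suggested indexes might look like this (the index names are arbitrary, not taken from the question):

```sql
create index ix_person_calorie_person_id on person_calorie (person_id);
create index ix_person_status_role on person (record_status, role);
```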

【Discussion】:

  • The time seems about the same, but now it also takes that long (3+ seconds) when OFFSET is 0.
  • I edited the question with the new EXPLAIN output.
  • @Ellebkey even though the duration is bad, the execution plan is much simpler. Is there any index on the (person_id, income_day) columns of the daily_income table? See the updated answer.
  • I don't think I have the index. Could you help with that too? I haven't done this before.
  • That did it — adding the index helped a lot; fetching the data now takes around 400 ms. How does the index make this work?
【Solution 2】:

First, make sure the tables have been vacuumed. This compacts the tables by getting rid of dead tuples.
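As a concrete example for the tables in the question, one manual pass (which also rebuilds the planner statistics via ANALYZE) could be:

```sql
VACUUM (ANALYZE, VERBOSE) daily_income;
VACUUM (ANALYZE, VERBOSE) person;
```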

Query optimization rests on the following basic principle:

Avoid unnecessary work.

  1. In your attempt, the correlated subquery can likely be removed entirely.

  2. The join to person_calorie is not needed, because you don't use any field from that table. Replace the join with an exists (...) condition in the WHERE clause.

  3. Make sure indexes exist on person_calorie.person_id, daily_income.person_id, person.record_status and person.role; they will help speed up the joins and the filtering. Note, however, that if the postgres query planner decides it has to scan the whole table anyway, an index may not yield any benefit, and indexes add overhead to write operations.

  4. Depending on the data size, you may also benefit from partial indexes on person.role and person.record_status, since partial indexes are smaller and therefore faster to load into memory and to use.
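As a sketch of point 4, a partial index restricted to the rows this query actually reads (the index name and the choice of key columns are assumptions, not taken from the question):

```sql
-- Covers only active patients; predicate matches the query's WHERE clause,
-- key matches its ORDER BY, so the planner may walk it in order.
CREATE INDEX ix_person_active_patients
    ON person (number, first_name)
    WHERE record_status AND role = 'patient';
```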

Once you've tried these suggestions, I'd be curious to know how much of a gain they produce. The optimized query would be as follows (but note it is only an optimization if there are indexes postgresql can exploit to avoid sequentially scanning the tables):

SELECT
c.id,
c.first_name "firstName",
c.last_name "lastName",
c.email "email",
ld.income_day "lastDay"
FROM person c
LEFT JOIN LATERAL (
  SELECT income_day 
  FROM daily_income di
  WHERE di.person_id = c.id 
  ORDER BY 1 DESC
  LIMIT 1
) ld ON TRUE
WHERE c.record_status = true
  AND c.role = 'patient'
  AND EXISTS (
    SELECT 1 FROM person_calorie ca
    WHERE c.id = ca.person_id
  )
ORDER BY c.number, c.first_name
OFFSET {{ offset_rows }}
LIMIT 10;

Now for the elephant in the room: OFFSET. You have already experienced that large offset values lead to long execution times. This is because postgresql has to execute the query and then walk through the result set, discarding the first N records to reach an offset of N.

A less involved way of addressing this might be to try an index on person.number and person.first_name. This may allow postgresql to use the index for sorting (you'll have to confirm this by sharing the query execution plan after creating the index and rebuilding the table statistics).
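A minimal sketch of that experiment, assuming nothing beyond the column names in the question (the index name is arbitrary):

```sql
CREATE INDEX ix_person_number_first_name ON person (number, first_name);
ANALYZE person;  -- rebuild table statistics so the planner considers the new index
```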

The limit-offset approach lets your application jump to arbitrary pages.

For example, an application endpoint /api/query_result?page=1000&results_per_page=10 would let the end user fetch 10 rows at offset 10000.

If you are willing to sacrifice random page access and only let end users fetch the next page, you can use a cursor instead and fetch 10 rows each time the next page is requested. The database has to hold the result set once per active end user, so this may be appropriate if your application has a fixed number of heavy users (e.g. an internal admin panel).
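A rough sketch of the cursor approach (the cursor name and column list are illustrative); note that the cursor only lives for the duration of the transaction unless declared WITH HOLD:

```sql
BEGIN;
DECLARE patient_cur CURSOR FOR
    SELECT id, first_name, last_name, email
    FROM person
    WHERE record_status AND role = 'patient'
    ORDER BY number, first_name;
FETCH 10 FROM patient_cur;  -- page 1
FETCH 10 FROM patient_cur;  -- page 2, continues where the previous fetch stopped
COMMIT;                     -- closes the cursor
```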

If you are not willing to sacrifice random page access, then this becomes a genuinely interesting problem, and the appropriate solution will be tightly coupled to your use case and to your tolerance for the query occasionally returning results in a stale or incorrect order.

You can build a materialized view that adds an extra column rn, and filter on rn instead of using offset. This approach gives faster pagination, but your results will be stale whenever rows are added to daily_income, person.number is updated, or new records are inserted into person.

The materialized view could be defined as:

create materialized view my_matview as
select
 p.id,
 p.first_name "firstName",
 p.last_name "lastName",
 p.email "email",
 MAX(di.income_day) "lastDay",
 ROW_NUMBER () OVER (ORDER BY p.number, p.first_name) - 1 rn
from person p
join daily_income di
  on p.id = di.person_id
where p.record_status = true
  and p.role = 'patient'
  and exists (
    SELECT 1 FROM person_calorie ca
    WHERE p.id = ca.person_id
  )
group by p.id; -- p.id is the primary key, so the other person columns are functionally dependent

Then create an index on the my_matview.rn column:

create unique index idx_my_matview_rn on my_matview(rn)

The view can be refreshed on demand:

REFRESH MATERIALIZED VIEW my_matview;

It is up to you how often to refresh this view (on a schedule, or via a trigger).
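For the trigger route, one possible (untested) sketch is a statement-level trigger that refreshes the view after writes to daily_income. REFRESH ... CONCURRENTLY avoids locking out readers, but requires a unique index on the view, such as the one on rn:

```sql
CREATE OR REPLACE FUNCTION refresh_my_matview() RETURNS trigger AS $$
BEGIN
  -- CONCURRENTLY needs the unique index on my_matview(rn)
  REFRESH MATERIALIZED VIEW CONCURRENTLY my_matview;
  RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_refresh_my_matview
AFTER INSERT OR UPDATE OR DELETE ON daily_income
FOR EACH STATEMENT EXECUTE FUNCTION refresh_my_matview();
```

(On PostgreSQL versions before 11, write EXECUTE PROCEDURE instead of EXECUTE FUNCTION.) Keep in mind a full refresh per statement can be expensive on a busy table; a scheduled refresh may be cheaper.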

The query serving the endpoint can then simply be:

select * from my_matview where rn >= {{ offset_rows }} order by rn limit 10

【Discussion】:

  • Hi @Haleemur. Thanks a lot for the insight; creating the index from the answer above did the trick. But I'd like to come back on some of your points. I actually do need a field from the person_calorie table; I removed it from the question without realizing that it also affects the query.
  • On the application side: yes, I have a feature that uses pagination to jump from the start of the results to the last records, and there are now 1500+ rows. If necessary, I guess they would be willing to give up that feature. However, they would still want to keep the "search option" where I do c.first_name LIKE '%string%', which I guess would be a similar situation.
  • VACUUM compacts the tables.