Transaction is 20x slower on production server

【Question Title】: Transaction is 20x slower on production server
【Posted】: 2019-03-29 17:19:04
【Question】:

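A test transaction (a series of updates and so on) runs in about 2 minutes on my development server. On the production server it takes about 25 minutes.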

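The server reads a file and inserts records. It starts out fast but gets slower and slower as it progresses. Every inserted record triggers an update to an aggregate table, and that update is what gets progressively slower. The aggregate update does query the table that the inserts are writing to.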

The configuration differs only in max_worker_processes (dev 8, prod 16), shared_buffers (dev 128MB, prod 512MB), and wal_buffers (dev 4MB, prod 16MB).

I tried adjusting some settings, and I also dumped the whole database and redid initdb in case it had not been upgraded properly (to 9.6). Nothing helped.

I'm hoping someone with experience in this area can tell me what to look for.

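Edit: After receiving some comments, I was able to figure out what was happening and work around the problem, but I think there must be a better way. Here is what happens: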

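At first the table holds no data for the relevant index values, and PostgreSQL comes up with this plan. Note that the data that is in the table is unrelated to the relevant "businessIdentifier" index or "transactionNumber".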

 Aggregate  (cost=16.63..16.64 rows=1 width=4) (actual time=0.031..0.031 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.63 rows=1 width=4) (actual time=0.028..0.028 rows=0 loops=1)
         ->  Index Scan using transactionlinedateindex on "transactionLine" ed  (cost=0.29..8.31 rows=1 width=5) (actual time=0.028..0.028 rows=0 loops=1)
               Index Cond: ((("businessIdentifier")::text = '36'::text) AND ("reconciliationNumber" = 4519))
         ->  Index Scan using transaction_pkey on transaction eh  (cost=0.29..8.31 rows=1 width=9) (never executed)
               Index Cond: ((("businessIdentifier")::text = '36'::text) AND (("transactionNumber")::text = (ed."transactionNumber")::text))
               Filter: ("transactionStatus" = 'posted'::"transactionStatusItemType")
 Planning time: 0.915 ms
 Execution time: 0.100 ms

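Then, as data is inserted, it turns into a very bad plan. In this example it takes 474 ms. It has to be executed thousands of times, depending on what is uploaded, so 474 ms is bad.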

 Aggregate  (cost=16.44..16.45 rows=1 width=4) (actual time=474.222..474.222 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.44 rows=1 width=4) (actual time=474.218..474.218 rows=0 loops=1)
         Join Filter: ((eh."transactionNumber")::text = (ed."transactionNumber")::text)
         ->  Index Scan using transaction_pkey on transaction eh  (cost=0.29..8.11 rows=1 width=9) (actual time=0.023..0.408 rows=507 loops=1)
               Index Cond: (("businessIdentifier")::text = '37'::text)
               Filter: ("transactionStatus" = 'posted'::"transactionStatusItemType")
         ->  Index Scan using transactionlineprovdateindex on "transactionLine" ed  (cost=0.29..8.31 rows=1 width=5) (actual time=0.934..0.934 rows=0 loops=507)
               Index Cond: (("businessIdentifier")::text = '37'::text)
               Filter: ("reconciliationNumber" = 4519)
               Rows Removed by Filter: 2520
 Planning time: 0.848 ms
 Execution time: 474.278 ms

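VACUUM ANALYZE fixes it. But you cannot run VACUUM ANALYZE until the transaction has committed. After VACUUM ANALYZE, PostgreSQL uses a different plan and it is back to about 0.1 ms.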

 Aggregate  (cost=16.63..16.64 rows=1 width=4) (actual time=0.072..0.072 rows=1 loops=1)
   ->  Nested Loop  (cost=0.57..16.63 rows=1 width=4) (actual time=0.069..0.069 rows=0 loops=1)
         ->  Index Scan using transactionlinedateindex on "transactionLine" ed  (cost=0.29..8.31 rows=1 width=5) (actual time=0.067..0.067 rows=0 loops=1)
               Index Cond: ((("businessIdentifier")::text = '37'::text) AND ("reconciliationNumber" = 4519))
         ->  Index Scan using transaction_pkey on transaction eh  (cost=0.29..8.31 rows=1 width=9) (never executed)
               Index Cond: ((("businessIdentifier")::text = '37'::text) AND (("transactionNumber")::text = (ed."transactionNumber")::text))
               Filter: ("transactionStatus" = 'posted'::"transactionStatusItemType")
 Planning time: 1.134 ms
 Execution time: 0.141 ms

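My workaround is to commit after about 100 inserts, run VACUUM ANALYZE, and then continue. The only problem is that if something in the rest of the data fails and is rolled back, those 100 records remain inserted.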

Is there a better way to handle this? Should I just upgrade to PostgreSQL 10 or 11, and would that help?

【Question Comments】:

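Please show us the EXPLAIN (ANALYZE, BUFFERS) output for both queries.
Also check whether the server is busy doing other things, and that it really has the memory and CPU that the configuration promises.
How big are the tables? Do they have proper indexes? How does the update work: by incrementing the existing row, or by recalculating the total? If performance degrades as data is added, that may mean the server has to scan the whole table to calculate the total.
Execution time: 4.045 ms. How fast do you want it to be?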
Included the EXPLAIN output that shows 0.5 seconds.

Tags: postgresql performance

【Solution 1】:

"Every inserted record triggers an update to an aggregate table, and that update is what gets progressively slower."

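Here is an idea: change the workflow to (1) import the external data into a table using the COPY interface, (2) index and ANALYZE that data, and (3) run the final UPDATE, with all the required joins and groupings, to do the actual transformation and update the aggregate table.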

If needed, all of this can be done in one long transaction.
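The three steps could look roughly like the sketch below. The staging table, its columns, the file path, and the aggregate table reconciliation_total are all hypothetical names made up for illustration; the question does not show the real schema or the real aggregate update, so treat this as an outline of the shape of the workflow rather than a drop-in script.

```sql
BEGIN;

-- (1) Load the raw file into a staging table.
--     Table name, columns, and file path are assumptions, not from the question.
CREATE TEMP TABLE staging_line (
    business_identifier   text,
    transaction_number    text,
    reconciliation_number integer,
    amount                numeric
) ON COMMIT DROP;

-- Server-side COPY reads a file on the database host and needs the matching
-- privileges; from a client, psql's \copy does the same over the connection.
COPY staging_line FROM '/path/to/upload.csv' WITH (FORMAT csv, HEADER true);

-- (2) Index the staged rows and refresh planner statistics for them.
CREATE INDEX ON staging_line (business_identifier, reconciliation_number);
ANALYZE staging_line;

-- (3) Update the (hypothetical) aggregate table once, set-based,
--     instead of once per inserted record.
UPDATE reconciliation_total AS t
SET    total = t.total + s.sum_amount
FROM  (SELECT business_identifier,
              reconciliation_number,
              sum(amount) AS sum_amount
       FROM   staging_line
       GROUP  BY business_identifier, reconciliation_number) AS s
WHERE  t.business_identifier   = s.business_identifier
AND    t.reconciliation_number = s.reconciliation_number;

COMMIT;
```

Because everything runs between BEGIN and COMMIT, a failure at any step rolls back the whole load.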

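Only if the whole thing locks some important database objects for too long should you consider splitting it into separate transactions or batches (processing the data partitioned in some generic way, by date/time or by ID).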

"But you cannot run VACUUM ANALYZE until the transaction has committed."

To get updated cost estimates for the query plan, you only need ANALYZE, not VACUUM.
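Unlike VACUUM, ANALYZE is allowed inside a transaction block, so the statistics can be refreshed partway through the load without committing. A minimal sketch, using the table names that appear in the plans above; the point at which to run it (after roughly the first 100 inserts, as in the question's workaround) is an assumption to tune:

```sql
BEGIN;

-- ... first batch of INSERTs, each followed by its aggregate-table update ...

-- Refresh planner statistics without committing.
-- ANALYZE samples the rows inserted so far by this same transaction.
ANALYZE "transactionLine";
ANALYZE "transaction";

-- ... remaining INSERTs and aggregate updates, now planned with fresh statistics ...

COMMIT;  -- or ROLLBACK, which discards every inserted row, including the first batch
```

This keeps the load atomic, which the commit-every-100-rows workaround does not.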

【Comments】:

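Thanks, filiprem. I ran ANALYZE after inserting some data and it fixed the problem. I had considered changing the workflow, but I think I would still need to do the update for each record most of the time, and it now runs in 0.14 ms :) No change needed.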