【Question title】: pg_dump Crashing PostgreSQL Server
【Posted】: 2021-08-19 19:28:19
【Question】:

When I run pg_dump on a database that contains a large number of blobs, PostgreSQL crashes while executing this query:

pg_dump: reading large objects
pg_dump: error: query failed: SSL SYSCALL error: EOF detected
pg_dump: error: query was: SELECT l.oid, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = l.lomowner) AS rolname, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM pg_catalog.unnest(coalesce(l.lomacl,pg_catalog.acldefault('L',l.lomowner))) WITH ORDINALITY AS perm(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('L',l.lomowner))) AS init(init_acl) WHERE acl = init_acl)) as foo) AS lomacl, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('L',l.lomowner))) WITH ORDINALITY AS initp(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM pg_catalog.unnest(coalesce(l.lomacl,pg_catalog.acldefault('L',l.lomowner))) AS permp(orig_acl) WHERE acl = orig_acl)) as foo) AS rlomacl, NULL AS initlomacl, NULL AS initrlomacl FROM pg_largeobject_metadata l LEFT JOIN pg_init_privs pip ON (l.oid = pip.objoid AND pip.classoid = 'pg_largeobject'::regclass AND pip.objsubid = 0)

I have experimented with the query, and the two array_agg() columns, lomacl and rlomacl, appear to be the culprits.

This is AWS Aurora PostgreSQL 11:

SELECT version();
                                             version
-------------------------------------------------------------------------------------------------
PostgreSQL 11.9 on x86_64-pc-linux-gnu, compiled by x86_64-pc-linux-gnu-gcc (GCC) 7.4.0, 64-bit

Logs:

2021-08-19 19:47:46 UTC::@:[46753]:LOG: server process (PID 21837) was terminated by signal 9: Killed
2021-08-19 19:47:46 UTC::@:[46753]:DETAIL: Failed process was running: SELECT l.oid, (SELECT rolname FROM pg_catalog.pg_roles WHERE oid = l.lomowner) AS rolname, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM pg_catalog.unnest(coalesce(l.lomacl,pg_catalog.acldefault('L',l.lomowner))) WITH ORDINALITY AS perm(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('L',l.lomowner))) AS init(init_acl) WHERE acl = init_acl)) as foo) AS lomacl, (SELECT pg_catalog.array_agg(acl ORDER BY row_n) FROM (SELECT acl, row_n FROM pg_catalog.unnest(coalesce(pip.initprivs,pg_catalog.acldefault('L',l.lomowner))) WITH ORDINALITY AS initp(acl,row_n) WHERE NOT EXISTS ( SELECT 1 FROM pg_catalog.unnest(coalesce(l.lomacl,pg_catalog.acldefault('L',l.lomowner))) AS permp(orig_acl) WHERE acl = orig_acl)) as foo) AS rlomacl, NULL AS initlomacl, NULL AS initrlomacl FROM pg_largeobject_metadata l LEFT JOIN pg_init_privs pip ON (l.oid = pip.objoid AND pip.classoid = 'pg_largeobject'::regclass AND pip.objsubid = 0)
2021-08-19 19:47:46 UTC::@:[46753]:LOG: terminating any other active server processes
2021-08-19 19:47:46 UTC::@:[46753]:FATAL: Can't handle storage runtime process crash
2021-08-19 19:47:46 UTC::@:[46753]:LOG: database system is shut down

Any troubleshooting steps or suggestions?

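【Comments】:

  • What is your exact Postgres version (select version() will tell you), and which operating system are you using?
  • signal 9: Killed seems to indicate that it crashed hard?
  • It was probably killed by the OOM killer. Check /var/log/kern.log
  • Try pg_dump --no-blobs so that they are not dumped, just to confirm that they are the problem.
  • Yes, I did a dump without blobs and it completed without problems.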

Tags: postgresql


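【Solution 1】:

pg_dump uses a join between pg_largeobject_metadata and pg_init_privs to get the list of large object OIDs to dump.

Now either the database server has too little memory, or you have a lot of large objects and your work_mem is set so high that the database server machine runs out of memory. Since you did not disable memory overcommit on the database server's operating system, the OOM killer kills your process.

Either increase the available RAM or use a more conservative work_mem setting. I must add that before v13, PostgreSQL could easily build hash tables larger than work_mem by mistake. Perhaps you can set enable_hashjoin to off for the duration of the dump.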
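
One way to apply these settings only to the dump session is libpq's PGOPTIONS environment variable, which pg_dump honors because it connects through libpq. The following is a minimal sketch, not something confirmed in this thread to avoid the crash. The host, user, and database names and the 64MB value are placeholders, and build_pgoptions and RUN_DUMP are hypothetical helpers introduced here for illustration.

```shell
#!/bin/sh
# Sketch: pass session-level settings to pg_dump through PGOPTIONS.
# DB_HOST, DB_USER, DB_NAME and the 64MB value are placeholders,
# not values taken from the original post.
DB_HOST="${DB_HOST:-db.example.internal}"
DB_USER="${DB_USER:-postgres}"
DB_NAME="${DB_NAME:-mydb}"

# Hypothetical helper: builds the option string for a given work_mem.
# libpq applies each "-c name=value" when the session starts, so only
# connections opened with this environment are affected.
build_pgoptions() {
    printf '%s' "-c enable_hashjoin=off -c work_mem=$1"
}

PGOPTIONS="$(build_pgoptions 64MB)"
export PGOPTIONS

# Set RUN_DUMP=1 to run the dump; otherwise only print the command.
if [ "${RUN_DUMP:-0}" = "1" ]; then
    pg_dump --host="$DB_HOST" --username="$DB_USER" --format=custom \
            --file="$DB_NAME.dump" "$DB_NAME"
else
    echo "PGOPTIONS='$PGOPTIONS' pg_dump --host=$DB_HOST --username=$DB_USER --format=custom --file=$DB_NAME.dump $DB_NAME"
fi
```

Because the settings travel with the connection, the server-wide configuration stays untouched and other sessions keep their normal planner behavior.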

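【Comments】:

  • This is running on a db.r5.12xlarge instance (48 vCPU / 384 GB RAM), and I have tried work_mem values between 4MB and 16GB. I'll take a look at that hashjoin setting, thanks! I'm also currently running a vacuum that is removing 25+ million orphaned large objects.
  • That will certainly help. Large objects have a lot of problems, particularly if you have many of them.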