【Question title】: Postgres HA (based on WAL-shipping) fails
【Posted】: 2024-01-09 22:58:01
【Question description】:

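I'm hoping someone can help me with a WAL-shipping and warm standby problem. My standby system ran happily for weeks, then suddenly started looking for a .history file that doesn't exist. Then it crashed, and I can't restart it successfully without rebuilding the standby server.

Both systems run CentOS 4.5 and Postgres 8.4.1. They use NFS to store the WAL files from production on the standby server.

Log excerpts, with my comments: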

[** Recovery is running normally **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001000000830000005B
WAL file path           : /var/tafkan_backup_from_db1/00000001000000830000005B
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 00000001000000830000004D and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
running restore         : OK

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001000000830000005B
WAL file path           : /var/tafkan_backup_from_db1/00000001000000830000005B
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001000000830000005B" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 000000000000000000000000 and later
running restore         : OK

[** All of a sudden it starts looking for .history files **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000002.history
WAL file path           : /var/tafkan_backup_from_db1/00000002.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
not restored
history file not found
Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000001.history
WAL file path           : /var/tafkan_backup_from_db1/00000001.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000001.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000001.history': No such file or directory
not restored
history file not found

[** I stopped Postgres, renamed recovery.done to recovery.conf, and restarted it. **]

Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 00000002.history
WAL file path           : /var/tafkan_backup_from_db1/00000002.history
Restoring to            : pg_xlog/RECOVERYHISTORY
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/00000002.history" "pg_xlog/RECOVERYHISTORY"
Keep archive history    : 000000000000000000000000 and later
running restore         :cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
cp: cannot stat `/var/tafkan_backup_from_db1/00000002.history': No such file or directory
not restored
history file not found
Trigger file            : /tmp/pgsql.trigger
Waiting for WAL file    : 0000000200000083000000A2
WAL file path           : /var/tafkan_backup_from_db1/0000000200000083000000A2
Restoring to            : pg_xlog/RECOVERYXLOG
Sleep interval          : 2 seconds
Max wait interval       : 0 forever
Command for restore     : cp "/var/tafkan_backup_from_db1/0000000200000083000000A2" "pg_xlog/RECOVERYXLOG"
Keep archive history    : 000000000000000000000000 and later
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...
WAL file not present yet. Checking for trigger file...

[** This file is not present. All WAL files start with 00000001. **] 

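Any ideas? I don't even know what a .history file is, and the (mostly excellent) documentation isn't very clear about it.

P.S. I wish I were running virtual machines, so I could use link text and not have to worry about this application-level HA nonsense :-)

Update: Here are some logs from the standby server from around that time. It looks like something made the server stop recovering and come online, but I don't know what. I'm fairly sure nothing could have created the trigger file.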

2010-01-20 03:30:15 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005A" from archive
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005B" from archive
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  record with zero length at 83/5BFA2FF8
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  redo done at 83/5BFA2FAC
2010-01-20 03:30:23 EST 4b3a5c63.401b LOG:  last completed transaction was at log time 2010-01-20 03:28:04.594399-05
2010-01-20 03:30:25 EST 4b3a5c63.401b LOG:  restored log file "00000001000000830000005B" from archive
2010-01-20 03:30:37 EST 4b3a5c63.401b LOG:  selected new timeline ID: 2
2010-01-20 03:30:49 EST 4b3a5c63.401b LOG:  archive recovery complete
2010-01-20 03:30:59 EST 4b3a5c62.4019 LOG:  database system is ready to accept connections

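【Question comments】:

  • Hi sbleon, I only want to back up WAL files to a standby location, I don't need a hot standby. Can you help?
  • @indyaah, check out the excellent PostgreSQL docs for your version.
  • Thanks for the help, friend!! :D

Tags: postgresql high-availability log-shipping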


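【Solution 1】:

A completely different approach to HA might be to host the PG database on a DRBD device shared by the two machines.

【Comments】:

  • Thanks for the suggestion! If I can't get WAL-shipping to work reliably, I may do that.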
【Solution 2】:

Are you using your own restore script/program? If so, please don't. Use pg_standby from PostgreSQL contrib.

Otherwise, ignore the .history files.

【Comments】:

  • I am using pg_standby. recovery.conf contains: "restore_command = 'pg_standby -l -d -s 2 -t /tmp/pgsql.trigger /var/tafkan_backup_from_db1 %f %p %r 2>>standby.log'". I can't ignore the .history files, because when pg_standby starts looking for them, recovery fails, recovery.conf gets moved to recovery.done, and WAL files start piling up fast.
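【Solution 3】:

Your replica came online at some point. The request for "00000002.history" is a lookup for the history file of timeline 00000002, whereas the rest of your logs start with 00000001, the original timeline.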
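
To make the naming convention concrete, here is a minimal Python sketch that splits archive file names into their parts. It assumes the usual layout of 24 hex digits per WAL segment (8 for the timeline, 8 for the log ID, 8 for the segment) and `<timeline>.history` for timeline history files. The function name `describe_wal_name` is made up for this illustration and is not part of PostgreSQL.

```python
import re

# WAL segment names: 24 hex digits = timeline, log ID, segment (8 digits each).
SEGMENT_RE = re.compile(r"^([0-9A-F]{8})([0-9A-F]{8})([0-9A-F]{8})$")
# Timeline history files: 8 hex digits followed by ".history".
HISTORY_RE = re.compile(r"^([0-9A-F]{8})\.history$")


def describe_wal_name(name):
    """Return a dict describing an archive file name, or None if unrecognized."""
    match = SEGMENT_RE.match(name)
    if match:
        timeline, log_id, segment = (int(part, 16) for part in match.groups())
        return {"kind": "segment", "timeline": timeline,
                "log": log_id, "segment": segment}
    match = HISTORY_RE.match(name)
    if match:
        return {"kind": "history", "timeline": int(match.group(1), 16)}
    return None


if __name__ == "__main__":
    # File names taken from the pg_standby output in the question.
    for name in ("00000001000000830000005B",
                 "00000002.history",
                 "0000000200000083000000A2"):
        print(name, "->", describe_wal_name(name))
```

Run against the names in the question, the first file belongs to timeline 1 and the last two to timeline 2, which is why the standby can no longer find anything in an archive that only holds timeline 1 files.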

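I would check your PostgreSQL logs from just before it started looking for history files, to see whether there is any sign that the database came online, even for a moment.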
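
One rough way to do that check is to scan the server log for the messages PostgreSQL prints when it leaves recovery. The sketch below is only an illustration: the marker strings are copied from the log excerpt in the question, and the helper name `find_promotion_events` is hypothetical.

```python
# Messages that appear in the question's log when the standby left recovery.
PROMOTION_MARKERS = (
    "selected new timeline ID",
    "archive recovery complete",
    "database system is ready to accept connections",
)

# Sample lines shortened from the log excerpt in the question.
SAMPLE_LINES = [
    'LOG:  restored log file "00000001000000830000005B" from archive',
    "LOG:  selected new timeline ID: 2",
    "LOG:  archive recovery complete",
]


def find_promotion_events(lines):
    """Yield (line_number, text) for every line containing a promotion marker."""
    for number, line in enumerate(lines, start=1):
        if any(marker in line for marker in PROMOTION_MARKERS):
            yield number, line.rstrip("\n")


if __name__ == "__main__":
    # Pass an open log file instead of SAMPLE_LINES to scan a real log.
    for number, text in find_promotion_events(SAMPLE_LINES):
        print(f"{number}: {text}")
```

Any hit means the standby ended recovery at that point, and the timestamps on the matching lines show where to start looking for the cause.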

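【Comments】:

  • Thanks, Matthew. I've added some logs to my question. You're right that it came online, but I can't imagine what caused it, or why.
  • Did anything happen on the source side? The entry "record with zero length at 83/5BFA2FF8" looks like it was trying to restore a partial WAL file. IIRC, when it hits an invalid record in a WAL file, it rolls back to the last good record in that file and then comes online, whether or not the trigger file exists. I would look at both systems' logs around 2010-01-20 03:28:04.594399-05 to see whether there are any errors in Postgres, the operating system, or the network.
  • That behavior makes sense. If the backup sees the primary fail, it assumes the primary is dead and that it should pick up the slack. I suspect there may be a network problem here. I'm going to look at it from that angle. Thanks!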
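
The partial-WAL hypothesis from the comments above can be checked in a crude way by looking for archived segments whose size is not the expected one. This is a sketch under assumptions: it takes PostgreSQL's default segment size of 16 MiB, the helper name `find_odd_sized_segments` is made up, and a file that is still being copied will also show up as short. A full-size file can still contain an invalid record, so a clean result rules out truncation only.

```python
import os
import re
import sys

SEGMENT_RE = re.compile(r"^[0-9A-F]{24}$")
DEFAULT_SEGMENT_BYTES = 16 * 1024 * 1024  # assumed default WAL segment size


def find_odd_sized_segments(archive_dir, expected_bytes=DEFAULT_SEGMENT_BYTES):
    """Return sorted (name, size) pairs for WAL segments of unexpected size."""
    suspects = []
    for name in os.listdir(archive_dir):
        if not SEGMENT_RE.match(name):
            continue  # skip .history, .backup and unrelated files
        size = os.path.getsize(os.path.join(archive_dir, name))
        if size != expected_bytes:
            suspects.append((name, size))
    return sorted(suspects)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Example: python check_segments.py /var/tafkan_backup_from_db1
    for name, size in find_odd_sized_segments(sys.argv[1]):
        print(f"{name}: {size} bytes")
```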
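【Solution 4】:

I was able to fix this by updating the CentOS operating system on both of my PostgreSQL servers. I therefore believe it was a symptom of some underlying network bug.

【Comments】: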