【Question title】: PostgreSQL 9.4 cascading replication failover
【Posted】: 2017-10-02 16:04:33
【Question】:

Environment:

Ubuntu 14.04 + PostgreSQL 9.4.

This is my setup ('->' denotes physical streaming replication, PSR):

Master1 -> Slave1 (primary) -> Slave2

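This behaves correctly: changes on Master1 are reflected in Slave1, and then in Slave2.

If I disable Master1 and promote Slave1 to master using the trigger_file, Slave1 is promoted successfully and I can write to it.

However, replication between the newly promoted Slave1 and Slave2 stops.

Is this the expected behaviour? I expected replication to continue like this: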

Slave1 -> Slave2

so that writes to Slave1 would be reflected in Slave2.

Update

Logs:

Slave1 promotion:

2017-10-03 16:43:20 BST  @ LOCATION:  libpqrcv_connect, libpqwalreceiver.c:107
2017-10-03 16:43:25 BST  @ FATAL:  XX000: could not connect to the primary server: could not connect to server: Connection refused
        Is the server running on host "192.168.20.55" and accepting
        TCP/IP connections on port 5432?

2017-10-03 16:43:25 BST  @ LOCATION:  libpqrcv_connect, libpqwalreceiver.c:107
2017-10-03 16:43:30 BST  @ LOG:  00000: trigger file found: /var/lib/postgresql/9.4/main/failover_trigger.5432
2017-10-03 16:43:30 BST  @ LOCATION:  CheckForStandbyTrigger, xlog.c:11440
2017-10-03 16:43:30 BST  @ LOG:  00000: redo done at 0/19000740
2017-10-03 16:43:30 BST  @ LOCATION:  StartupXLOG, xlog.c:7032
2017-10-03 16:43:30 BST  @ LOG:  00000: last completed transaction was at log time 2017-10-03 16:41:23.430752+01
2017-10-03 16:43:30 BST  @ LOCATION:  StartupXLOG, xlog.c:7037
2017-10-03 16:43:30 BST  @ LOG:  00000: selected new timeline ID: 2
2017-10-03 16:43:30 BST  @ LOCATION:  StartupXLOG, xlog.c:7153
2017-10-03 16:43:30 BST  @ LOG:  00000: archive recovery complete
2017-10-03 16:43:30 BST  @ LOCATION:  exitArchiveRecovery, xlog.c:5459
2017-10-03 16:43:30 BST  @ LOG:  00000: MultiXact member wraparound protections are now enabled
2017-10-03 16:43:30 BST  @ LOCATION:  DetermineSafeOldestOffset, multixact.c:2619
2017-10-03 16:43:30 BST  @ LOG:  00000: database system is ready to accept connections
2017-10-03 16:43:30 BST  @ LOCATION:  reaper, postmaster.c:2795
2017-10-03 16:43:30 BST  @ LOG:  00000: autovacuum launcher started
2017-10-03 16:43:30 BST  @ LOCATION:  AutoVacLauncherMain, autovacuum.c:431

Slave2:

2017-10-03 16:43:30 BST  @ LOG:  00000: replication terminated by primary server
2017-10-03 16:43:30 BST  @ DETAIL:  End of WAL reached on timeline 1 at 0/190007A8.
2017-10-03 16:43:30 BST  @ LOCATION:  WalReceiverMain, walreceiver.c:446
2017-10-03 16:43:30 BST  @ LOG:  00000: fetching timeline history file for timeline 2 from primary server
2017-10-03 16:43:30 BST  @ LOCATION:  WalRcvFetchTimeLineHistoryFiles, walreceiver.c:669
2017-10-03 16:43:30 BST  @ LOG:  00000: record with zero length at 0/190007A8
2017-10-03 16:43:30 BST  @ LOCATION:  ReadRecord, xlog.c:4184
2017-10-03 16:43:30 BST  @ LOG:  00000: restarted WAL streaming at 0/19000000 on timeline 1
2017-10-03 16:43:30 BST  @ LOCATION:  WalReceiverMain, walreceiver.c:374
2017-10-03 16:43:30 BST  @ LOG:  00000: replication terminated by primary server
2017-10-03 16:43:30 BST  @ DETAIL:  End of WAL reached on timeline 1 at 0/190007A8.

Slave1 IP:

192.168.20.56

Slave2 IP:

192.168.20.53

pg_hba.conf allows Slave2 to connect to Slave1 for replication:

Slave1 pg_hba.conf excerpt:

host    replication     replication     192.168.20.53/32        trust 

Slave1 recovery.done:

standby_mode = 'on'
primary_conninfo = 'user=replication host=192.168.20.55 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
trigger_file = '/var/lib/postgresql/9.4/main/failover_trigger.5432'

Slave2 recovery.conf:

standby_mode = 'on'
primary_conninfo = 'user=replication host=192.168.20.56 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'

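Any help is much appreciated.

Update and solution

Thanks to @Vao Tsun's answer: adding recovery_target_timeline = 'latest' to Slave2's recovery.conf and restarting the Slave2 PostgreSQL server (a restart, not a reload) allowed replication to resume: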

standby_mode = 'on'
primary_conninfo = 'user=replication host=192.168.20.56 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
recovery_target_timeline = 'latest'

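【Comments】:

  • Please update the post with logs, and check whether the slave2 recovery config has timeline='latest'. Promoting slave1 switches it to the next timeline.
  • I have updated my question with the log files from slave1 and slave2, and with my recovery.conf files.

Tags: postgresql replication cascade postgresql-9.4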


【Solution 1】:

In the slave1 log you see:

2017-10-03 16:43:30 BST  @ LOG:  00000: selected new timeline ID: 2

And in slave2:

2017-10-03 16:43:30 BST  @ DETAIL:  End of WAL reached on timeline 1 at 0/190007A8.

So slave2 did not switch to timeline 2 after the promotion.
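
The same comparison can be done mechanically. The following is only a rough sketch in Python, not a PostgreSQL tool: the function names are made up for illustration, and the two regular expressions assume the exact message wording quoted from the 9.4 logs in the question.

```python
import re

# Patterns assume the message wording seen in the 9.4 logs quoted in the question.
NEW_TIMELINE = re.compile(r"selected new timeline ID: (\d+)")
END_OF_WAL = re.compile(
    r"End of WAL reached on timeline (\d+) at ([0-9A-F]+/[0-9A-F]+)"
)


def promoted_timeline(log_text):
    """Return the timeline ID chosen at promotion, or None if not logged."""
    match = NEW_TIMELINE.search(log_text)
    return int(match.group(1)) if match else None


def stuck_timeline(log_text):
    """Return (timeline, location) from the last 'End of WAL reached' line, or None."""
    matches = END_OF_WAL.findall(log_text)
    if not matches:
        return None
    timeline, location = matches[-1]
    return int(timeline), location


def standby_is_behind(upstream_log, standby_log):
    """True when the standby keeps ending on an older timeline than the upstream selected."""
    new_tl = promoted_timeline(upstream_log)
    stuck = stuck_timeline(standby_log)
    if new_tl is None or stuck is None:
        return False
    return stuck[0] < new_tl


# Log lines copied from the question.
slave1_log = "2017-10-03 16:43:30 BST  @ LOG:  00000: selected new timeline ID: 2"
slave2_log = (
    "2017-10-03 16:43:30 BST  @ DETAIL:  "
    "End of WAL reached on timeline 1 at 0/190007A8."
)

print(promoted_timeline(slave1_log))
print(stuck_timeline(slave2_log))
print(standby_is_behind(slave1_log, slave2_log))
```

With the lines from the question, the promoted server reports timeline 2 while the standby keeps reaching the end of WAL on timeline 1, which is the mismatch described above.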

As I said in the comments, you need recovery_target_timeline = 'latest' in slave2's recovery.conf:

https://www.postgresql.org/docs/current/static/recovery-target-settings.html

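recovery_target_timeline (string): Specifies recovering into a particular timeline. The default is to recover along the same timeline that was current when the base backup was taken. Setting this to latest recovers to the latest timeline found in the archive, which is useful in a standby server. Other than that you only need to set this parameter in complex re-recovery situations, where you need to return to a state that itself was reached after a point-in-time recovery. See Section 25.3.5 for discussion.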
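
To check a cascaded standby's recovery.conf for this setting before a failover, something like the following Python sketch may help. It is an illustration only: the helper names are hypothetical, and the parser deliberately handles just the single-quoted name = 'value' form used in the question, not every syntax PostgreSQL accepts.

```python
import re

# Matches lines of the form: name = 'value'  (optionally followed by a # comment).
# Simplification: unquoted values and other accepted syntaxes are ignored.
SETTING = re.compile(r"^\s*([A-Za-z_]\w*)\s*=\s*'((?:[^']|'')*)'\s*(?:#.*)?$")


def parse_recovery_conf(text):
    """Return a dict of the single-quoted settings found in recovery.conf text."""
    settings = {}
    for line in text.splitlines():
        match = SETTING.match(line)
        if match:
            # A doubled quote inside a value stands for one literal quote.
            settings[match.group(1)] = match.group(2).replace("''", "'")
    return settings


def follows_latest_timeline(text):
    """True if recovery_target_timeline is set to 'latest'."""
    return parse_recovery_conf(text).get("recovery_target_timeline") == "latest"


# Slave2's recovery.conf as originally posted in the question.
original_conf = """standby_mode = 'on'
primary_conninfo = 'user=replication host=192.168.20.56 port=5432 sslmode=prefer sslcompression=1 krbsrvname=postgres'
"""

# The same file with the setting from this answer added.
fixed_conf = original_conf + "recovery_target_timeline = 'latest'\n"

print(follows_latest_timeline(original_conf))
print(follows_latest_timeline(fixed_conf))
```

As the question's update notes, the standby had to be restarted, not just reloaded, for the new recovery.conf to take effect.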

【Discussion】:

  • Thank you very much for your answer. Restarting Slave2 with that setting in recovery.conf restarted replication.
  • Glad it helped!