【Posted】: 2018-01-30 16:28:40
【Question】:
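We have three nodes, each in a different AWS data center. One of them is the only seed node and the exclusive owner of a singleton, which we set up by using .withDataCenter on the singleton proxy settings. We can get the cluster working as designed by starting the seed node first and then the other nodes, but if any node goes down, the only way to get them talking again seems to be restarting the whole cluster in the same order. We would like the nodes to try reconnecting to the seed node and resume normal operation when possible.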
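A setup like the one described might be configured roughly as follows for one of the non-seed nodes. This is only a sketch, not the poster's actual configuration: it assumes classic Netty TCP remoting (suggested by the akka.tcp addresses in the logs below), the host and port values are placeholders, and the data center name is taken from the seed node log.

```hocon
# Hypothetical application.conf for a non-seed node.
# All hosts and ports are placeholders, not values from the original post.
akka {
  actor.provider = "cluster"

  remote.netty.tcp {
    hostname = "<this-node-host>"  # address this node advertises to its peers
    port = 2551                    # placeholder port
  }

  cluster {
    # The single seed node that every other node tries to join through
    seed-nodes = ["akka.tcp://application@<seed-host>:<seed-port>"]

    # Each node declares its own data center ("indonesia" appears in the logs)
    multi-data-center.self-data-center = "indonesia"
  }
}
```

The address a node advertises in `hostname` is the one that shows up in the member list on the other nodes, so it is worth comparing against the addresses printed in the logs.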
When I shut down a non-seed node, the seed node marks it as UNREACHABLE and starts periodically logging the following:
Association with remote system [akka.tcp://application@xxx.xx.x.xxx:xxxx] has failed, address is now gated for [5000] ms. Reason: [Association failed with [akka.tcp://application@xxx.xx.x.xxx:xxxx]] Caused by: [connection timed out: /xxx.xx.x.xxx:xxxx]
Fair enough. However, when I bring the node back up, the newly started node begins repeating:
2018-01-29 22:59:09,587 [DEBUG]: akka.cluster.ClusterCoreDaemon in application-akka.actor.default-dispatcher-18 -
now supervising Actor[akka://application/system/cluster/core/daemon/joinSeedNodeProcess-16#-1572745962]
2018-01-29 22:59:09,587 [DEBUG]: akka.cluster.JoinSeedNodeProcess in application-akka.actor.default-dispatcher-3 -
started (akka.cluster.JoinSeedNodeProcess@2ae57537)
2018-01-29 22:59:09,755 [DEBUG]: akka.cluster.JoinSeedNodeProcess in application-akka.actor.default-dispatcher-2 -
stopped
Seed node log:
2018-01-29 22:56:25,442 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-4 -
Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - New incarnation of existing member [Member(address = akka.tcp://application@172.xx.x.xxx:xxxx, dataCenter = indonesia, status = Up)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
2018-01-29 22:56:25,443 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 -
Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - Marking unreachable node [akka.tcp://application@172.xx.x.xxx:xxxx] as [Down]
After that, the following repeats:
2018-01-29 22:57:41,659 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 -
Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - Sending InitJoinAck message from node [akka.tcp://application@52.xx.xxx.xx:xxxx] to [Actor[akka.tcp://application@172.xx.x.xxx:xxxx/system/cluster/core/daemon/joinSeedNodeProcess-8#-1322646338]]
2018-01-29 22:57:41,827 [INFO ]: a.c.Cluster(akka://application) in application-akka.actor.default-dispatcher-18 -
Cluster Node [akka.tcp://application@52.xx.xxx.xx:xxxx] dc [asia] - New incarnation of existing member [Member(address = akka.tcp://application@172.xx.x.xxx:xxxx, dataCenter = indonesia, status = Down)] is trying to join. Existing will be removed from the cluster and then new member will be allowed to join.
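What I find strange is that what the log says "will" happen never does: the existing member is not removed, and the new member is not allowed to join. I have been googling that message but cannot find an explanation of what I might need to do to actually make it happen.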
【Comments】:
-
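Akka's multi-DC feature lets you have multiple DC-local clusters that are part of one larger cluster, with one leader per DC. That means running a three-node cluster in which each node is its own DC (and therefore the leader of its own sub-cluster) does not make much sense. Not sure whether this is related to your problem, but note that the general recommendation of at least 3 nodes in a cluster also applies to these sub-clusters.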
Tags: akka akka-cluster akka-remoting