【Question title】: Cassandra 4.0 nodes fail to restart
【Posted】: 2021-10-18 22:02:06
【Question description】:

After a restart, the node fails with the following error:

INFO  [Messaging-EventLoop-3-12] 2021-08-17 11:09:07,845 InboundConnectionInitiator.java:464 - /X.X.46.68:7000(/X.X.46.68:56090)->/X.X.X.77:7000-URGENT_MESSAGES-cdaa1ab9 messaging connection established, version = 12, framing = LZ4, encryption = unencrypted
INFO  [Messaging-EventLoop-3-1] 2021-08-17 11:09:07,867 InboundConnectionInitiator.java:464 - /X.X.86.42:7000(/X.X.86.42:52188)->/X.X.X.77:7000-URGENT_MESSAGES-9c2d74c5 messaging connection established, version = 12, framing = CRC, encryption = unencrypted
ERROR [main] 2021-08-17 11:09:08,523 CassandraDaemon.java:909 - Exception encountered during startup
java.lang.RuntimeException: Unable to gossip with any peers
    at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1801)
    at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:648)
    at org.apache.cassandra.service.StorageService.prepareToJoin(StorageService.java:934)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:784)
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:729)
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:420)
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
INFO  [StorageServiceShutdownHook] 2021-08-17 11:09:08,530 HintsService.java:220 - Paused hints dispatch
WARN  [StorageServiceShutdownHook] 2021-08-17 11:09:08,531 Gossiper.java:1989 - No local state, state is in silent shutdown, or node hasn't joined, not announcing shutdown
INFO  [StorageServiceShutdownHook] 2021-08-17 11:09:08,531 MessagingService.java:441 - Waiting for messaging service to quiesce
INFO  [Messaging-EventLoop-3-7] 2021-08-17 11:09:08,534 OutboundConnection.java:1150 - /X.X.X.77:7000(/X.X.X.77:52766)->/X.X.X.76:7000-SMALL_MESSAGES-27a82ea6 successfully connected, version = 12, framing = CRC, encryption = unencrypted
INFO  [Messaging-EventLoop-3-8] 2021-08-17 11:09:08,534 OutboundConnection.java:1150 - /X.X.X.77:7000(/X.X.X.77:52768)->/X.X.X.76:7000-LARGE_MESSAGES-762ad3e9 successfully connected, version = 12, framing = CRC, encryption = unencrypted
INFO  [Messaging-EventLoop-3-1] 2021-08-17 11:09:08,535 OutboundConnection.java:1150 - /X.X.X.77:7000(/X.X.X.77:35938)->/X.X.X.40:7000-SMALL_MESSAGES-97e069da successfully connected, version = 12, framing = CRC, encryption = unencrypted

While the node is starting up, the seeds and the other nodes show the following in their debug logs:

   ERROR [Messaging-EventLoop-3-2] 2021-08-17 11:09:07,535 OutboundConnection.java:1058 - /X.X.X.116:7000->/X.X.X.77:7000-URGENT_MESSAGES-ef747971 channel in potentially inconsistent state after error; closing
java.lang.IllegalArgumentException: Maximum payload size is 128KiB
    at org.apache.cassandra.net.FrameEncoderCrc.encode(FrameEncoderCrc.java:73)
    at org.apache.cassandra.net.FrameEncoder.write(FrameEncoder.java:134)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWrite0(AbstractChannelHandlerContext.java:717)
    at io.netty.channel.AbstractChannelHandlerContext.invokeWriteAndFlush(AbstractChannelHandlerContext.java:764)
    at io.netty.channel.AbstractChannelHandlerContext.write(AbstractChannelHandlerContext.java:790)
    at io.netty.channel.AbstractChannelHandlerContext.writeAndFlush(AbstractChannelHandlerContext.java:758)
    at io.netty.channel.DefaultChannelPipeline.writeAndFlush(DefaultChannelPipeline.java:1020)
    at io.netty.channel.AbstractChannel.writeAndFlush(AbstractChannel.java:299)
    at org.apache.cassandra.net.AsyncChannelPromise.writeAndFlush(AsyncChannelPromise.java:77)
    at org.apache.cassandra.net.OutboundConnection$EventLoopDelivery.doRun(OutboundConnection.java:837)
    at org.apache.cassandra.net.OutboundConnection$Delivery.run(OutboundConnection.java:687)
    at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
    at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:384)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
INFO  [Messaging-EventLoop-3-10] 2021-08-17 11:09:08,540 InboundConnectionInitiator.java:464 - /X.X.X.77:7000(/X.X.X.77:36684)->/X.X.X.116:7000-SMALL_MESSAGES-8ab4a5dc messaging connection established, version = 12, framing = CRC, encryption = unencrypted
INFO  [Messaging-EventLoop-3-11] 2021-08-17 11:09:08,540 InboundConnectionInitiator.java:464 - /X.X.X.77:7000(/X.X.X.77:36686)->/X.X.X.116:7000-LARGE_MESSAGES-7f053d49 messaging connection established, version = 12, framing = CRC, encryption = unencrypted
INFO  [Messaging-EventLoop-3-2] 2021-08-17 11:09:15,680 NoSpamLogger.java:92 - /X.X.X.116:7000->/X.X.X.77:7000-URGENT_MESSAGES-[no-channel] failed to connect
io.netty.channel.AbstractChannel$AnnotatedConnectException: finishConnect(..) failed: Connection refused: /X.X.X.77:7000
Caused by: java.net.ConnectException: finishConnect(..) failed: Connection refused
    at io.netty.channel.unix.Errors.throwConnectException(Errors.java:124)
    at io.netty.channel.unix.Socket.finishConnect(Socket.java:251)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.doFinishConnect(AbstractEpollChannel.java:673)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.finishConnect(AbstractEpollChannel.java:650)
    at io.netty.channel.epoll.AbstractEpollChannel$AbstractEpollUnsafe.epollOutReady(AbstractEpollChannel.java:530)
    at io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:470)
    at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
    at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
    at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
    at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
    at java.lang.Thread.run(Thread.java:748)
INFO  [Messaging-EventLoop-3-2] 2021-08-17 11:09:45,714 NoSpamLogger.java:92 - /X.X.X.116:7000->/X.X.X.77:7000-URGENT_MESSAGES-[no-channel] failed to connect

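This started happening after upgrading from 3.10 to 4.0. It is not a firewall problem or a misconfiguration, because the configuration is the same as before.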

【Comments】:

    Tags: cassandra


    【Answer 1】:

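    None of these are errors, so they are not the reason your node fails to restart.

    The first entry, from gossip, is logged at DEBUG, so it is not a problem. The second entry, from messaging, is logged at INFO level, so it is purely informational and needs no attention.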

    You need to look at system.log and note the last 1 or 2 ERROR entries, since those are the ones that explain why the node fails to restart. Cheers!
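As an illustration of pulling the last ERROR entries out of a log, here is a minimal Python sketch. The function name `last_error_entries` and the regular expression are assumptions made for this example, not part of Cassandra; the pattern is based only on the line layout visible in the logs quoted in this thread (level, then `[thread]`), and the sample lines are copied from the question.

```python
import re
from typing import List

# A new log entry starts with a level followed by "[thread-name]".
ENTRY_START = re.compile(r"^\s*(TRACE|DEBUG|INFO|WARN|ERROR)\s+\[")


def last_error_entries(log_text: str, count: int = 2) -> List[str]:
    """Return the last `count` ERROR entries, each with its stack trace lines."""
    entries: List[List[str]] = []
    for line in log_text.splitlines():
        if ENTRY_START.match(line):
            entries.append([line])
        elif entries:
            # Continuation line: exception message or an "at ..." frame.
            entries[-1].append(line)
    errors = ["\n".join(e) for e in entries if e[0].lstrip().startswith("ERROR")]
    return errors[-count:] if count > 0 else []


# Sample lines copied from the log in the question.
sample = "\n".join([
    "ERROR [main] 2021-08-17 11:09:08,523 CassandraDaemon.java:909 - Exception encountered during startup",
    "java.lang.RuntimeException: Unable to gossip with any peers",
    "    at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1801)",
    "INFO  [StorageServiceShutdownHook] 2021-08-17 11:09:08,530 HintsService.java:220 - Paused hints dispatch",
])

for entry in last_error_entries(sample):
    print(entry)
    print("---")
```

To run it against a real node, read the text of your own system.log into `log_text`; where that file lives depends on how Cassandra was installed.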

    [EDIT] This error indicates a problem contacting the seed nodes:

    ERROR [main] 2021-08-17 11:09:08,523 CassandraDaemon.java:909 - Exception encountered during startup
    java.lang.RuntimeException: Unable to gossip with any peers
        at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1801)
        at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:648)
        ...
    

    In Cassandra 4.0, nodes are now identified by the combination of their IP + port (CASSANDRA-7544), so make sure you have configured the seeds list accordingly. For example:

    seed_provider:
        - class_name: org.apache.cassandra.locator.SimpleSeedProvider
          parameters:
              - seeds: "10.1.2.3:7000,10.1.2.4:7000,10.1.2.5:7000"
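A quick way to spot seed entries written without a port is sketched below in Python. The helper `seeds_missing_port` is hypothetical and only inspects the string; it assumes IPv4 addresses or hostnames (IPv6 literals are not handled) and says nothing about how Cassandra itself treats an entry without a port.

```python
from typing import List


def seeds_missing_port(seeds: str) -> List[str]:
    """Return seed entries (IPv4 or hostname) that lack an explicit ':port' suffix."""
    missing = []
    for entry in seeds.split(","):
        entry = entry.strip()
        if not entry:
            continue  # ignore empty items such as a trailing comma
        host, sep, port = entry.rpartition(":")
        if not (sep and host and port.isdigit()):
            missing.append(entry)
    return missing


# The second entry deliberately has no port.
print(seeds_missing_port("10.1.2.3:7000,10.1.2.4,10.1.2.5:7000"))  # ['10.1.2.4']
```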
    

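    It is very important that at least one seed node is up and fully operational. For this reason, it is recommended to upgrade the seed nodes first.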

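    Also make sure there is network connectivity between the nodes, using Linux utilities such as nc or telnet. Check that traffic between nodes on port 7000 is not blocked by a firewall (e.g. iptables, firewalld). It is quite common for a firewall to get enabled unexpectedly after a server reboot.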
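If nc or telnet are not available, the same reachability check can be approximated with Python's standard `socket` module. This is a minimal sketch: `can_connect` is a name chosen for this example, the peer addresses are the placeholder IPs from the seeds example above, and a successful TCP connect only shows that the port is open, not that gossip itself works.

```python
import socket


def can_connect(host: str, port: int = 7000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False


if __name__ == "__main__":
    # Placeholder peers; replace with the addresses of your own nodes.
    for peer in ["10.1.2.3", "10.1.2.4", "10.1.2.5"]:
        state = "reachable" if can_connect(peer, timeout=1.0) else "NOT reachable"
        print(f"{peer}:7000 {state}")
```

Run it from the node that fails to start and again from a seed node, since a firewall may block traffic in one direction only.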

    [UPDATE] Check that the clocks on the servers are in sync. If the drift is too large, the nodes will not be able to gossip. Cheers!

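    【Comments】:

    • You are right; the cause of the failure is: ERROR [main] 2021-08-17 02:37:51,348 CassandraDaemon.java:909 - Exception encountered during startup java.lang.RuntimeException: Unable to gossip with any peers at org.apache.cassandra.gms.Gossiper.doShadowRound(Gossiper.java:1801) at org.apache.cassandra.service.StorageService.checkForEndpointCollision(StorageService.java:648) My assumption is that gossip from all the other nodes fails with: ERROR ... channel in potentially inconsistent state after error; closing java.lang.IllegalArgumentException: Maximum payload size is 128KiB
    • Could you edit your original question and post the full error + full stack trace there? Cheers!
    • I have updated my answer based on the new information. Cheers!
    • Unfortunately, the problem persists after adding the port number to the seeds (…46:7000"). A firewall is also unlikely to be the cause, since gossip data is being exchanged between the nodes.
    • The log entries you posted say otherwise. The nodes cannot gossip, which is why you see failed to connect and Connection refused for communication on port 7000. Cheers!
    【Answer 2】: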

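    The following error message in the question occurs because the gossip messages sent when a node (re)starts can exceed a hard size limit in large clusters.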

     ERROR [Messaging-EventLoop-3-2] 2021-08-17 11:09:07,535 OutboundConnection.java:1058 - /X.X.X.116:7000->/X.X.X.77:7000-URGENT_MESSAGES-ef747971 channel in potentially inconsistent state after error; closing
    java.lang.IllegalArgumentException: Maximum payload size is 128KiB 
    

    This is a bug that has existed since 4.0-alpha1 and was fixed in 4.0.1. See CASSANDRA-16877.

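    In addition, if you see a log message like the one below on one of the seed nodes, it is caused by clock drift between the nodes, as Erick mentioned in his updated answer.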

    INFO  [ScheduledTasks:1] 2021-09-10 11:14:26,567 MessagingMetrics.java:206 - GOSSIP_DIGEST_SYN messages were dropped in last 5000 ms: 0 internal and 1 cross node. Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 15137813 ms
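To get a feel for the size of the number in that message, the short Python sketch below extracts the mean cross-node dropped latency and converts it to hours. The function name and the regular expression are assumptions based only on the single log line quoted above; treating the value as a sign of clock drift is this answer's interpretation, not something the code verifies.

```python
import re
from typing import Optional

# Matches the cross-node figure in the MessagingMetrics line quoted above.
CROSS_NODE = re.compile(r"Mean cross-node dropped latency: (\d+) ms")


def cross_node_latency_hours(log_line: str) -> Optional[float]:
    """Return the mean cross-node dropped latency in hours, or None if absent."""
    match = CROSS_NODE.search(log_line)
    if match is None:
        return None
    return int(match.group(1)) / 1000 / 3600  # ms -> s -> h


line = (
    "INFO  [ScheduledTasks:1] 2021-09-10 11:14:26,567 MessagingMetrics.java:206 - "
    "GOSSIP_DIGEST_SYN messages were dropped in last 5000 ms: 0 internal and 1 cross node. "
    "Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 15137813 ms"
)
hours = cross_node_latency_hours(line)
print(f"{hours:.1f} hours")  # 4.2 hours
```

A "latency" of roughly four hours on a message that was dropped within a 5000 ms window is not plausible as network delay, which is consistent with the clocks on the two servers disagreeing.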
    
