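[Question title]: Hadoop timing out trying to write to Cassandra in AWS multi-region configuration
[Posted]: 2014-05-17 04:50:46
[Question]:

I'm running a multi-DC Cassandra cluster (open source, not DSE) in AWS. One DC (us-west-2) is used for analytics and the other (us-east) is the transactional store. I'm using NetworkTopologyStrategy with the EC2 snitch, and my Hadoop configuration uses a consistency level of LOCAL_ONE. Hadoop can read from Cassandra without any problems, but attempting to write produces timeout exceptions.

Running nodetool status shows that the DCs are configured correctly: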

Datacenter: us-west-2
=====================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Owns   Host ID                               Token                                    Rack
UN  x.x.x.x       1.01 GB     9.9%   9e7f4393-7ac9-4559-b3ff-de48be50016f  -9127921345534057723                     2a
UN  x.x.x.x       1001.16 MB  11.4%  d0760383-c3dd-474c-9261-239b71dba3f1  -9221279003374097975                     2b
UN  x.x.x.x       1.05 GB     11.7%  3f09fbf5-0d85-4283-9009-0ec0e29223c0  -9140104347498952504                     2c
Datacenter: us-east
===================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address         Load       Owns   Host ID                               Token                                    Rack
UN  x.x.x.x       1.1 GB     11.3%  5bbd2de4-e1d2-4a17-9f40-034f60b35954  -9061054426204373981                     1b
UN  x.x.x.x       1.15 GB    11.5%  e34c590e-6176-45b2-a8f9-18b4a9a80032  -9216519687724118609                     1c
UN  x.x.x.x       1.18 GB    10.9%  fa0b0a1a-f156-40fc-a267-970d1eb9cddb  -9207673937991303291                     1a
UN  x.x.x.x       1.46 GB    10.7%  b18ae406-c9ec-42b7-a365-b0c6e2fe582f  -9206671929961171506                     1a
UN  x.x.x.x       1.13 GB    11.4%  1ac9c1c5-55ad-4048-b1ba-3b9768933ecc  -9146100851344467112                     1c
UN  x.x.x.x       1.53 GB    11.2%  dad665bb-68d9-4811-b421-f33333261867  -9178920986366339267                     1b

Stack trace using ColumnFamilyOutputFormat:

java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:224)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
    at org.apache.cassandra.hadoop.ColumnFamilyRecordWriter$RangeClient.run(ColumnFamilyRecordWriter.java:215)
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 4 more

...and using CqlOutputFormat:

java.io.IOException: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:123)
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 4 more

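Both traces ultimately lead to AbstractColumnFamilyOutputFormat.createAuthenticatedClient(host, port, conf).

I then opened that source and added some detail to the exception so that it prints the hostname it is connecting to, which produced this trace: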

java.io.IOException: java.lang.Exception: Unable to connect to host [hostname]
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:271)
Caused by: java.lang.Exception: Unable to connect to host [hostname]
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:139)
    at org.apache.cassandra.hadoop.cql3.CqlRecordWriter$RangeClient.run(CqlRecordWriter.java:262)
Caused by: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection timed out
    at org.apache.thrift.transport.TSocket.open(TSocket.java:185)
    at org.apache.thrift.transport.TFramedTransport.open(TFramedTransport.java:81)
    at org.apache.cassandra.thrift.TFramedTransportFactory.openTransport(TFramedTransportFactory.java:41)
    at org.apache.cassandra.hadoop.AbstractColumnFamilyOutputFormat.createAuthenticatedClient(AbstractColumnFamilyOutputFormat.java:124)
    ... 1 more
Caused by: java.net.ConnectException: Connection timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:579)
    at org.apache.thrift.transport.TSocket.open(TSocket.java:180)
    ... 4 more

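The problem is that [hostname] is a machine that is not in the analytics cluster (it is in us-east). Why doesn't the client know this automatically, especially when reads work fine? It appears to be trying every node in the ring without regard to DC.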
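One way to confirm which ring members a Hadoop task node can actually reach is a plain TCP connect probe against the Thrift port, which is the same operation that fails in TSocket.open above. The sketch below is only a diagnostic aid under assumptions: the class name ThriftPortProbe and the 3-second timeout are made up for illustration, and port 9160 is Cassandra's default Thrift rpc_port, which may differ in your cluster.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper, not part of Cassandra or Hadoop.
public class ThriftPortProbe {

    /** Returns true if a TCP connection to host:port succeeds within timeoutMs. */
    public static boolean canConnect(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            // Covers refused connections, connect timeouts, and unresolvable hosts.
            return false;
        }
    }

    public static void main(String[] args) {
        int port = 9160; // assumed: default Thrift rpc_port; adjust if yours differs
        for (String host : args) {
            boolean ok = canConnect(host, port, 3000);
            System.out.println(host + ":" + port + " reachable=" + ok);
        }
    }
}
```

Run it from a Hadoop worker with the node addresses as arguments. Nodes in the remote DC that report reachable=false are the ones the writer would time out on.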

For the record, writes fail with CqlOutputFormat, with ColumnFamilyOutputFormat, and through Pig with both CqlStorage and CassandraStorage.

[Comments]:

    Tags: hadoop amazon-web-services amazon-ec2 cassandra


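    [Answer 1]:

    I would try setting write_request_timeout_in_ms in cassandra.yaml to a very high value and see whether that helps. There may be a problem with a node itself, where it is not responding but still shows as up. If it still times out, restart the service on the node you suspect is causing the problem.

    [Comments]:

    • The write is not timing out; the connection is timing out when the client tries to reach a node outside the DC, a node it should not be trying to talk to at all.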
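    [Answer 2]:

    The problem came down to two things:

    1. For a multi-region EC2 setup, Cassandra requires broadcast_address to be set to the public IP and listen_address to be set to the internal IP. In most cases you would want rpc_address to be the internal IP as well, but that can break Cassandra's Hadoop client, which determines the endpoints to talk to based on the broadcast address.

    2. Cassandra's Hadoop client (specifically RingCache) does not respect data centers when discovering nodes, and it tries to discover every node in the ring, including non-local ones. It does respect the consistency level for the actual writes, but in our case it never got that far because of #1.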
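The idea behind point 2, restricting the client to endpoints in its own data center before opening connections, can be sketched as below. This is an illustration only: it is not the CASSANDRA-7252 patch and not RingCache's actual code. The class LocalDcEndpointFilter, the method localEndpoints, and the endpoint-to-DC map are all hypothetical, and the addresses are documentation placeholders.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of DC-aware endpoint selection; not Cassandra code.
public class LocalDcEndpointFilter {

    /**
     * Keeps only the endpoints whose data center equals localDc.
     * endpointToDc maps an endpoint address to its data center name,
     * for example as listed by nodetool status.
     */
    public static List<String> localEndpoints(Map<String, String> endpointToDc, String localDc) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> entry : endpointToDc.entrySet()) {
            if (localDc.equals(entry.getValue())) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Placeholder addresses from the RFC 5737 documentation ranges.
        Map<String, String> ring = new LinkedHashMap<>();
        ring.put("192.0.2.10", "us-west-2");
        ring.put("192.0.2.11", "us-west-2");
        ring.put("198.51.100.20", "us-east");

        // Only the analytics DC nodes remain as connection candidates.
        System.out.println(localEndpoints(ring, "us-west-2"));
        // prints [192.0.2.10, 192.0.2.11]
    }
}
```

With a filter like this applied before connecting, the writer would never attempt to open a Thrift connection to a us-east node, so a LOCAL_ONE write could proceed using local replicas only.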

    I filed a ticket and submitted a patch that addresses these issues:

    https://issues.apache.org/jira/browse/CASSANDRA-7252
