【Posted】: 2016-07-04 09:10:03
【Question】:
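I am trying to get an HBase cluster working: two masters and two region servers. My problem is that the region servers complain when they try to tell the master that they are up: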
2016-07-01 16:10:21,879 WARN [regionserver/nbd-hadoop-data1/153.77.130.27:60020] **regionserver.HRegionServer: reportForDuty failed; sleeping and then retrying.**
2016-07-01 16:10:24,879 INFO [regionserver/nbd-hadoop-data1/153.77.130.27:60020] **regionserver.HRegionServer: reportForDuty to master=0.0.0.0,60000,1467381897236 with port=60020, startcode=1467382178755**
2016-07-01 16:10:24,879 DEBUG [regionserver/nbd-hadoop-data1/153.77.130.27:60020] ipc.AbstractRpcClient: Use SIMPLE authentication for service RegionServerStatusService, sasl=false
2016-07-01 16:10:24,880 DEBUG [regionserver/nbd-hadoop-data1/153.77.130.27:60020] ipc.AbstractRpcClient: Connecting to /0.0.0.0:60000
2016-07-01 16:10:24,880 WARN [regionserver/nbd-hadoop-data1/153.77.130.27:60020] regionserver.HRegionServer: error telling master we are up
com.google.protobuf.ServiceException: java.net.ConnectException: Connection refused
at org.apache.hadoop.hbase.ipc.AbstractRpcClient.callBlockingMethod(AbstractRpcClient.java:223)
at org.apache.hadoop.hbase.ipc.AbstractRpcClient$BlockingRpcChannelImplementation.callBlockingMethod(AbstractRpcClient.java:287)
at org.apache.hadoop.hbase.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$BlockingStub.regionServerStartup(RegionServerStatusProtos.java:8982)
at org.apache.hadoop.hbase.regionserver.HRegionServer.reportForDuty(HRegionServer.java:2270)
at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:894)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
The strange thing is that it opens the port on 0.0.0.0:
The master is waiting for the region servers:
2016-07-01 16:08:43,495 INFO [0.0.0.0:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 220970 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
However, when I stop a regionserver, the master (via ZooKeeper) notices that the regionserver has gone offline:
2016-07-01 16:55:25,124 WARN [main-EventThread] zookeeper.RegionServerTracker: nbd-hadoop-data1,60020,1467384161702 is not online or isn't known to the master.The latter could be caused by a DNS misconfiguration.
2016-07-01 16:55:26,509 INFO [0.0.0.0:60000.activeMasterManager] master.ServerManager: Waiting for region servers count to settle; currently checked in 0, slept for 3023984 ms, expecting minimum of 1, maximum of 2147483647, timeout of 4500 ms, interval of 1500 ms.
My HBase cluster layout is:
153.77.130.29 nbd-hadoop-nn1 - zookeeper, hdfs, hbase master
153.77.130.30 nbd-hadoop-nn2 - zookeeper, hdfs, hbase master
153.77.130.22 nbd-service - zookeeper
153.77.130.27 nbd-hadoop-data1 - hbase regionserver 1
153.77.130.28 nbd-hadoop-data2 - hbase regionserver 2
All machines have **/etc/hosts** set up as follows:
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 nbd-hadoop-nn1
153.77.130.22 nbd-service
153.77.130.29 nbd-hadoop-nn1
153.77.130.30 nbd-hadoop-nn2
153.77.130.27 nbd-hadoop-data1
153.77.130.28 nbd-hadoop-data2
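One thing worth checking in a hosts file like the one above is whether any hostname is mapped to both a loopback address and a routable address. The following is a minimal sketch, not an HBase tool: the function name `loopback_conflicts` is made up for illustration, and the file content is pasted in as a string rather than read from disk. On many systems the resolver returns the first matching entry, so such a name may resolve to 127.0.0.1; that behaviour is an assumption about the local resolver, not something the logs confirm.

```python
import ipaddress
from collections import defaultdict

# Content of /etc/hosts as posted in the question.
HOSTS_TEXT = """\
127.0.0.1 localhost localhost.localdomain localhost4 localhost4.localdomain4
::1 localhost localhost.localdomain localhost6 localhost6.localdomain6
127.0.0.1 nbd-hadoop-nn1
153.77.130.22 nbd-service
153.77.130.29 nbd-hadoop-nn1
153.77.130.30 nbd-hadoop-nn2
153.77.130.27 nbd-hadoop-data1
153.77.130.28 nbd-hadoop-data2
"""


def loopback_conflicts(hosts_text):
    """Return {hostname: [addresses]} for names that are mapped to both
    a loopback address and a non-loopback address."""
    addresses = defaultdict(list)
    for line in hosts_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        address, *names = line.split()
        for name in names:
            addresses[name].append(address)

    conflicts = {}
    for name, address_list in addresses.items():
        kinds = {ipaddress.ip_address(a).is_loopback for a in address_list}
        if kinds == {True, False}:  # both loopback and routable entries
            conflicts[name] = address_list
    return conflicts


if __name__ == "__main__":
    print(loopback_conflicts(HOSTS_TEXT))
```

For the file shown above this reports `nbd-hadoop-nn1`, which appears under both 127.0.0.1 and 153.77.130.29; `localhost` is not reported because both of its addresses are loopback.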
Master hbase-site.xml:
<property>
<name>hbase.master.port</name>
<value>60000</value>
</property>
<property>
<name>hbase.regionserver.global.memstore.lowerLimit</name>
<value>0.38</value>
</property>
<property>
<name>hbase.regionserver.global.memstore.upperLimit</name>
<value>0.4</value>
</property>
<property>
<name>hbase.regionserver.handler.count</name>
<value>60</value>
</property>
<property>
<name>hbase.regionserver.info.port</name>
<value>60030</value>
</property>
<property>
<name>hbase.regionserver.port</name>
<value>60020</value>
</property>
Region server hbase-site.xml:
<property>
<name>hbase.master.info.port</name>
<value>60010</value>
</property>
<property>
<name>hbase.master.port</name>
<value>60000</value>
</property>
<property>
<name>hbase.regionserver.global.memstore.lowerLimit</name>
<value>0.38</value>
</property>
<property>
<name>hbase.regionserver.global.memstore.upperLimit</name>
<value>0.4</value>
</property>
<property>
<name>hbase.regionserver.handler.count</name>
<value>60</value>
</property>
<property>
<name>hbase.regionserver.port</name>
<value>60020</value>
</property>
<property>
<name>hbase.regionserver.info.port</name>
<value>60030</value>
</property>
netstat -ntlp on the master nbd-hadoop-nn1 (port 60000 is correctly open on :::):
tcp 0 0 :::60000 :::* LISTEN 30839/java
netstat -ntlp on the region server nbd-hadoop-data1 shows that port 60020 is bound to localhost.
I think this is the root of the problem:
tcp 0 0 ::ffff:127.0.0.1:60020 :::* LISTEN 22858/java
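The difference between the two netstat lines can be made explicit with a small sketch that classifies the local address of a listening socket as wildcard, loopback, or a specific interface. The helper name `classify_listener` is hypothetical, and the parsing assumes the column layout shown in the netstat output in this post (local address in the fourth column).

```python
import ipaddress


def classify_listener(netstat_line):
    """Return (port, kind) for one `netstat -ntlp` line, where kind is
    'wildcard', 'loopback', or 'specific'."""
    local = netstat_line.split()[3]      # e.g. "::ffff:127.0.0.1:60020"
    host, port = local.rsplit(":", 1)    # the port follows the last colon
    ip = ipaddress.ip_address(host)

    # "::ffff:a.b.c.d" is an IPv4-mapped IPv6 address; unwrap it so that
    # the loopback check is applied to the embedded IPv4 address.
    mapped = getattr(ip, "ipv4_mapped", None)
    if mapped is not None:
        ip = mapped

    if ip.is_unspecified:
        kind = "wildcard"    # reachable on every interface
    elif ip.is_loopback:
        kind = "loopback"    # reachable only from the same machine
    else:
        kind = "specific"
    return int(port), kind


if __name__ == "__main__":
    print(classify_listener("tcp 0 0 :::60000 :::* LISTEN 30839/java"))
    print(classify_listener(
        "tcp 0 0 ::ffff:127.0.0.1:60020 :::* LISTEN 22858/java"))
```

With the two lines from this post, the master's port 60000 is classified as a wildcard listener and the region server's port 60020 as a loopback listener, which is consistent with remote connections to 60020 being refused.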
I cannot telnet from the master to port 60020 on the region server (telnet nbd-hadoop-data1 60020): connection refused.
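This is probably the root of the problem, but I do not know how to reconfigure it. I could not find any explanation of why the region server opens its port on ::ffff:127.0.0.1:60020.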
Thanks a lot for any hints. If you need other logs or configuration files, I will provide them.
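For anyone reproducing the telnet check from a script, a minimal sketch using only the Python standard library could look like this. The function name `can_connect` and the 3-second timeout are arbitrary choices for illustration.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds,
    False if it is refused, times out, or the name cannot be resolved."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError, timeouts, DNS errors
        return False
```

Run from the master, `can_connect("nbd-hadoop-data1", 60020)` corresponds to the telnet test described above, and `can_connect("nbd-hadoop-nn1", 60000)` run from a region server checks the opposite direction.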
【Comments】: