[Question Title]: Datanode restarts on doing hadoop fs -put for huge data (30 GB)
[Posted]: 2012-11-06 13:43:31
[Question]:

I have a Hadoop cluster with 3 nodes: 1 master and 2 slaves, each with 24 GB of RAM. When I run

hadoop fs -put 

to transfer data from the local file system to HDFS, part of the data gets transferred, and then I get this exception:

12/11/06 19:01:39 WARN hdfs.DFSClient: DFSOutputStream ResponseProcessor exception      for block blk_-2646313249080465541_1002java.net.SocketTimeoutException: 603000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/172.30.30.210:51735 remote=/172.30.30.211:50010]
    at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
    at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
    at java.io.DataInputStream.readFully(DataInputStream.java:178)
    at java.io.DataInputStream.readLong(DataInputStream.java:399)
    at org.apache.hadoop.hdfs.protocol.DataTransferProtocol$PipelineAck.readFields(DataTransferProtocol.java:125)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$ResponseProcessor.run(DFSClient.java:3284)

12/11/06 19:01:39 WARN hdfs.DFSClient: Error Recovery for block blk_-2646313249080465541_1002 bad datanode[0] 172.30.30.211:50010
put: All datanodes 172.30.30.211:50010 are bad. Aborting...
12/11/06 19:01:39 ERROR hdfs.DFSClient: Exception closing file /user/root/input/wiki.xml-p000185003p000189874 : java.io.IOException: All datanodes 172.30.30.211:50010 are bad. Aborting...
java.io.IOException: All datanodes 172.30.30.211:50010 are bad. Aborting...
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.processDatanodeError(DFSClient.java:3414)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream.access$2000(DFSClient.java:2906)
    at org.apache.hadoop.hdfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:3110)

I was transferring 30 GB of data; only 22 GB got transferred before this exception occurred, and both datanode machines restarted. Could there be a problem with a buffer somewhere? What I mean is: the datanode receives the data over a socket, and maybe the datanode's buffer is not big enough to hold that much data, which causes this exception.
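For what it's worth, the 603000 ms in the SocketTimeoutException is the client-side pipeline read timeout, which in Hadoop of this era appears to derive from dfs.socket.timeout plus a small per-datanode extension. A hedged sketch of raising those timeouts in hdfs-site.xml follows; the values are purely illustrative, and note this only helps if a datanode is merely slow, not if it has actually died:

```xml
<!-- hdfs-site.xml: illustrative values only. Property names are the
     Hadoop 1.x-era ones matching this stack trace. -->
<property>
  <name>dfs.socket.timeout</name>
  <value>1200000</value> <!-- pipeline read timeout, in ms -->
</property>
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>1200000</value> <!-- pipeline write timeout, in ms -->
</property>
```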

These are the log files created by HDFS:

2012-11-06 18:54:10,074 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2012-11-06 18:54:10,239 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink ganglia started
2012-11-06 18:54:10,349 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2012-11-06 18:54:10,350 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2012-11-06 18:54:10,350 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: DataNode metrics system started
2012-11-06 18:54:10,644 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2012-11-06 18:54:11,387 WARN org.apache.hadoop.hdfs.server.common.Storage: Ignoring storage directory /data/hadoop/data due to exception: java.io.FileNotFoundException: /data/hadoop/data/in_use.lock (Permission denied)
at java.io.RandomAccessFile.open(Native Method)
at java.io.RandomAccessFile.<init>(RandomAccessFile.java:216)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.tryLock(Storage.java:703)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.lock(Storage.java:684)
at org.apache.hadoop.hdfs.server.common.Storage$StorageDirectory.analyzeStorage(Storage.java:542)
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:112)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:408)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:306)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1623)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1562)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1580)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1707)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1724)

2012-11-06 18:54:11,551 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.io.IOException: All specified directories are not accessible or do not exist.
at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:143)
at org.apache.hadoop.hdfs.server.datanode.DataNode.startDataNode(DataNode.java:408)
at org.apache.hadoop.hdfs.server.datanode.DataNode.<init>(DataNode.java:306)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1623)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1562)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1580)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1707)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1724)

2012-11-06 18:54:11,552 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:     

These were created by MapReduce:

2012-11-06 18:54:29,395 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2012-11-06 18:54:29,416 INFO org.apache.hadoop.metrics2.impl.MetricsSinkAdapter: Sink ganglia started
2012-11-06 18:54:29,449 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source MetricsSystem,sub=Stats registered.
2012-11-06 18:54:29,450 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled snapshot period at 10 second(s).
2012-11-06 18:54:29,450 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: TaskTracker metrics system started
2012-11-06 18:54:29,792 INFO org.apache.hadoop.metrics2.impl.MetricsSourceAdapter: MBean for source ugi registered.
2012-11-06 18:54:30,002 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2012-11-06 18:54:30,056 INFO org.apache.hadoop.http.HttpServer: Added global filtersafety (class=org.apache.hadoop.http.HttpServer$QuotingInputFilter)
2012-11-06 18:54:30,103 INFO org.apache.hadoop.mapred.TaskLogsTruncater: Initializing logs' truncater with mapRetainSize=-1 and reduceRetainSize=-1
2012-11-06 18:54:30,107 INFO org.apache.hadoop.mapred.TaskTracker: Starting tasktracker with owner as mapred
2012-11-06 18:54:30,108 INFO org.apache.hadoop.mapred.TaskTracker: Good mapred local directories are: /data/hadoop/mapred
2012-11-06 18:54:30,145 ERROR org.apache.hadoop.mapred.TaskTracker: Can not start task tracker because java.io.IOException: Cannot rename /data/hadoop/mapred/ttprivate to /data/hadoop/mapred/toBeDeleted/2012-11-06_18-54-30.117_0
at org.apache.hadoop.util.MRAsyncDiskService.moveAndDeleteRelativePath(MRAsyncDiskService.java:260)
at org.apache.hadoop.util.MRAsyncDiskService.cleanupAllVolumes(MRAsyncDiskService.java:315)
at org.apache.hadoop.mapred.TaskTracker.initialize(TaskTracker.java:736)
at org.apache.hadoop.mapred.TaskTracker.<init>(TaskTracker.java:1515)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:3814)
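Both logs point at the same class of problem: "Permission denied" on /data/hadoop/data/in_use.lock, and a failed rename under /data/hadoop/mapred, i.e. after the machines rebooted the daemon users could no longer write their storage directories. A minimal check, assuming the daemons run as users "hdfs" and "mapred" (those names are an assumption; match your installation):

```shell
# Returns 0 and prints OK if the current user can write <dir>; run it as the
# daemon user to confirm the datanode/tasktracker still owns its directory.
check_writable() {
  dir=$1
  if [ -d "$dir" ] && [ -w "$dir" ]; then
    echo "OK: $dir writable by $(id -un)"
    return 0
  fi
  echo "FAIL: $dir not writable by $(id -un)"
  return 1
}

# Typical usage on the nodes from the logs (paths from the stack traces above):
#   sudo -u hdfs   sh -c '. ./check.sh; check_writable /data/hadoop/data'
#   sudo -u mapred sh -c '. ./check.sh; check_writable /data/hadoop/mapred'
# If a check fails, restore ownership to the daemon user, e.g.:
#   sudo chown -R hdfs:hadoop   /data/hadoop/data
#   sudo chown -R mapred:hadoop /data/hadoop/mapred
```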

[Comments]:

  • Does this have anything to do with dfs.stream-buffer-size?
  • What errors do you see in the datanode logs? Why did they die?
  • The datanodes didn't die. All the machines running the datanodes restarted.
  • Judging from the (since deleted) logs, it looks like you should check that the dfs.data.dir directories exist and are writable by the HDFS user. (Please add the logs to your question.)
  • I checked the data directories; they exist, and as I mentioned in the question, some of the data (20 GB out of 30 GB) does get written. So the directories and the correct write permissions are there.

Tags: hadoop hdfs


[Solution 1]:

Your problem is related to moving large amounts of data within the cluster. Apache Hadoop has a dedicated tool for this, called distcp. It lets every node in the cluster participate equally in transferring the data into HDFS.

Read more at http://hadoop.apache.org/docs/r0.20.2/distcp.html
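A sketch of what a distcp invocation might look like, with illustrative paths and an assumed namenode address:

```shell
# distcp runs the copy as a MapReduce job, so every node participates
# instead of one client pushing all 30 GB through a single write pipeline.
hadoop distcp \
    hdfs://namenode:8020/user/root/input \
    hdfs://namenode:8020/user/root/input-copy

# For a local->HDFS load, distcp only helps if the source is visible from
# every node, e.g. an NFS mount addressed as file:///shared/wiki-dump;
# otherwise hadoop fs -put from the client remains the usual route.
```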

[Discussion]:

  • I'm getting error java.io.IOException: All datanodes 10.0.0.15:50010 are bad. Aborting... while using Hive (when reading a table). How can I use distcp there?
  • @SagarNikam When datanodes go bad, you should first run a filesystem check (link). This is not the Linux local filesystem check, but an HDFS filesystem check command. It should be able to deal with bad-datanode / corrupt-block problems.
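For reference, the HDFS check mentioned above is `hadoop fsck`; a hedged sketch of typical invocations:

```shell
# Report overall health of the filesystem, plus per-file block placement.
hadoop fsck / -files -blocks -locations

# Once corrupt files are identified, either quarantine or delete them:
hadoop fsck / -move     # moves corrupted files to /lost+found
hadoop fsck / -delete   # removes corrupted files
```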