【Title】: distcp fails when copying from s3 to hdfs
【Posted】: 2018-05-27 03:05:48
【Question】:

I created a cluster (Spark on Amazon EMR) and tried to run the following on the command line.

CLI:

hadoop distcp s3a://bucket/file1 /data

Exception:

org.apache.hadoop.yarn.exceptions.InvalidAuxServiceException: The auxService:mapreduce_shuffle does not exist
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateExceptionImpl(SerializedExceptionPBImpl.java:171)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.instantiateException(SerializedExceptionPBImpl.java:182)
        at org.apache.hadoop.yarn.api.records.impl.pb.SerializedExceptionPBImpl.deSerialize(SerializedExceptionPBImpl.java:106)
        at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$Container.launch(ContainerLauncherImpl.java:162)
        at org.apache.hadoop.mapreduce.v2.app.launcher.ContainerLauncherImpl$EventProcessor.run(ContainerLauncherImpl.java:408)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

【Comments】:

    Tags: amazon-web-services hadoop amazon-s3 amazon-emr


    【Solution 1】:

    Check the aux-services properties in /etc/hadoop/conf/yarn-site.xml:

     <property>
      <name>yarn.nodemanager.aux-services</name> 
      <value>mapreduce_shuffle,spark_shuffle</value>
     </property>
    
     <property>
       <name>yarn.nodemanager.aux-services.spark_shuffle.class</name> 
       <value>org.apache.spark.network.yarn.YarnShuffleService</value> 
     </property>
    
    <property>
      <name>yarn.nodemanager.aux-services.mapreduce_shuffle.class</name> 
      <value>org.apache.hadoop.mapred.ShuffleHandler</value>
    </property>
    

    If mapreduce_shuffle is not there, add the properties above and restart the YARN NodeManager:

    sudo stop hadoop-yarn-nodemanager
    sudo start hadoop-yarn-nodemanager
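    Before restarting, a quick way to confirm whether mapreduce_shuffle is already configured is to grep the file. A minimal sketch, assuming the standard EMR config path /etc/hadoop/conf/yarn-site.xml (the helper name `check_shuffle` is made up for illustration):

```shell
# Hedged sketch: report whether mapreduce_shuffle appears in a yarn-site.xml.
check_shuffle() {
  local conf="$1"
  # grep -q exits 0 on a match; 2>/dev/null hides errors if the file is absent
  if grep -q 'mapreduce_shuffle' "$conf" 2>/dev/null; then
    echo "mapreduce_shuffle configured"
  else
    echo "mapreduce_shuffle missing - add the property and restart the NodeManager"
  fi
}

# Example (path assumed for an EMR node):
# check_shuffle /etc/hadoop/conf/yarn-site.xml
```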
    

    I would also suggest the s3-dist-cp utility, since it is already available on EMR clusters:

    s3-dist-cp --src s3://my-tables/incoming/hourly_table --dest /data/hdfslocation/path
    

    https://aws.amazon.com/blogs/big-data/seven-tips-for-using-s3distcp-on-amazon-emr-to-move-data-efficiently-between-hdfs-and-amazon-s3/
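    For reference, s3-dist-cp also accepts a source filter via --srcPattern (documented in the AWS S3DistCp docs). A hedged sketch, reusing the paths from the command above; the DRY_RUN switch is my own addition so the command can be previewed on a machine without s3-dist-cp installed:

```shell
# Hedged sketch: copy only .gz objects with s3-dist-cp (--srcPattern per AWS docs).
# DRY_RUN defaults to "echo" so this only previews the command; set DRY_RUN= to run it.
DRY_RUN=${DRY_RUN:-echo}
$DRY_RUN s3-dist-cp \
  --src s3://my-tables/incoming/hourly_table \
  --dest hdfs:///data/hdfslocation/path \
  --srcPattern '.*\.gz'
```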

    【Comments】:

    • Thanks, that worked. Using s3-dist-cp instead of distcp takes care of all the S3 details.