EMR cluster bootstrap failure (timeout) occurs most of the times I initialize a cluster

【Question title】: EMR cluster bootstrap failure (timeout) occurs most of the times I initialize a cluster
【Posted】: 2016-06-16 06:00:02
【Question】:

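I am writing an application that consists of 4 chained MapReduce jobs running on Amazon EMR. I am using the JobFlow interface to chain the jobs. Each job is contained in its own class and has its own main method. All of them are packaged into a single .jar stored in S3, and the cluster is initialized from a small local application on my laptop, which configures the JobFlowRequest and submits it to EMR. On most of my attempts to start the cluster, it fails with the error message "Terminated with errors On the master instance (i-<cluster number>), bootstrap action 1 timed out executing". I looked for information on this issue, and all I could find is that this exception is thrown if the combined bootstrap time of the cluster exceeds 45 minutes. In my case, however, it happens only about 15 minutes after the request is submitted to EMR, regardless of the requested cluster size, whether it is 4 EC2 instances, 10, or even 20. This makes no sense to me. What am I missing?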

Some technical specs:
- The project is compiled with Java 1.7.79
- The requested EMR image is 4.6.0, which uses Hadoop 2.7.2
- I am using the AWS SDK for Java v. 1.10.64

This is my local main method, which sets up and submits the JobFlowRequest:

import com.amazonaws.AmazonClientException;
import com.amazonaws.auth.AWSCredentials;
import com.amazonaws.auth.profile.ProfileCredentialsProvider;
import com.amazonaws.services.ec2.model.InstanceType;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduce;
import com.amazonaws.services.elasticmapreduce.AmazonElasticMapReduceClient;
import com.amazonaws.services.elasticmapreduce.model.*;

public class ExtractRelatedPairs {

public static void main(String[] args) throws Exception {

    if (args.length != 1) {
        System.err.println("Usage: ExtractRelatedPairs: <k>");
        System.exit(1);
    }
    int outputSize = Integer.parseInt(args[0]);
    if (outputSize < 0) {
        System.err.println("k should be non-negative");
        System.exit(1);
    }

    AWSCredentials credentials = null;
    try {
        credentials = new ProfileCredentialsProvider().getCredentials();
    } catch (Exception e) {
        throw new AmazonClientException(
                "Cannot load the credentials from the credential profiles file. " +
                        "Please make sure that your credentials file is at the correct " +
                        "location (~/.aws/credentials), and is in valid format.",
                e);
    }

    AmazonElasticMapReduce mapReduce = new AmazonElasticMapReduceClient(credentials);

    HadoopJarStepConfig jarStep1 = new HadoopJarStepConfig()
            .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
            .withMainClass("Phase1")
            .withArgs("s3://datasets.elasticmapreduce/ngrams/books/20090715/eng-gb-all/5gram/data/", "hdfs:///output1/");

    StepConfig step1Config = new StepConfig()
            .withName("Phase 1")
            .withHadoopJarStep(jarStep1)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

    HadoopJarStepConfig jarStep2 = new HadoopJarStepConfig()
            .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
            .withMainClass("Phase2")
            .withArgs("hdfs:///output1/", "hdfs:///output2/");

    StepConfig step2Config = new StepConfig()
            .withName("Phase 2")
            .withHadoopJarStep(jarStep2)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

    HadoopJarStepConfig jarStep3 = new HadoopJarStepConfig()
            .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
            .withMainClass("Phase3")
            .withArgs("hdfs:///output2/", "hdfs:///output3/", args[0]);

    StepConfig step3Config = new StepConfig()
            .withName("Phase 3")
            .withHadoopJarStep(jarStep3)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

    HadoopJarStepConfig jarStep4 = new HadoopJarStepConfig()
            .withJar("s3n://dsps162assignment2benasaf/jars/ExtractRelatedPairs.jar")
            .withMainClass("Phase4")
            .withArgs("hdfs:///output3/", "s3n://dsps162assignment2benasaf/output4");

    StepConfig step4Config = new StepConfig()
            .withName("Phase 4")
            .withHadoopJarStep(jarStep4)
            .withActionOnFailure("TERMINATE_JOB_FLOW");

    JobFlowInstancesConfig instances = new JobFlowInstancesConfig()
            .withInstanceCount(10)
            .withMasterInstanceType(InstanceType.M1Small.toString())
            .withSlaveInstanceType(InstanceType.M1Small.toString())
            .withHadoopVersion("2.7.2")
            .withEc2KeyName("AWS")
            .withKeepJobFlowAliveWhenNoSteps(false)
            .withPlacement(new PlacementType("us-east-1a"));

    RunJobFlowRequest runFlowRequest = new RunJobFlowRequest()
            .withName("extract-related-word-pairs")
            .withInstances(instances)
            .withSteps(step1Config, step2Config, step3Config, step4Config)
            .withJobFlowRole("EMR_EC2_DefaultRole")
            .withServiceRole("EMR_DefaultRole")
            .withReleaseLabel("emr-4.6.0")
            .withLogUri("s3n://dsps162assignment2benasaf/logs/");

    System.out.println("Submitting the JobFlow Request to Amazon EMR and running it...");
    RunJobFlowResult runJobFlowResult = mapReduce.runJobFlow(runFlowRequest);
    String jobFlowId = runJobFlowResult.getJobFlowId();
    System.out.println("Ran job flow with id: " + jobFlowId);

}
}

【Comments】:

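A bootstrap action failure means the cluster did not even finish starting up and has not run any of the steps yet. Remove withHadoopVersion; it is not needed when a release label is set. Does a plain EMR cluster with default settings start when launched from the web console?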

Tags: java amazon-web-services emr amazon-emr

【Solution 1】:

A while ago I ran into a similar problem: even a vanilla EMR 4.6.0 cluster could not get past startup and threw a timeout error during the bootstrap step.

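I ended up simply creating the cluster on a different, new VPC in a different region, and it worked fine. This leads me to believe the problem may lie either in the original VPC itself or in the software in 4.6.0.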

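Also, regarding the VPC: it specifically had trouble setting up and resolving DNS names for the newly created cluster nodes, even though older versions of EMR did not have this problem.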
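
To make the DNS point above concrete, here is a minimal sketch of a hostname resolution check, using only the Java standard library. It is not part of the original answer: the class name DnsCheck and its methods are hypothetical, and it assumes you run it from a machine inside the same VPC (for example, one you can SSH into), passing in the private DNS names of the cluster nodes.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper (not from the original post): reports whether
// hostnames resolve from the machine this program runs on.
public class DnsCheck {

    // Returns true if the hostname resolves to at least one address.
    public static boolean resolves(String hostname) {
        try {
            return InetAddress.getAllByName(hostname).length > 0;
        } catch (UnknownHostException e) {
            // The resolver could not map the name to an address.
            return false;
        }
    }

    // Checks several hostnames, preserving the order they were given in.
    public static Map<String, Boolean> checkAll(String... hostnames) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (String hostname : hostnames) {
            results.put(hostname, resolves(hostname));
        }
        return results;
    }

    public static void main(String[] args) {
        // With no arguments, fall back to a name that should always resolve.
        String[] hosts = args.length > 0 ? args : new String[] {"localhost"};
        for (Map.Entry<String, Boolean> entry : checkAll(hosts).entrySet()) {
            System.out.println(entry.getKey() + " -> "
                    + (entry.getValue() ? "resolves" : "does NOT resolve"));
        }
    }
}
```

Running it with the private DNS names of a node as arguments shows which names fail to resolve. If they do fail, one thing worth checking (an assumption on my part, not something the answer states) is whether the VPC has the enableDnsSupport and enableDnsHostnames attributes turned on.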
