Hadoop is skipping reduce phase entirely

【Question title】: Hadoop is skipping reduce phase entirely
【Posted】: 2015-12-04 22:12:52
【Question】:

I have set up a Hadoop job like this:

public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    Job job = Job.getInstance(conf, "Legion");
    job.setJarByClass(Legion.class);

    job.setMapperClass(CallQualityMap.class);
    job.setReducerClass(CallQualityReduce.class);

    // Explicitly configure map and reduce outputs, since they're different classes
    job.setMapOutputKeyClass(CallSampleKey.class);
    job.setMapOutputValueClass(CallSample.class);
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);

    job.setInputFormatClass(CombineRepublicInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);

    CombineRepublicInputFormat.setMaxInputSplitSize(job, 128000000);
    CombineRepublicInputFormat.setInputDirRecursive(job, true);
    CombineRepublicInputFormat.addInputPath(job, new Path(args[0]));

    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.waitForCompletion(true);
}

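The job completes, but something strange happens. I get one output line per input line. Each output line consists of the output of CallSampleKey.toString(), then a tab, then something like CallSample@17ab34d.

This means the reduce phase never runs, and CallSampleKey and CallSample are passed straight to TextOutputFormat. But I don't understand why that would happen. I have clearly specified job.setReducerClass(CallQualityReduce.class);, so I have no idea why it would skip the reducer!

Edit: here is the code for the reducer: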

public static class CallQualityReduce extends Reducer<CallSampleKey, CallSample, NullWritable, Text> {

    public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {
        Call call = new Call(inKey.getId().toString(), inKey.getUuid().toString());

        while (inValues.hasNext()) {
            call.addSample(inValues.next());
        }

        context.write(NullWritable.get(), new Text(call.getStats()));
    }
}

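【Comments】:

Please add the code for the CallQualityMap and CallQualityReduce classes.
I don't think it's relevant, but I went ahead and added the code. Thanks.
Can you attach the job.xml file for this run?
This is running on AWS Elastic MapReduce. Any idea where I can find the job.xml file? I can SSH into the master node if necessary.
The fact that the reducer writes a null key tells me the reducer isn't running. The output is consistent with what would happen if the mapper output went straight to the output format, not with what would happen if the reducer were blowing up.

Tags: java hadoop mapreduce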

【Solution 1】:

What if you change your

public void reduce(CallSampleKey inKey, Iterator<CallSample> inValues, Context context) throws IOException, InterruptedException {

to use Iterable instead of Iterator?

public void reduce(CallSampleKey inKey, Iterable<CallSample> inValues, Context context) throws IOException, InterruptedException {

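You will then have to call inValues.iterator() to get the actual iterator (or simply loop over inValues with a for-each loop).

If the method signature doesn't match, your method is an overload rather than an override, so the framework never calls it and the job falls through to the default identity reducer implementation. It's perhaps unfortunate that the underlying default implementation doesn't make this kind of typo easy to detect, but the next best thing is to always put @Override on every method you intend to override, so that the compiler can help.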
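To make the overload-versus-override point concrete, here is a minimal sketch in plain Java that does not depend on Hadoop. BaseReducer is a hypothetical stand-in for Hadoop's Reducer, assuming only what is described above: its default reduce() passes every value through unchanged, and the framework only ever calls the Iterable version. The class names, the String/Integer types, and the "sum=" output are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class OverrideDemo {

    // Hypothetical stand-in for Hadoop's Reducer: the default reduce()
    // is an identity pass-through that emits every value unchanged.
    static class BaseReducer {
        final List<String> output = new ArrayList<>();

        protected void reduce(String key, Iterable<Integer> values) {
            for (Integer value : values) {
                output.add(key + "\t" + value);
            }
        }

        // Plays the role of the framework: it only calls the Iterable version.
        final void run(String key, List<Integer> values) {
            reduce(key, values);
        }
    }

    // Mirrors the bug in the question: taking Iterator instead of Iterable
    // makes this an overload, not an override, so run() never reaches it.
    // Putting @Override on this method would be a compile-time error.
    static class BrokenReducer extends BaseReducer {
        public void reduce(String key, Iterator<Integer> values) {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next();
            }
            output.add("sum=" + sum);
        }
    }

    // Matching signature plus @Override: this replaces the identity default.
    static class FixedReducer extends BaseReducer {
        @Override
        protected void reduce(String key, Iterable<Integer> values) {
            int sum = 0;
            for (int value : values) {
                sum += value;
            }
            output.add("sum=" + sum);
        }
    }

    // Feeds one key with three values through the given reducer.
    static List<String> runWith(BaseReducer reducer) {
        reducer.run("k", Arrays.asList(1, 2, 3));
        return reducer.output;
    }

    public static void main(String[] args) {
        // Identity output: one entry per input value.
        System.out.println(runWith(new BrokenReducer()));
        // A single aggregated entry.
        System.out.println(runWith(new FixedReducer()));
    }
}
```

With BrokenReducer the output has one entry per input value, which matches the symptom in the question. With FixedReducer there is a single aggregated entry. Annotating the Iterator version with @Override would turn the silent fallback into a compile error.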

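【Discussion】:

Wow! I'm at home now, but I think you're right. I'll test it in the morning. I can't believe it was such a stupidly simple oversight. If this works, I'll make sure you get your sweet, sweet internet points.
Yes, that was it. Thanks!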