How does the cleanup() method work?

【Question title】: How does the cleanup() method work?
【Posted】: 2018-06-26 04:44:38
【Question】:

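I am new to Hadoop. I have been working through this MapReduce code, which finds, for each year, the region of a country with the most "DATA ENGINEER" jobs. For example, if the records have the format (Year, Region, Count(Jobs)) and are "2016,'XYZ',35", "2016,'ABC',25" and "2015,'sdf',14", the answer would be "2016,'XYZ',35" and "2015,'sdf',14". However, I cannot understand the following part of the reducer: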

        if (Top5DataEngineer.size() > 1)
            Top5DataEngineer.remove(Top5DataEngineer.firstKey());
    } // Ignore this bracket for the time being.

    protected void cleanup(Context context) throws IOException,
            InterruptedException {
        for (Text t : Top5DataEngineer.descendingMap().values())
            context.write(NullWritable.get(), t);
    }

Here is the complete code:

    import java.io.IOException;
    import java.util.TreeMap;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Partitioner;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Q_002a {

        public static class Q_002a_Mapper extends
                Mapper<LongWritable, Text, Text, LongWritable> {
            LongWritable one = new LongWritable(1);

            @Override
            public void map(LongWritable key, Text values, Context context)
                    throws IOException, InterruptedException {
                try {
                    // The key is the byte offset of the line, so this skips
                    // the first line of the file (presumably a header row).
                    if (key.get() > 0) {
                        String[] token = values.toString().split("\t");
                        if (token[4].equals("DATA ENGINEER")) {
                            // Output key is token[8] + TAB + token[7]; the
                            // partitioner reads the year from the second field.
                            Text answer = new Text(token[8] + "\t" + token[7]);
                            context.write(answer, one);
                        }
                    }
                } catch (ArrayIndexOutOfBoundsException e) {
                    System.out.println(e.getMessage());
                } catch (ArithmeticException e1) {
                    System.out.println(e1.getMessage());
                }
            }
        }

        // Sends each year to its own reduce task.
        public static class Q_002a_Partitioner extends Partitioner<Text, LongWritable> {
            @Override
            public int getPartition(Text key, LongWritable value, int numReduceTasks) {
                String[] str = key.toString().split("\t");
                if (str[1].equals("2011"))
                    return 0;
                if (str[1].equals("2012"))
                    return 1;
                if (str[1].equals("2013"))
                    return 2;
                if (str[1].equals("2014"))
                    return 3;
                if (str[1].equals("2015"))
                    return 4;
                if (str[1].equals("2016"))
                    return 5;
                else
                    return 6;
            }
        }

        public static class Q_002a_Reducer extends
                Reducer<Text, LongWritable, NullWritable, Text> {
            // Sorted by count (ascending); trimmed to one entry after every reduce() call.
            private TreeMap<LongWritable, Text> Top5DataEngineer = new TreeMap<LongWritable, Text>();
            long sum = 0;

            @Override
            public void reduce(Text key, Iterable<LongWritable> values,
                    Context context) throws IOException, InterruptedException {
                sum = 0;
                for (LongWritable val : values) {
                    sum += val.get();
                }
                Top5DataEngineer.put(new LongWritable(sum), new Text(key + ","
                        + sum));
                // Keep only the entry with the largest count seen so far.
                if (Top5DataEngineer.size() > 1)
                    Top5DataEngineer.remove(Top5DataEngineer.firstKey());
            }

            // Runs once per reduce task, after the last reduce() call.
            @Override
            protected void cleanup(Context context) throws IOException,
                    InterruptedException {
                for (Text t : Top5DataEngineer.descendingMap().values())
                    context.write(NullWritable.get(), t);
            }
        }

        public static void main(String args[]) throws IOException,
                InterruptedException, ClassNotFoundException {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "Top 5 Data Engineer in a worksite");

            job.setJarByClass(Q_002a.class);
            job.setMapperClass(Q_002a_Mapper.class);
            job.setPartitionerClass(Q_002a_Partitioner.class);
            job.setReducerClass(Q_002a_Reducer.class);

            // The partitioner returns 0..6, so seven reduce tasks are needed.
            // With 6, a record from any other year would get an illegal partition.
            job.setNumReduceTasks(7);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(LongWritable.class);

            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

This is the output I get:

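Edit: I tried moving the code from the cleanup() method into the reduce() method, but it did not work as expected. It only works correctly when it is in cleanup(). Any help with this would be much appreciated.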

Tags: hadoop mapreduce reduce

【Solution 1】:

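The cleanup() method is called when a task finishes processing, that is, after its last reduce() call. It is called only once per task.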
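To make the call order concrete, here is a minimal sketch in plain Java (no Hadoop dependency). `SketchReducer` and `runTask` are hypothetical names invented for this illustration; they only mimic the order in which a reduce task invokes setup(), reduce() and cleanup(), and are not the Hadoop classes themselves.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ReducerLifecycleSketch {

    // Hypothetical stand-in for a reducer; it only records which method ran.
    static class SketchReducer {
        final List<String> log = new ArrayList<>();

        void setup() { log.add("setup"); }

        void reduce(String key) { log.add("reduce(" + key + ")"); }

        void cleanup() { log.add("cleanup"); }

        // Same order as a reduce task: setup once, reduce per key, cleanup once.
        void run(List<String> keys) {
            setup();
            try {
                for (String key : keys) {
                    reduce(key);
                }
            } finally {
                cleanup();
            }
        }
    }

    // Runs one simulated task over the given keys and returns the call log.
    static List<String> runTask(List<String> keys) {
        SketchReducer reducer = new SketchReducer();
        reducer.run(keys);
        return reducer.log;
    }

    public static void main(String[] args) {
        // Two keys that would land in the same "2016" partition.
        System.out.println(runTask(Arrays.asList("XYZ", "ABC")));
    }
}
```

However many keys the task receives, "cleanup" appears exactly once, at the end of the log.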

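In your example, the reduce() method is "searching" for the city (worksite) with the largest total of data engineer jobs. The Top5DataEngineer TreeMap keeps its keys in sorted (ascending) order, and on every reduce() call, if the map holds more than one key, it removes the first (smallest) one. In other words, once every key in a partition has been reduced, the map holds the single city with the most jobs for that "year" partition.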
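The trimming step can be tried in isolation. The sketch below uses a plain `TreeMap<Long, String>` instead of the Hadoop `LongWritable`/`Text` types, and the helper name `keepMax` is made up for this example; the put-then-remove-firstKey logic is the same as in the question's reducer.

```java
import java.util.TreeMap;

public class KeepMaxSketch {

    // Feeds (count, site) pairs in one at a time, as successive reduce() calls would,
    // and trims the map back to its largest key after every insertion.
    static TreeMap<Long, String> keepMax(long[] counts, String[] sites) {
        TreeMap<Long, String> top = new TreeMap<>();
        for (int i = 0; i < counts.length; i++) {
            top.put(counts[i], sites[i] + "," + counts[i]);
            if (top.size() > 1) {
                // firstKey() is the smallest key, so the larger count survives.
                top.remove(top.firstKey());
            }
        }
        return top;
    }

    public static void main(String[] args) {
        long[] counts = {25, 35, 14};
        String[] sites = {"ABC", "XYZ", "sdf"};
        System.out.println(keepMax(counts, sites).values());
    }
}
```

One caveat of keying the map by the count: two sites with the same total share a key, so the later put() replaces the earlier entry.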

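When the reduce phase of a task finishes, cleanup() simply writes out the result for the partition that task processed (the single, largest key-value pair left in the Top5DataEngineer map). Because each "year" partition is handled by its own reduce task, cleanup() is called once per "year" partition.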
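This also suggests one likely reason why moving the write loop into reduce() misbehaves: reduce() runs once per key, so the current maximum is written after every key instead of once at the end. The sketch below simulates a single partition in plain Java; `run` and the `emitInReduce` flag are hypothetical names for this illustration, and the three sample rows are assumed to belong to the same partition.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class EmitTimingSketch {

    // Simulates one reduce task. Each (sum, key) pair stands for one reduce() call.
    // emitInReduce = true writes the map inside every "reduce" call;
    // emitInReduce = false writes it once afterwards, as cleanup() does.
    static List<String> run(long[] sums, String[] keys, boolean emitInReduce) {
        TreeMap<Long, String> top = new TreeMap<>();
        List<String> output = new ArrayList<>();
        for (int i = 0; i < sums.length; i++) {
            top.put(sums[i], keys[i] + "," + sums[i]);
            if (top.size() > 1) {
                top.remove(top.firstKey());
            }
            if (emitInReduce) {
                output.addAll(top.descendingMap().values());
            }
        }
        if (!emitInReduce) {
            output.addAll(top.descendingMap().values());
        }
        return output;
    }

    public static void main(String[] args) {
        long[] sums = {25, 35, 14};
        String[] keys = {"ABC", "XYZ", "sdf"};
        System.out.println("write in reduce():  " + run(sums, keys, true));
        System.out.println("write in cleanup(): " + run(sums, keys, false));
    }
}
```

Writing inside the loop yields one line per key (the running maximum at that point), while writing after the loop yields only the final maximum for the partition.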

Hope this helps.
