Using Hadoop Counters - Multiple jobs: answers

【Question title】: Using Hadoop Counters - Multiple jobs
【Posted】: 2016-07-13 18:31:20
【Question】:

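I am working on a MapReduce project using Hadoop. I currently have 3 sequential jobs.

I want to use Hadoop counters, but the problem is that I want to do the actual counting in the first job and read the counter value in the reducer of the third job.

How can I achieve this? Where should I define the enum? Do I need to pass it through the second job? A code example would also help, since I have not been able to find anything on this yet.

Note: I am using Hadoop 2.7.2.

EDIT: I have already tried the approach explained here, without success. My case is different because I want to access the counters from a different job (not from the mapper to the reducer of the same job).

What I tried to do. First job: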

public static void startFirstJob(String inputPath, String outputPath) throws IOException, ClassNotFoundException, InterruptedException {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordCount");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(WordCountMapper.class);
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(LongWritable.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputFormatClass(SequenceFileOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(inputPath));
    FileOutputFormat.setOutputPath(job, new Path(outputPath));
    job.waitForCompletion(true);
}

The counter enum is defined in a separate class:

public class CountersClass {
    public static enum N_COUNTERS {
        SOMECOUNT
    }
}

Attempt to read the counter:

Cluster cluster = new Cluster(context.getConfiguration());
Job job = cluster.getJob(JobID.forName("wordCount"));
Counters counters = job.getCounters();
CountersClass.N_COUNTERS mycounter = CountersClass.N_COUNTERS.valueOf("SOMECOUNT");
Counter c1 = counters.findCounter(mycounter);
long N_Count = c1.getValue();

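【Question comments】:

Possible duplicate of "Is there a way to access number of successful map tasks from a reduce task in an MR job?"
I don't think using counters inside the reduce job is a good idea. See stackoverflow.com/questions/8009802/…
Yes, I have seen that and I tried that approach. But in that case he wants to read the counter inside the reducer of the same job. That is not the same as my case.

Tags: java hadoop mapreduce counter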

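【Solution 1】:

The classic solution is to put the job's counter value into the configuration of the subsequent job that needs to access it.

First, make sure the counter is incremented correctly in the mapper/reducer of the counting job: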

context.getCounter(CountersClass.N_COUNTERS.SOMECOUNT).increment(1);

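Then, once the counting job has completed (and before the job that needs the value is submitted, since a job picks up its configuration at submission):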

job.waitForCompletion(true);

Counter someCount = job.getCounters().findCounter(CountersClass.N_COUNTERS.SOMECOUNT);

//put counter value into conf object of the job where you need to access it
//you can choose any name for the conf key really (i just used counter enum name here)
job2.getConfiguration().setLong(CountersClass.N_COUNTERS.SOMECOUNT.name(), someCount.getValue());

The next part is reading it in the mapper/reducer of the other job. Just override setup(), for example:

private long someCount;

@Override
protected void setup(Context context) throws IOException,
    InterruptedException {
  super.setup(context);
  // Falls back to 0 if the driver never set the key.
  this.someCount = context.getConfiguration().getLong(CountersClass.N_COUNTERS.SOMECOUNT.name(), 0L);
}

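【Comments】:

Thanks! What if I have multiple counters in this enum? Can I replace setLong and getLong with setEnum and getEnum? Or do I need to do what you described for every counter?
Each enum item should correspond to a separate configuration key. You still use setLong/getLong to access them through their respective keys.
I know this is an old question. But suppose the jobs are launched with some delay: won't the delayed job overwrite the counter set by the job that was launched earlier, while running on the cluster?
The answer above assumes that the 2 jobs are executed from a driver in the same JVM instance. If you are talking about accessing counters from a previously run job, you are better off storing its result somewhere so you can access it later.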
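
For the multiple-counter case raised in the comments above, the "one configuration key per enum item" idea can be sketched roughly as follows. This is only an illustration under stated assumptions: a plain Map stands in for Hadoop's Configuration so the sketch runs without a cluster, OTHERCOUNT and the class/helper names (CounterKeys, keyFor, toConf, read) are hypothetical, and the enum-name prefix on the key is a design choice of this sketch, not something the answer prescribes. In a real driver each put would presumably be job2.getConfiguration().setLong(key, value), and each read would be context.getConfiguration().getLong(key, 0L) inside setup().

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.Map;

public class CounterKeys {
    // Mirrors the enum from the question; OTHERCOUNT is a hypothetical second counter.
    public enum N_COUNTERS { SOMECOUNT, OTHERCOUNT }

    // One configuration key per enum constant, e.g. "N_COUNTERS.SOMECOUNT".
    // The prefix is an assumption of this sketch, used to avoid key clashes.
    public static String keyFor(Enum<?> counter) {
        return counter.getDeclaringClass().getSimpleName() + "." + counter.name();
    }

    // Driver side: copy every counter value under its own key.
    // With Hadoop this loop would call conf.setLong(key, value) instead of put.
    public static Map<String, Long> toConf(Map<N_COUNTERS, Long> counterValues) {
        Map<String, Long> conf = new HashMap<>();
        for (Map.Entry<N_COUNTERS, Long> e : counterValues.entrySet()) {
            conf.put(keyFor(e.getKey()), e.getValue());
        }
        return conf;
    }

    // Task side: read one counter back, defaulting to 0 when the key is absent.
    // With Hadoop this would be conf.getLong(key, 0L).
    public static long read(Map<String, Long> conf, N_COUNTERS counter) {
        return conf.getOrDefault(keyFor(counter), 0L);
    }

    public static void main(String[] args) {
        Map<N_COUNTERS, Long> values = new EnumMap<>(N_COUNTERS.class);
        values.put(N_COUNTERS.SOMECOUNT, 42L);
        values.put(N_COUNTERS.OTHERCOUNT, 7L);

        Map<String, Long> conf = toConf(values);
        System.out.println(read(conf, N_COUNTERS.SOMECOUNT));
        System.out.println(read(conf, N_COUNTERS.OTHERCOUNT));
    }
}
```

Looping over the enum keeps the driver and the tasks in agreement on key names, so adding a counter to the enum does not require touching the copying code.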

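【Solution 2】:

At the end of the first job, get the counters, write their values to a file, and read that file in the subsequent job. If you want to read the value from a reducer, write it to HDFS; if you will read it and initialize things in the application (driver) code, write it to a local file.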

Counters counters = job.getCounters();
Counter c1 = counters.findCounter(COUNTER_NAME);
System.out.println(c1.getDisplayName() + ":" + c1.getValue());

Reading and writing files is covered in the basic tutorials.
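
A minimal sketch of the local-file variant, assuming the driver writes the value after the first job and reads it back before configuring a later one. It uses only the JDK (java.util.Properties and java.nio.file); the class and method names (CounterFile, write, read) and the "SOMECOUNT" key are hypothetical. For the HDFS variant, the same write/read pair would presumably go through Hadoop's FileSystem API (FileSystem.get(conf), then create and open) rather than java.nio; that part is not shown here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Properties;

public class CounterFile {
    // Persist a single counter value as a name=value line in a properties file.
    public static void write(Path file, String name, long value) throws IOException {
        Properties props = new Properties();
        props.setProperty(name, Long.toString(value));
        try (Writer out = Files.newBufferedWriter(file, StandardCharsets.UTF_8)) {
            props.store(out, "counters from the first job");
        }
    }

    // Load the file and return the named counter, or defaultValue if it is absent.
    public static long read(Path file, String name, long defaultValue) throws IOException {
        Properties props = new Properties();
        try (Reader in = Files.newBufferedReader(file, StandardCharsets.UTF_8)) {
            props.load(in);
        }
        String raw = props.getProperty(name);
        return raw == null ? defaultValue : Long.parseLong(raw);
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("counters", ".properties");
        // In a real driver the value would come from
        // job.getCounters().findCounter(...).getValue() after waitForCompletion.
        write(file, "SOMECOUNT", 12345L);
        System.out.println(read(file, "SOMECOUNT", 0L));
        Files.delete(file);
    }
}
```

Unlike the configuration approach, a file survives the driver JVM, which is what makes it usable when the later job is launched separately.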

【Comments】:

This could be an option. Could you add the required code? Thanks.