How to get the current filename in Hadoop Reduce

【Question title】: How to get the current filename in Hadoop Reduce
【Posted】: 2013-12-18 02:40:28
【Question】:

I am working with the WordCount example, and I need to get the filename inside the Reduce function.

public static class Reduce extends MapReduceBase implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    String filename = ((FileSplit)(.getContext()).getInputSplit()).getPath().getName();
    // ----------------------------^ I need to get the context and filename!
    key.set(key.toString() + " (" + filename + ")");
    output.collect(key, new IntWritable(sum));
  }
}

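The code above is my modified version so far. In it, I want to get the name of the file each word came from so I can print it next to the word. I tried following "Java Hadoop: How can I create mappers that take as input files and give an output which is the number of lines in each file?", but I could not get hold of the context object.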

I am new to Hadoop and would appreciate any help.

【Comments】:

Tags: java hadoop

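【Solution 1】:

You can't get the context because context is a construct of the "new API", and you are using the "old API".

Take a look at this word count example: http://wiki.apache.org/hadoop/WordCount

Look at the signature of the reduce function there: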

public void reduce(Text key, Iterable<IntWritable> values, Context context)

See? The context! Note that this example imports from .mapreduce. rather than .mapred..

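This is a common stumbling block for new Hadoop users, so don't feel bad. In general you want to stick with the new API, for a number of reasons. Be very careful about the examples you find, though. Also note that the new API and the old API are not interoperable (for example, you can't pair a new-API mapper with an old-API reducer).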

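【Comments】:

Just curious: what are the reasons to prefer the new API over the old one? I thought both would remain supported, but maybe I'm not up to date.
How do I get the filename in the reduce function with the old API?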

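【Solution 2】:

With the old MR API (the org.apache.hadoop.mapred package), add the following to the mapper class. The map.input.file property is set only for map tasks, so the same lookup in a reducer returns null.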

private String fileName = "";

@Override
public void configure(JobConf job) {
    // Full path of the file this map task is reading.
    fileName = job.get("map.input.file");
}

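With the new MR API (the org.apache.hadoop.mapreduce package), add the following to the mapper class. getInputSplit() is available on the mapper's Context but not on the reducer's.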

private String fileName = "";

@Override
protected void setup(Context context) throws java.io.IOException, InterruptedException {
    // FileSplit here is org.apache.hadoop.mapreduce.lib.input.FileSplit.
    fileName = ((FileSplit) context.getInputSplit()).getPath().toString();
}
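
The snippet above stores getPath().toString(), which is the full path, whereas the question asks for just the file name (getPath().getName(), the final path component). As a rough, Hadoop-free illustration of the difference, the sketch below takes the last component of a path string. The class PathNames, the method lastComponent, and the example HDFS URI are all hypothetical, and the helper only approximates what Hadoop's Path.getName() returns.

```java
import java.net.URI;

// Hypothetical helper, not part of Hadoop.
final class PathNames {

    // Returns the text after the last '/' in the path part of the URI.
    // Assumes a hierarchical URI or a plain path without spaces.
    static String lastComponent(String fullPath) {
        String path = URI.create(fullPath).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Made-up example location; any host and directory would do.
        String full = "hdfs://namenode:8020/user/demo/input/a.txt";
        System.out.println(full);
        System.out.println(lastComponent(full));
    }
}
```

If only the short name is needed in the output, calling getName() instead of toString() in the setup method avoids this extra step.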

【Comments】:

【Solution 3】:

I used this approach and it works well!

public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
    // The input split is the same for every token in this record, so look it up once.
    FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
    String filename = fileSplit.getPath().getName();
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      // filename is available here, but only the bare word is emitted.
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}

Let me know if it can be improved!
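
One possible improvement, sketched under assumptions: the mapper above looks up the filename but never emits it, and a reduce task has no input split of its own, so the filename has to travel with the key. The plain-Java simulation below has no Hadoop dependency. The class TaggedWordCount and its methods mapLine and reduce are hypothetical names, and the file names in main are made up. The map step emits "word (filename)" keys, matching the format the question builds, and the reduce step sums the counts per key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hadoop-free simulation of tagging each word with its source file.
final class TaggedWordCount {

    // Map step: one composite key "word (filename)" per token in the line.
    static List<String> mapLine(String filename, String line) {
        List<String> keys = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            keys.add(tokenizer.nextToken() + " (" + filename + ")");
        }
        return keys;
    }

    // Reduce step: count how many times each composite key was emitted.
    static Map<String, Integer> reduce(List<String> keys) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : keys) {
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        emitted.addAll(mapLine("a.txt", "hello world hello"));
        emitted.addAll(mapLine("b.txt", "hello hadoop"));
        // Prints one line per composite key with its total.
        reduce(emitted).forEach((key, sum) -> System.out.println(key + "\t" + sum));
    }
}
```

In the real job, the equivalent change would be to call word.set(tokenizer.nextToken() + " (" + filename + ")") in the mapper and leave the reducer as the stock WordCount sum, since the key it receives already carries the filename.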

【Comments】: