Cannot read bz2 compressed file in hadoop job: answers

【Question title】: Cannot read bz2 compressed file in hadoop job
【Posted】: 2013-11-18 09:21:51
【Question description】:

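I have an XML file that I want to process in a MapReduce job. I can process it when it is uncompressed, but it does not work once I compress it to bz2 format and store it in HDFS. Do I need to change something, such as specifying the codec to use? I don't know where to do that. Any example would be great. I am using the XMLInputFormat from Mahout to read the uncompressed XML file. I compressed the file with the bzip2 command and copied it to DFS with hadoop dfs -copyFromLocal. I am interested in reading and processing the content inside the <page></page> tags of the XML document. I am using the hadoop-1.2.1 distribution. I can see that there is a FileOutputFormat.setOutputCompressorClass, but nothing similar for FileInputFormat.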

Here is the Main class of my job.

    public class Main extends Configured implements Tool {

        public static void main(String[] args) throws Exception {
            int res = ToolRunner.run(new Configuration(), new Main(), args);
            System.exit(res);
        }

        public int run(String[] args) throws Exception {

            if (args.length != 2) {
                System.err.println("Usage: hadoop jar XMLReaderMapRed "
                        + " [generic options] <in> <out>");
                System.out.println();
                ToolRunner.printGenericCommandUsage(System.err);
                return 1;
            }

            Job job = new Job(getConf(), "XMLTest");

            job.setInputFormatClass(MyXMLInputFormat.class);
            //Specify the start and end tag that has content
            getConf().set(MyXMLInputFormat.START_TAG_KEY, "<page>");
            getConf().set(MyXMLInputFormat.END_TAG_KEY, "</page>");

            job.setJarByClass(getClass());
            job.setMapperClass(XMLReaderMapper.class);
            job.setReducerClass(XmlReaderReducer.class);

            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            boolean success = job.waitForCompletion(true);
            return success ? 0 : 1;
        }
    }

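Edit: Hadoop - The Definitive Guide by Tom White mentions that "If your input files are compressed, they will be automatically decompressed as they are read by MapReduce, using the filename extension to determine the codec to use." So the file is decompressed automatically, but why is an empty file created in the output directory?

Thanks!

【Question comments】:

Tags: xml hadoop mapreduce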

【Solution 1】:

You should look at your core-site.xml configuration file and add a class for the BZip2 codec if it is not already there. Here is an example:

<property>
    <name>io.compression.codecs</name>
    <value>org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec,com.hadoop.compression.lzo.LzoCodec,com.hadoop.compression.lzo.LzopCodec,org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
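
To show why the codec has to be registered for the file extension to matter, here is a minimal sketch of suffix-based codec lookup. It uses only the JDK and is not Hadoop's own code: the class `CodecLookupSketch` and its methods are hypothetical names made up for illustration. In a real job this role is played by `CompressionCodecFactory`, which is built from the codecs listed in `io.compression.codecs`.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical stand-in for suffix-based codec lookup; not a Hadoop class. */
class CodecLookupSketch {
    // File suffix -> codec class name, filled from the registered codec list.
    private final Map<String, String> bySuffix = new LinkedHashMap<>();

    /** Registers a codec class name under the file suffix it handles. */
    void register(String suffix, String codecClassName) {
        bySuffix.put(suffix, codecClassName);
    }

    /** Returns the codec whose suffix matches the file name, if any. */
    Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : bySuffix.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        // No registered suffix matched: the file would be read as raw bytes.
        return Optional.empty();
    }

    public static void main(String[] args) {
        CodecLookupSketch lookup = new CodecLookupSketch();
        lookup.register(".bz2", "org.apache.hadoop.io.compress.BZip2Codec");
        lookup.register(".gz", "org.apache.hadoop.io.compress.GzipCodec");

        System.out.println(lookup.codecFor("pages.xml.bz2"));
        System.out.println(lookup.codecFor("pages.xml"));
    }
}
```

If the BZip2 codec is missing from the registered list, the lookup for a `.bz2` file comes back empty, which is the situation the property above is meant to rule out.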

Edit:

After adding the codec, please run through the following steps to check whether it works (it may be your code that is at fault):

hadoop fs -mkdir /tmp/wordcount/
echo "three one three three seven" >> /tmp/words
bzip2 -z /tmp/words
hadoop fs -put /tmp/words.bz2 /tmp/wordcount/
hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/wordcount/ /tmp/wordcount_out/
hadoop fs -text /tmp/wordcount_out/part*
# you should see the following three lines:
one     1
seven   1
three   3
# clean up
# these commands may be different in your case
hadoop fs -rmr /tmp/wordcount_out/
hadoop fs -rmr /tmp/wordcount/

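【Comments】:

No change. I added the codecs as instructed. lzo is not on my machine, so I removed those entries. I still get an empty file after the job finishes. Please let me know if I should provide more information.
I guess the problem is in your job, not in the configuration. Please run the wordcount example with a bz2 file as input and tell me what happens. The command should look something like: hadoop jar /usr/lib/hadoop/hadoop-examples.jar wordcount /tmp/wordcount/ /tmp/wordcount_out/ and you can check the output with the 'hadoop fs' util: hadoop fs -text /tmp/wordcount_out/part*
Please see my updated answer. I added the exact steps to check whether the bz2 codec works.
I have checked this with the wordcount job packaged with hadoop. It works and gives the correct output. I am using XMLInputFormat to read data from the XML file. This works fine when the file is uncompressed.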

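【Solution 2】:

In your TextInputFormat implementation, you are probably overriding createRecordReader and returning a custom implementation of RecordReader<KEYIN, VALUEIN> that does not take codecs into account. The default implementation returns LineRecordReader, which handles codecs properly. A reference implementation can be found here, and the relevant changes needed here.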
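
The effect described here can be reproduced outside Hadoop. The following is a minimal sketch using only the JDK, under several assumptions: `PageScanSketch`, `extractPages` and `gzip` are hypothetical names, and GZIP stands in for bzip2 because the JDK does not ship a bzip2 stream. It scans a stream for `<page>…</page>` blocks twice: once over the raw compressed bytes (what a reader that ignores the codec would see) and once over a decompressing wrapper.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical illustration; not the Mahout XmlInputFormat record reader. */
class PageScanSketch {
    static final String START_TAG = "<page>";
    static final String END_TAG = "</page>";

    /** Reads the whole stream and returns every <page>...</page> block in it. */
    static List<String> extractPages(InputStream in) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[4096];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        String text = new String(buffer.toByteArray(), StandardCharsets.UTF_8);

        List<String> pages = new ArrayList<>();
        int from = 0;
        while (true) {
            int start = text.indexOf(START_TAG, from);
            if (start < 0) break;
            int end = text.indexOf(END_TAG, start + START_TAG.length());
            if (end < 0) break;
            end += END_TAG.length();
            pages.add(text.substring(start, end));
            from = end;
        }
        return pages;
    }

    /** Compresses data with GZIP (a stand-in for bzip2 in this sketch). */
    static byte[] gzip(byte[] data) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String xml = "<root><page>a</page><page>b</page></root>";
        byte[] compressed = gzip(xml.getBytes(StandardCharsets.UTF_8));

        // Codec ignored: the tag scan runs over compressed bytes.
        List<String> raw = extractPages(new ByteArrayInputStream(compressed));
        // Codec applied: the stream is wrapped in a decompressor first.
        List<String> decoded = extractPages(
                new GZIPInputStream(new ByteArrayInputStream(compressed)));

        System.out.println("pages found in raw bytes: " + raw.size());
        System.out.println("pages found after decompression: " + decoded.size());
    }
}
```

The raw scan never sees the start tag, so it yields no records, which would be consistent with a job that finishes and writes an empty output file. In a Hadoop record reader the corresponding step would presumably be to look up the codec for the split's path with `CompressionCodecFactory.getCodec(path)` and, when it is not null, wrap the opened file stream with `codec.createInputStream(...)` before scanning for tags.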

【Comments】: