Hadoop MapReduce Mapper reads incorrect values from a text file

【Question title】: hadoop mapreduce Mapper reading incorrect value from text file
【Posted】: 2015-03-30 22:32:06
【Question description】:

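I am writing a MapReduce program that processes a text file and appends a string to each line. The problem is that the Text value passed to the mapper's map method is incorrect.

Whenever a line in the file is shorter than the previous line, a few extra characters are appended to it, so that its length equals the length of the previously read line.

The map method signature is as follows: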

@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {

I am logging the value inside the map method, and that is where I observe this behavior. Any pointers?

Code snippets

Driver

Configuration configuration = new Configuration();
configuration.set("CLIENT_ID", "Test");
Job job = Job.getInstance(configuration, JOB_NAME);
job.setJarByClass(JobDriver.class);
job.setMapperClass(AdwordsMapper.class);
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(TextOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(Text.class);
FileInputFormat.setInputPaths(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
FileOutputFormat.setCompressOutput(job, true);
FileOutputFormat.setOutputCompressorClass(job, GzipCodec.class);

System.exit(job.waitForCompletion(true) ? 0 : 1);


Mapper

public class AdwordsMapper extends Mapper<LongWritable, Text, Text, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String textLine = new String(value.getBytes());

        textLine = new StringBuffer(textLine).append(",")
                .append(context.getConfiguration().get("CLIENT_ID")).toString();
        context.write(new Text(""), new Text(textLine));

    }

}

【Question comments】:

Can you post your code?
Added code snippets for the driver and mapper classes.

Tags: java hadoop mapreduce

【Solution 1】:

As far as I can tell, the problem in the mapper is the call to getBytes().

Instead of this:

   String textLine = new String(value.getBytes());

try this:

   String textLine = value.toString();
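
A minimal sketch of why this fix works. Hadoop reuses the same Text object across calls to map, and Text.getBytes() returns the whole backing array, of which only the first getLength() bytes are valid. After a long line, a shorter line overwrites only the start of that array, so the tail of the previous line is still there. The ReusableText class below is a hypothetical, simplified stand-in for Text written in plain Java (it is not the library's implementation), and wrongRead and rightRead are illustrative names.

```java
import java.nio.charset.StandardCharsets;

class ReusedBufferDemo {

    /** Hypothetical, simplified stand-in for org.apache.hadoop.io.Text. */
    static final class ReusableText {
        private byte[] bytes = new byte[0];
        private int length = 0;

        void set(String s) {
            byte[] src = s.getBytes(StandardCharsets.UTF_8);
            if (src.length > bytes.length) {
                bytes = new byte[src.length]; // the buffer grows but never shrinks
            }
            System.arraycopy(src, 0, bytes, 0, src.length);
            length = src.length; // only bytes[0..length) are valid
        }

        byte[] getBytes() { return bytes; }

        int getLength() { return length; }

        @Override
        public String toString() {
            return new String(bytes, 0, length, StandardCharsets.UTF_8);
        }
    }

    /** Mirrors the question's code: decodes the entire backing array. */
    static String wrongRead(ReusableText value) {
        return new String(value.getBytes(), StandardCharsets.UTF_8);
    }

    /** Mirrors the fix: decodes only the valid region. */
    static String rightRead(ReusableText value) {
        return value.toString();
    }

    public static void main(String[] args) {
        ReusableText value = new ReusableText();
        value.set("a long first line"); // first record
        value.set("short");             // second record reuses the same object
        System.out.println(wrongRead(value)); // stale tail of the first line is included
        System.out.println(rightRead(value)); // only the second line
    }
}
```

If the raw bytes are really needed, bounding the read with getLength(), as in new String(value.getBytes(), 0, value.getLength(), StandardCharsets.UTF_8), should avoid the stale tail as well.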

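【Comments】:

Thanks, Sravan. That solved the problem. One more question, about the output key: when I pass an empty Text as the key, the output file contains a tab character. Is there a way to specify a key without generating any extra characters?
Used NullWritable as the key. Also, I have multiple files in the input folder. Currently only one output file is generated after all input files have been processed. Can we generate one output file per input file?
Done by overriding the setup method in the mapper and writing output based on the input file path. Everything works now!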
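
On the tab character mentioned in the comments above: TextOutputFormat writes each record as key, separator (a tab by default), value, and it skips both the key and the separator when the key is null or a NullWritable. An empty Text is still a key, so the tab is written. The sketch below is a simplified plain-Java model of that rule, not Hadoop's actual code; NO_KEY is a hypothetical stand-in for NullWritable.get(), and formatRecord is an illustrative name.

```java
class TabSeparatorDemo {

    /** Hypothetical stand-in for NullWritable.get(): a marker meaning "nothing here". */
    static final Object NO_KEY = new Object();

    /** Simplified model of how a text record writer formats one output line. */
    static String formatRecord(Object key, Object value) {
        boolean noKey = key == null || key == NO_KEY;
        boolean noValue = value == null || value == NO_KEY;
        if (noKey && noValue) {
            return ""; // nothing to write at all
        }
        StringBuilder line = new StringBuilder();
        if (!noKey) {
            line.append(key);
        }
        if (!noKey && !noValue) {
            line.append('\t'); // separator only when both parts are present
        }
        if (!noValue) {
            line.append(value);
        }
        return line.append('\n').toString();
    }

    public static void main(String[] args) {
        // An empty string still counts as a key, so the line starts with a tab.
        System.out.println(formatRecord("", "row,Test").startsWith("\t"));
        // With the "no key" marker, the line starts with the value itself.
        System.out.println(formatRecord(NO_KEY, "row,Test").startsWith("\t"));
    }
}
```

In the real job, this would correspond to declaring the mapper as Mapper<LongWritable, Text, NullWritable, Text>, calling context.write(NullWritable.get(), new Text(textLine)), and setting job.setOutputKeyClass(NullWritable.class) in the driver.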