Hadoop Map Reduce program for hashing: answers

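【Question title】: Hadoop Map Reduce program for hashing
【Posted】: 2014-09-23 09:37:33
【Question description】:

I have written a Map Reduce program in Hadoop that hashes every record of a file, appends the hashed value to each record as an additional attribute, and writes the result to the Hadoop file system. This is the code I have written: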

public class HashByMapReduce
{
    public static class LineMapper extends Mapper<Text, Text, Text, Text>
    {
        private Text word = new Text();

        public void map(Text key, Text value, Context context) throws IOException, InterruptedException
        {
            // Send every record to the same reducer under one fixed key
            key.set("single");
            String line = value.toString();
            word.set(line);
            context.write(key, word);
        }
    }

    public static class LineReducer extends Reducer<Text, Text, Text, Text>
    {
        private Text result = new Text();

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException
        {
            String translations = "";
            for (Text val : values)
            {
                translations = val.toString() + "," + String.valueOf(hash64(val.toString())); // Point of Error
                result.set(translations);
                context.write(key, result);
            }
        }
    }

    public static void main(String[] args) throws Exception
    {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "Hashing");
        job.setJarByClass(HashByMapReduce.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(KeyValueTextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

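The logic behind this code is that each line is read by the map method, which assigns every value to a single key so that they are all passed to the same reduce method. The reducer then passes each value to the hash64() function.

But I see that an empty value is being passed to the hash function, and I don't understand why. Thanks in advance.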

【Question comments】:

Why are you using org.w3c.dom.Text?
Sorry.. I don't know how that got inserted there @ThomasJungblut

Tags: java hadoop mapreduce

【Solution 1】:

The problem is most likely caused by the use of KeyValueTextInputFormat. From the Yahoo tutorial:

  InputFormat:          Description:       Key:                     Value:

  TextInputFormat       Default format;    The byte offset          The line contents 
                        reads lines of     of the line                            
                        text files

  KeyValueInputFormat   Parses lines       Everything up to the     The remainder of                      
                        into key,          first tab character      the line
                        val pairs

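It splits each input line at the tab character. I suppose there is no tab in your lines. As a result, the key in LineMapper is the whole line, while nothing is passed as the value (not sure whether that means null or empty).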
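
To make the splitting rule from the table concrete, here is a minimal plain-Java sketch of it. It does not use Hadoop at all; the class name TabSplitSketch and the method split are made up for illustration. Returning an empty string for the value when there is no tab is an assumption based on my recollection of how Hadoop's line reader behaves, so check it against your Hadoop version.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;

// Hypothetical illustration only: mimics the "split at the first tab" rule
// described in the table, without depending on any Hadoop classes.
class TabSplitSketch {

    // Key = everything before the first tab, value = the remainder of the line.
    // Assumption: a line with no tab becomes (whole line, empty value).
    static Map.Entry<String, String> split(String line) {
        int pos = line.indexOf('\t');
        if (pos < 0) {
            return new SimpleImmutableEntry<>(line, "");
        }
        return new SimpleImmutableEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        // A record that contains a tab is split into key and value.
        System.out.println(split("id1\tsome record"));
        // A record without a tab ends up entirely in the key.
        System.out.println(split("some record without a tab"));
    }
}
```

With tab-free records the mapper in the question therefore overwrites the only copy of the line (the key) with "single" and forwards an empty value, which would explain what the reducer sees.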

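Judging from your code, I think you would be better off using the TextInputFormat class as the input format. It produces the line's byte offset as the key and the complete line as the value. This should solve your problem.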

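Edit: I ran your code with the following changes, and it seems to work fine:

Changed the input format to TextInputFormat and changed the Mapper declaration accordingly.
Added proper setMapOutputKeyClass and setMapOutputValueClass calls to the job. These are not mandatory, but leaving them out often causes problems at run time.
Removed your key.set("single") and added a private outKey field to the Mapper.
Since you did not provide the details of the hash64 method, I used String.toUpperCase for testing.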

If the problem persists, then I am sure your hash method does not handle null well.
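
Since hash64 was never posted, the following is only a hypothetical stand-in showing one way a 64-bit hash could tolerate null or empty input. It uses the FNV-1a algorithm, which is my choice for the sketch and not necessarily what the asker's hash64 does. Treating null as the empty string is also an assumption; throwing an exception would be an equally valid design.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for the asker's hash64, which was not shown.
// Implements 64-bit FNV-1a over the UTF-8 bytes of the input.
class Hash64Sketch {
    private static final long FNV_OFFSET_BASIS = 0xcbf29ce484222325L;
    private static final long FNV_PRIME = 0x100000001b3L;

    static long hash64(String s) {
        // Assumption: null is treated like an empty record instead of failing.
        if (s == null) {
            s = "";
        }
        long h = FNV_OFFSET_BASIS;
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            h ^= (b & 0xff);   // mix in the next byte as an unsigned value
            h *= FNV_PRIME;    // multiplication wraps modulo 2^64
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(Long.toHexString(hash64("some record")));
        System.out.println(Long.toHexString(hash64(null)));
    }
}
```

In the reducer this would be called as hash64(val.toString()). Note that Java's long is signed, so String.valueOf on the result may print a negative number.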

Full code:

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class HashByMapReduce {

    // TextInputFormat supplies the byte offset as the key and the line as the value.
    public static class LineMapper extends
            Mapper<LongWritable, Text, Text, Text> {
        private Text word = new Text();
        // Fixed output key so that every record reaches the same reducer
        private Text outKey = new Text("single");

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            word.set(line);
            context.write(outKey, word);
        }
    }

    public static class LineReducer extends Reducer<Text, Text, Text, Text> {
        private Text result = new Text();

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String translations = "";
            for (Text val : values) {
                // toUpperCase stands in for hash64, which was not provided
                translations = val.toString() + ","
                        + val.toString().toUpperCase();

                result.set(translations);
                context.write(key, result);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Deprecated in Hadoop 2.x, where Job.getInstance(conf, "Hashing") is preferred
        Job job = new Job(conf, "Hashing");
        job.setJarByClass(HashByMapReduce.class);
        job.setMapperClass(LineMapper.class);
        job.setReducerClass(LineReducer.class);
        // Declare the mapper's output types explicitly
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

}

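【Comments】:

Then I will check your code again, but keep in mind that KeyValueInputFormat will always produce wrong results here unless it is used intentionally.
Which version of Hadoop are you using?
You are golden!! It worked.. I was missing these: job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(Text.class);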