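【Posted】: 2016-11-30 07:34:39
【Question】:
I am learning MapReduce, and I want to read an input file sentence by sentence and write each sentence to the output file only if it does not contain the word "snake".
Example input file: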
This is my first sentence. This is my first sentence.
This is my first sentence.
The snake is an animal. This is the second sentence. This is my third sentence.
Another sentence. Another sentence with snake.
The output file should then be:
This is my first sentence. This is my first sentence.
This is my first sentence.
This is the second sentence. This is my third sentence.
Another sentence.
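To do this, in the map method I check whether the sentence (value) contains the word snake. If it does not, I write that sentence to the context.
I also set the number of reducer tasks to 0; otherwise the sentences appear in the output file in random order (for example the first sentence, then the third, then the second, and so on).
My code does correctly filter out the sentences that contain the word snake, but the problem is that it writes each sentence on a new line, like this: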
This is my first sentence.
This is my first sentence.
This is my first sentence.
This is the second sentence.
This is my third sentence.
Another sentence.
How can I write a sentence on a new line only when it starts on a new line in the input text? Here is my code:
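import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoveSentence {
    public static class SentenceMapper extends Mapper<Object, Text, Text, NullWritable> {
        private Text removeWord = new Text("snake");

        // Each value is one record, i.e. the text between two "." delimiters.
        @Override
        public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
            if (!value.toString().contains(removeWord.toString())) {
                // Re-append the period that the record delimiter consumed.
                Text currentSentence = new Text(value.toString() + ". ");
                context.write(currentSentence, NullWritable.get());
            }
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Split the input into records at every period instead of at every newline.
        conf.set("textinputformat.record.delimiter", ".");
        Job job = Job.getInstance(conf, "remove sentence");
        job.setJarByClass(RemoveSentence.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(NullWritable.class);
        job.setMapperClass(SentenceMapper.class);
        // Map-only job: no reducers, so no shuffle reorders the sentences.
        job.setNumReduceTasks(0);
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}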
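One possible direction, offered as a sketch rather than a confirmed fix: the default text output format ends every written record with a newline, so once records are split on "." each surviving sentence necessarily lands on its own line. If the record delimiter is left at its default (newline), each record is a whole input line, and the filtering can happen inside the line. The plain-Java example below illustrates only that per-line filtering; the class `LineSentenceFilter` and the helper `filterLine` are hypothetical names, and the code assumes sentences end with a period followed by whitespace and never span two lines.

```java
import java.util.ArrayList;
import java.util.List;

public class LineSentenceFilter {
    // Hypothetical helper: removes every sentence containing `word` from one input line.
    public static String filterLine(String line, String word) {
        List<String> kept = new ArrayList<>();
        // Split at whitespace that follows a period; the period stays with its sentence.
        for (String sentence : line.split("(?<=\\.)\\s+")) {
            String trimmed = sentence.trim();
            if (!trimmed.isEmpty() && !trimmed.contains(word)) {
                kept.add(trimmed);
            }
        }
        // Rejoin the surviving sentences so they stay on the same line.
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String[] input = {
            "This is my first sentence. This is my first sentence.",
            "This is my first sentence.",
            "The snake is an animal. This is the second sentence. This is my third sentence.",
            "Another sentence. Another sentence with snake."
        };
        for (String line : input) {
            String filtered = filterLine(line, "snake");
            // Skip lines where every sentence was removed.
            if (!filtered.isEmpty()) {
                System.out.println(filtered);
            }
        }
    }
}
```

Run on the sample input from the question, this prints the four desired output lines. Inside the mapper the same helper would presumably be used as context.write(new Text(filterLine(value.toString(), "snake")), NullWritable.get()), with the textinputformat.record.delimiter setting removed so that records are lines again.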
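This and this other solution say that calling context.write(word, null); should be enough, but it does not work in my case.
Another problem is related to conf.set("textinputformat.record.delimiter", ".");. This is how I define the delimiter between sentences, and as a result some sentences in the output file start with a space (for example the second This is my first sentence.). As an alternative I tried conf.set("textinputformat.record.delimiter", ". "); (with a space after the period), but then the Java application does not write all of the sentences to the output file.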
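A plausible explanation for the missing sentences, simulated here in plain Java rather than run through Hadoop: the delimiter is matched literally, so with ". " a sentence that ends a line is followed by a newline, not a space, and is not split off. It is merged with the start of the next line, and if anything in that merged record contains "snake" the whole record is dropped. In the sketch below, `DelimiterDemo` and `records` are hypothetical names, and `String.split` on the quoted delimiter is only a stand-in for the record reader.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class DelimiterDemo {
    // Stand-in for the record reader: splits the text on a literal delimiter.
    public static List<String> records(String text, String delimiter) {
        List<String> out = new ArrayList<>();
        for (String record : text.split(Pattern.quote(delimiter))) {
            out.add(record);
        }
        return out;
    }

    public static void main(String[] args) {
        String text = "This is my first sentence.\n"
                + "The snake is an animal. This is the second sentence.\n";
        for (String record : records(text, ". ")) {
            // Show newlines explicitly so merged records are visible.
            String shown = record.replace("\n", "\\n");
            boolean dropped = record.contains("snake");
            System.out.println("[" + shown + "] dropped=" + dropped);
        }
    }
}
```

Here the first record is "This is my first sentence.\nThe snake is an animal", so a sentence without the word is discarded together with the one that contains it. With "." as the delimiter the records instead begin with the space or newline that followed the previous period, which matches the leading spaces described above; calling trim() on value.toString() before writing would likely remove them.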
【Comments】: