Posted: 2015-05-05 12:24:12
Question:
I'm playing with and learning Hadoop MapReduce.
I'm trying to map data from a VCF file (http://en.wikipedia.org/wiki/Variant_Call_Format): VCF is a tab-delimited format that starts with a (possibly large) header. This header is required to get the semantics of the records in the body.
I want to create a Mapper that consumes this data. The header must be accessible from the Mapper in order to decode the lines.
Following http://jayunit100.blogspot.fr/2013/07/hadoop-processing-headers-in-mappers.html, I created this InputFormat with a custom reader:
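To make the decoding step concrete: the last header line (`#CHROM POS ID ...`) names the columns, and each tab-separated body line is interpreted against those names. A minimal, self-contained sketch of that idea in plain Java (no Hadoop; the class and method names here are my own invention, not a real library):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/* Hypothetical minimal decoder: maps the column names from the
   "#CHROM ..." header line onto the tab-separated fields of a body line. */
class MiniVcfCodec {
    private final String[] columns;

    MiniVcfCodec(List<String> headerLines) {
        // the last header line ("#CHROM\tPOS\tID\t...") defines the columns
        String chromLine = headerLines.get(headerLines.size() - 1);
        this.columns = chromLine.substring(1).split("\t"); // drop leading '#'
    }

    Map<String, String> decode(String bodyLine) {
        String[] fields = bodyLine.split("\t");
        Map<String, String> record = new LinkedHashMap<>();
        for (int i = 0; i < columns.length && i < fields.length; i++) {
            record.put(columns[i], fields[i]);
        }
        return record;
    }
}

public class MiniVcfDemo {
    public static void main(String[] args) {
        List<String> header = Arrays.asList(
            "##fileformat=VCFv4.2",
            "#CHROM\tPOS\tID\tREF\tALT");
        MiniVcfCodec codec = new MiniVcfCodec(header);
        Map<String, String> rec = codec.decode("20\t14370\trs6054257\tG\tA");
        System.out.println(rec.get("CHROM") + ":" + rec.get("POS")); // 20:14370
    }
}
```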
public static class VcfInputFormat extends FileInputFormat<LongWritable, Text>
{
    /* the VCF header is stored here */
    private List<String> headerLines = new ArrayList<String>();

    @Override
    public RecordReader<LongWritable, Text> createRecordReader(InputSplit split,
            TaskAttemptContext context) throws IOException, InterruptedException {
        return new VcfRecordReader();
    }

    @Override
    protected boolean isSplitable(JobContext context, Path filename) {
        return false;
    }

    private class VcfRecordReader extends LineRecordReader
    {
        /* reads all lines starting with '#' */
        @Override
        public void initialize(InputSplit genericSplit,
                TaskAttemptContext context) throws IOException {
            super.initialize(genericSplit, context);
            /* fill the enclosing class's headerLines; note the original code
               shadowed the field with a local List, so it was never filled */
            while (super.nextKeyValue())
            {
                String row = super.getCurrentValue().toString();
                if (!row.startsWith("#")) throw new IOException("Bad VCF header");
                headerLines.add(row);
                if (row.startsWith("#CHROM")) break;
            }
        }
    }
}
Now, in the Mapper, is there a way to get a reference to VcfInputFormat.this.headerLines so I can decode the lines?
public static class VcfMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        my.VcfCodec codec = new my.VcfCodec(???????.headerLines);
        my.Variant variant = codec.decode(value.toString());
        //(....)
    }
}
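For context on why this is hard: the VcfInputFormat instance that fills headerLines does not run in the same JVM as the Mapper tasks, so an instance field cannot simply be shared. One common workaround (a sketch of the idea, not the author's code) is to serialize the header into the job Configuration on the driver side and read it back in Mapper.setup(). Below, a plain Map stands in for Hadoop's Configuration so the round-trip runs without a cluster; the key name "vcf.header" is my own invention:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class HeaderRoundTrip {
    public static void main(String[] args) {
        // Driver side: join the header lines into one string and store
        // them under a key (with Hadoop: conf.set("vcf.header", ...)).
        List<String> headerLines = Arrays.asList(
            "##fileformat=VCFv4.2",
            "#CHROM\tPOS\tID\tREF\tALT");
        Map<String, String> conf = new HashMap<>();
        conf.put("vcf.header", String.join("\n", headerLines));

        // Mapper side, in setup(): read the key back and split into lines
        // (with Hadoop: context.getConfiguration().get("vcf.header")).
        String[] restored = conf.get("vcf.header").split("\n");
        System.out.println(restored.length + " header lines restored");
    }
}
```

Note that a very large VCF header may make stuffing everything into the Configuration impractical; reading the header directly from the input file in setup() would be an alternative.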
Comments:
Tags: java hadoop mapreduce bioinformatics vcf-variant-call-format