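【Posted】: 2020-08-05 09:03:47
【Problem description】:
I have a large file containing patent information. The header is: "PATENT","GYEAR","GDATE","APPYEAR","COUNTRY","POSTATE","ASSIGNEE","ASSCODE","CLAIMS".
I want to compute the average number of claims per patent for each year, where the key is the year and the value is the average. However, the reducer output shows an average of 1.0 every time. Where does my program go wrong?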
Main class
public static void main(String [] args) throws Exception{
int res = ToolRunner.run(new Configuration(), new AvgClaimsByYear(), args);
System.exit(res);
}
Driver class
Configuration config = this.getConf();
Job job = Job.getInstance(config, "average claims per year");
job.setJarByClass(AvgClaimsByYear.class);
job.setMapperClass(TheMapper.class);
job.setPartitionerClass(ThePartitioner.class);
job.setNumReduceTasks(4);
job.setReducerClass(TheReducer.class);
job.setOutputKeyClass(IntWritable.class);
job.setOutputValueClass(IntWritable.class);
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
return job.waitForCompletion(true) ? 0 : 1;
Mapper class
public static class TheMapper extends Mapper<LongWritable, Text, IntWritable, IntWritable> {
private IntWritable yearAsKeyOut = new IntWritable();
private IntWritable claimsAsValueOut = new IntWritable(1);
@Override
public void map(LongWritable keyIn, Text valueIn, Context context) throws IOException,InterruptedException {
String line = valueIn.toString();
if(line.contains("PATENT")) {
return; //skip header
}
else {
String [] patentData = line.split(",");
yearAsKeyOut.set(Integer.parseInt(patentData[1]));
if (patentData[8].length() > 0) {
claimsAsValueOut.set(Integer.parseInt(patentData[8]));
}
}
context.write(yearAsKeyOut, claimsAsValueOut);
}
}
Partitioner class
public static class ThePartitioner extends Partitioner<IntWritable, IntWritable> {
public int getPartition(IntWritable keyIn, IntWritable valueIn, int totalNumPartition) {
int theYear = keyIn.get();
if (theYear <= 1970) {
return 0;
}
else if(theYear > 1970 && theYear <= 1979) {
return 1;
}
else if(theYear > 1979 && theYear <=1989) {
return 2;
}
else{
return 3;
}
}
}
Reducer class
public static class TheReducer extends Reducer<IntWritable,IntWritable,IntWritable,FloatWritable> {
@Override
public void reduce(IntWritable yearKey, Iterable<IntWritable> values, Context context) throws IOException,InterruptedException {
int totalClaimsThatYear = 0;
int totalPatentCountThatYear = 0;
FloatWritable avgClaim = new FloatWritable();
for(IntWritable value : values) {
totalClaimsThatYear += value.get();
totalPatentCountThatYear += 1;
}
avgClaim.set(calculateAvgClaimPerPatent (totalPatentCountThatYear, totalClaimsThatYear));
context.write(yearKey, avgClaim);
}
public float calculateAvgClaimPerPatent (int totalPatentCount, int totalClaims) {
return (float)totalClaims/totalPatentCount;
}
}
Input
3070801,1963,1096,,"BE","",,1,,269,6,69,,1,,0,,,,,,,
3070802,1963,1096,,"US","TX",,1,,2,6,63,,0,,,,,,,,,
3070803,1963,1096,,"US","IL",,1,,2,6,63,,9,,0.3704,,,,,,,
3070804,1963,1096,,"US","OH",,1,,2,6,63,,3,,0.6667,,,,,,,
3070805,1963,1096,,"US","CA",,1,,2,6,63,,1,,0,,,,,,,
Output
1963 1.0
1964 1.0
1965 1.0
1966 1.0
1967 1.0
1968 1.0
1969 1.0
1970 1.0
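One thing that can be checked directly against the sample rows is what the mapper actually reads at index 8. The sketch below is plain Java with no Hadoop dependency; the class name `ClaimsFieldCheck` and the helper `parseClaims` are hypothetical names introduced only for illustration. It assumes the column order from the header in the question (CLAIMS at index 8) and uses the five sample rows as posted. For each of them the CLAIMS field comes back empty, which would leave `claimsAsValueOut` at its initial value of 1 in the mapper above. Whether the rest of the file behaves the same way is not something the sample can confirm.

```java
import java.util.OptionalInt;

public class ClaimsFieldCheck {
    // Column positions taken from the header listed in the question.
    static final int GYEAR_INDEX = 1;
    static final int CLAIMS_INDEX = 8;

    /** Returns the CLAIMS value of one CSV row, or empty if the field is missing or blank. */
    static OptionalInt parseClaims(String line) {
        // A limit of -1 keeps trailing empty fields instead of dropping them.
        String[] fields = line.split(",", -1);
        if (fields.length <= CLAIMS_INDEX || fields[CLAIMS_INDEX].trim().isEmpty()) {
            return OptionalInt.empty();
        }
        return OptionalInt.of(Integer.parseInt(fields[CLAIMS_INDEX].trim()));
    }

    public static void main(String[] args) {
        // The five sample rows from the question, copied as posted.
        String[] sampleRows = {
            "3070801,1963,1096,,\"BE\",\"\",,1,,269,6,69,,1,,0,,,,,,,",
            "3070802,1963,1096,,\"US\",\"TX\",,1,,2,6,63,,0,,,,,,,,,",
            "3070803,1963,1096,,\"US\",\"IL\",,1,,2,6,63,,9,,0.3704,,,,,,,",
            "3070804,1963,1096,,\"US\",\"OH\",,1,,2,6,63,,3,,0.6667,,,,,,,",
            "3070805,1963,1096,,\"US\",\"CA\",,1,,2,6,63,,1,,0,,,,,,,"
        };
        for (String row : sampleRows) {
            String year = row.split(",", -1)[GYEAR_INDEX];
            System.out.println(year + " claims present: " + parseClaims(row).isPresent());
        }
    }
}
```

If this reading is right, one option is to skip the `context.write` call when the field is empty rather than emitting a default. That also avoids re-emitting a stale value, since the mapper reuses the same `IntWritable` instance across records.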
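【Discussion】:
-
Judging from the code, I think it computes the average number of claims per year, not the average number of claims per patent by year.
-
For simplicity, you could drop the custom partitioner. You could build a composite key of patent + year, with the claims as the value. You could create a separate key class if you like, but I think you can simply use string concatenation to build your "composite" key. Also, setting the combiner class to the reducer class will greatly improve overall performance. But from the look of the code, you are computing claims per year, not per patent per year.
-
Hi, I think the naming of the method is what is confusing. I don't understand why the reducer produces an average of 1.0. I have to use the partitioner to split the years into 4 folders.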
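A caveat on the combiner suggestion in the comments above: the reducer as posted takes `IntWritable` values and emits `FloatWritable`, so its output type does not match the map output type a combiner must produce. More generally, an average of partial averages is not the overall average. A common workaround is to carry a (sum, count) pair through the combine step and divide only at the end. The sketch below illustrates the arithmetic in plain Java; the class `AverageCombinerCheck`, the nested `SumCount` type, and the claim counts are all hypothetical and not taken from the question's data.

```java
public class AverageCombinerCheck {

    /** Partial aggregate that can be merged safely: a running sum and a count. */
    static final class SumCount {
        final long sum;
        final long count;

        SumCount(long sum, long count) {
            this.sum = sum;
            this.count = count;
        }

        static SumCount of(int[] values) {
            long total = 0;
            for (int v : values) {
                total += v;
            }
            return new SumCount(total, values.length);
        }

        SumCount merge(SumCount other) {
            return new SumCount(sum + other.sum, count + other.count);
        }

        double average() {
            return (double) sum / count;
        }
    }

    /** What a reducer would compute if the averaging reducer were reused as a combiner. */
    static double averageOfAverages(int[] splitA, int[] splitB) {
        return (SumCount.of(splitA).average() + SumCount.of(splitB).average()) / 2.0;
    }

    /** Merging (sum, count) pairs first, then dividing once at the end. */
    static double mergedAverage(int[] splitA, int[] splitB) {
        return SumCount.of(splitA).merge(SumCount.of(splitB)).average();
    }

    public static void main(String[] args) {
        // Hypothetical claim counts for one year, seen by two different map tasks.
        int[] splitA = {10, 20};
        int[] splitB = {30};
        System.out.println("average of averages: " + averageOfAverages(splitA, splitB));
        System.out.println("merged average:      " + mergedAverage(splitA, splitB));
    }
}
```

With these made-up numbers the two approaches disagree (22.5 versus 20.0). In a Hadoop job the pair would have to travel as the map output value, for example as a custom `Writable` or a delimited `Text`, which is a design choice this sketch leaves open.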