The last reducer is very slow in MapReduce

【Question title】: The last reducer is very slow in MapReduce
【Posted】: 2018-01-19 23:07:30
【Question】:

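The last reduce is very slow compared with the others. My job has 18784 map tasks and 1500 reduce tasks. The average time per reduce is about 1'26, but the last reduce takes about 2h. I tried changing the number of reduces and reducing the size of the job, but nothing changed.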

[Image: the last reduce]

As for my partitioner:

public int getPartition(Object key, Object value, int numPartitions) {
    // TODO Auto-generated method stub
    String keyStr = key.toString();
    int partId= String.valueOf(keyStr.hashCode()).hashCode();
    partId = Math.abs(partId % numPartitions);
    partId = Math.max(partId, 0);
    return partId;
    //return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
}

【Comments】:

Tags: hadoop reduce

【Answer 1】:

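When you process a large amount of data, you should set a Combiner class. If you also want to change the output encoding, do that in the Reduce function rather than in the Combiner. For example: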

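import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class GramModelReducer extends Reducer<Text, LongWritable, Text, LongWritable> {

    private final LongWritable result = new LongWritable();

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        // Sum the (possibly pre-combined) counts for this key.
        long sum = 0;
        for (LongWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        // Re-encode the key as GB18030 bytes only in the final output.
        context.write(new Text(key.toString().getBytes("GB18030")), result);
    }
}

// Combiner: same summation, but the key is passed through unchanged.
class GramModelCombiner extends Reducer<Text, LongWritable, Text, LongWritable> {

    @Override
    public void reduce(Text key, Iterable<LongWritable> values, Context context)
            throws IOException, InterruptedException {
        long sum = 0;
        for (LongWritable val : values) {
            sum += val.get();
        }
        context.write(key, new LongWritable(sum));
    }
}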
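
To illustrate why a combiner helps with a heavily repeated key, here is a minimal plain-Java sketch (no Hadoop dependency). The class name `CombinerEffect` and the sample data are hypothetical, and the sketch assumes the combiner runs exactly once per map task, which Hadoop does not guarantee. It only counts how many records would be shuffled to the reducers with and without local pre-aggregation.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerEffect {

    /** Local pre-aggregation: sums a count of 1 per occurrence, per key. */
    static Map<String, Long> combine(List<String> mapOutputKeys) {
        Map<String, Long> partial = new HashMap<>();
        for (String key : mapOutputKeys) {
            partial.merge(key, 1L, Long::sum);
        }
        return partial;
    }

    /** Records shuffled when every map output record is sent as-is. */
    static long shuffledWithoutCombiner(List<List<String>> mapTasks) {
        long records = 0;
        for (List<String> task : mapTasks) {
            records += task.size();
        }
        return records;
    }

    /** Records shuffled when each map task's output is combined once (assumption). */
    static long shuffledWithCombiner(List<List<String>> mapTasks) {
        long records = 0;
        for (List<String> task : mapTasks) {
            records += combine(task).size();
        }
        return records;
    }

    public static void main(String[] args) {
        // Hypothetical data: two map tasks, both dominated by the key "the".
        List<List<String>> mapTasks = Arrays.asList(
                Arrays.asList("the", "the", "the", "a"),
                Arrays.asList("the", "the"));
        System.out.println("without combiner: " + shuffledWithoutCombiner(mapTasks));
        System.out.println("with combiner:    " + shuffledWithCombiner(mapTasks));
    }
}
```

With a combiner, the reducer that owns a hot key receives at most one partial sum per map task instead of one record per occurrence. In a real job the combiner only takes effect once it is registered on the job with `setCombinerClass`.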

【Comments】:

【Answer 2】:

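You are most likely facing a data skew problem.

Either your keys are not well distributed, or your getPartition is causing the problem. It is not clear to me why you create a string from the hash code of a string and then take the hash code of that new string. My suggestion is to first try the default partitioner and then look at how your keys are distributed.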
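
One way to look at the distribution offline is to count how many records each partition would receive for a sample of keys. The following is a rough plain-Java sketch, not a Hadoop job: the class name `PartitionHistogram` and the sample keys are hypothetical, and it uses `String.hashCode()`, whereas a Hadoop `Text` key hashes its bytes, so the actual partition numbers in a job would differ. `defaultPartition` uses the same formula as the commented-out line in the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToIntBiFunction;

public class PartitionHistogram {

    /** Formula from the commented-out line in the question (default hash partitioning). */
    static int defaultPartition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    /** The double-hash partitioner from the question, reproduced for comparison. */
    static int questionPartition(String key, int numPartitions) {
        int partId = String.valueOf(key.hashCode()).hashCode();
        return Math.abs(partId % numPartitions);
    }

    /** Counts records (not distinct keys) per partition. */
    static long[] histogram(List<String> recordKeys, int numPartitions,
                            ToIntBiFunction<String, Integer> partitioner) {
        long[] counts = new long[numPartitions];
        for (String key : recordKeys) {
            counts[partitioner.applyAsInt(key, numPartitions)]++;
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical sample: one very frequent key plus a few rare ones.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            keys.add("the");
        }
        keys.addAll(Arrays.asList("alpha", "beta", "gamma", "delta"));

        System.out.println("default:  "
                + Arrays.toString(histogram(keys, 4, PartitionHistogram::defaultPartition)));
        System.out.println("question: "
                + Arrays.toString(histogram(keys, 4, PartitionHistogram::questionPartition)));
    }
}
```

If one partition dominates under both functions, the skew comes from the key frequencies themselves rather than from the partitioner, because every record with the same key always goes to the same reducer.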

【Comments】:

【Answer 3】:

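I had a similar experience. In my case it happened because a single reduce was processing almost all of the data, which was caused by data skew. Compare the counters of the reducers that have already finished with those of the reducer that is taking a long time; you will probably see that the slow reducer is processing far more data.

You might want to look into this: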

Hadoop handling data skew in reducer
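
The counter comparison described above can be sketched in a few lines of plain Java. This assumes you have already copied each reducer's input record count (for example from the job's web UI) into an array; the class name `SkewReport`, the sample numbers, and the "5 times the median" threshold are all hypothetical choices for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SkewReport {

    /** Median of the per-reducer record counts. */
    static double median(long[] values) {
        long[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Indexes of reducers whose input exceeds factor times the median. */
    static List<Integer> skewedReducers(long[] inputRecords, double factor) {
        List<Integer> skewed = new ArrayList<>();
        if (inputRecords.length == 0) {
            return skewed;
        }
        double threshold = median(inputRecords) * factor;
        for (int i = 0; i < inputRecords.length; i++) {
            if (inputRecords[i] > threshold) {
                skewed.add(i);
            }
        }
        return skewed;
    }

    public static void main(String[] args) {
        // Hypothetical counts: four ordinary reducers and one overloaded one.
        long[] inputRecords = {100, 120, 110, 90, 9000};
        System.out.println("skewed reducers: " + skewedReducers(inputRecords, 5.0));
    }
}
```

The median is used instead of the mean so that a single overloaded reducer does not raise the baseline it is compared against.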

【Comments】:

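Thanks. But when I reduced the data size by about 10% and changed my partitioner, I got the same result. The last reduce is still very slow.
Did you check how much data it processed? Does it process more data than the other reducers?
Thanks. I found the cause: I forgot to call setCombinerClass.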