【问题标题】：Mapreduce Job -Taking too long to completeMapreduce 作业 - 完成时间太长
【发布时间】：2016-06-21 01:28:51
【问题描述】：

我们编写了一个 mapreduce 作业来处理日志文件。截至目前，我们有大约 52GB 的输入文件，但处理数据大约需要一个小时。默认情况下，它只创建一个 reducer 作业。通常我们会在 reduce 任务中看到超时错误，然后它会重新启动并完成.以下是成功完成工作的统计数据。请告诉我们如何提高性能。

File System Counters
            FILE: Number of bytes read=876100387
            FILE: Number of bytes written=1767603407
            FILE: Number of read operations=0
            FILE: Number of large read operations=0
            FILE: Number of write operations=0
            HDFS: Number of bytes read=52222279591
            HDFS: Number of bytes written=707429882
            HDFS: Number of read operations=351
            HDFS: Number of large read operations=0
            HDFS: Number of write operations=2
    Job Counters 
            Failed reduce tasks=1
            Launched map tasks=116
            Launched reduce tasks=2
            Other local map tasks=116
            Total time spent by all maps in occupied slots (ms)=9118125
            Total time spent by all reduces in occupied slots (ms)=7083783
            Total time spent by all map tasks (ms)=3039375
            Total time spent by all reduce tasks (ms)=2361261
            Total vcore-seconds taken by all map tasks=3039375
            Total vcore-seconds taken by all reduce tasks=2361261
            Total megabyte-seconds taken by all map tasks=25676640000
            Total megabyte-seconds taken by all reduce tasks=20552415744
    Map-Reduce Framework
            Map input records=49452982
            Map output records=5730971
            Map output bytes=864140911
            Map output materialized bytes=876101077
            Input split bytes=13922
            Combine input records=0
            Combine output records=0
            Reduce input groups=1082133
            Reduce shuffle bytes=876101077
            Reduce input records=5730971
            Reduce output records=5730971
            Spilled Records=11461942
            Shuffled Maps =116
            Failed Shuffles=0
            Merged Map outputs=116
            GC time elapsed (ms)=190633
            CPU time spent (ms)=4536110
            Physical memory (bytes) snapshot=340458307584
            Virtual memory (bytes) snapshot=1082745069568
            Total committed heap usage (bytes)=378565820416
    Shuffle Errors
            BAD_ID=0
            CONNECTION=0
            IO_ERROR=0
            WRONG_LENGTH=0
            WRONG_MAP=0
            WRONG_REDUCE=0
    File Input Format Counters 
            Bytes Read=52222265669
    File Output Format Counters 
            Bytes Written=707429882

如果我增加减速器的数量，我会收到如下的类转换异常。我猜这个问题来自分区器类。

java.lang.Exception: java.lang.ClassCastException: com.emaar.bigdata.exchg.logs.CompositeWritable cannot be cast to org.apache.hadoop.io.Text
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:522)
Caused by: java.lang.ClassCastException: com.emaar.bigdata.exchg.logs.CompositeWritable cannot be cast to org.apache.hadoop.io.Text
    at com.emaar.bigdata.exchg.logs.ActualKeyPartitioner.getPartition(ActualKeyPartitioner.java:1)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:716)
    at org.apache.hadoop.mapreduce.task.TaskInputOutputContextImpl.write(TaskInputOutputContextImpl.java:89)
    at org.apache.hadoop.mapreduce.lib.map.WrappedMapper$Context.write(WrappedMapper.java:112)
    at com.emaar.bigdata.exchg.logs.ExchgLogsMapper.map(ExchgLogsMapper.java:56)
    at com.emaar.bigdata.exchg.logs.ExchgLogsMapper.map(ExchgLogsMapper.java:1)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)

我的分区器类

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class ActualKeyPartitioner extends Partitioner<CompositeKey, Text> {

    HashPartitioner<Text, Text> hashPartitioner = new HashPartitioner<Text, Text>();
    Text newKey = new Text();

    @Override
    public int getPartition(CompositeKey key, Text value, int numReduceTasks) {

        try {
            // Execute the default partitioner over the first part of the key
            newKey.set(key.getSubject());
            return hashPartitioner.getPartition(newKey, value, numReduceTasks);
        } catch (Exception e) {
            e.printStackTrace();
            return (int) (Math.random() * numReduceTasks); // this would return
                                                            // a random value in
                                                            // the range
            // [0,numReduceTasks)
        }
    }
}

映射器代码

public class ExchgLogsMapper extends Mapper<LongWritable, List<Text>, CompositeKey, Writable> {
    String recepientAddresses = "";
    public static final String DELIVER = "DELIVER";
    public static final String RESOLVED = "Resolved";
    public static final String JUNK = "Junk E-mail";
    public static final String SEMICOLON = ";";
    public static final String FW1 = "FW: ";
    public static final String FW2 = "Fw: ";
    public static final String FW3 = "FWD: ";
    public static final String FW4 = "Fwd: ";
    public static final String FW5 = "fwd: ";
    public static final String RE1 = "RE: ";
    public static final String RE2 = "Re: ";
    public static final String RE3 = "re: ";


    Text mailType = new Text("NEW");
    Text fwType = new Text("FW");
    Text reType = new Text("RE");
    Text recepientAddr = new Text();

    @Override
    public void map(LongWritable key, List<Text> values, Context context) throws IOException, InterruptedException {
        String subj = null;
        int lstSize=values.size() ;
        if ((lstSize >= 26)) {
            if (values.get(8).toString().equals(DELIVER)) {
                if (!(ExclusionList.exclusions.contains(values.get(18).toString()))) {
                    if (!(JUNK.equals((values.get(12).toString())))) {
                        subj = values.get(17).toString();
                        recepientAddresses = values.get(11).toString();
                        String[] recepientAddressArr = recepientAddresses.split(SEMICOLON);
                        if (subj.startsWith(FW1) || subj.startsWith(FW2) || subj.startsWith(FW3)
                                || subj.startsWith(FW4) || subj.startsWith(FW5)) {
                            mailType = fwType;
                            subj = subj.substring(4);
                        } else if (subj.startsWith(RE1) || subj.startsWith(RE2) || subj.startsWith(RE3)) {
                            mailType = reType;
                            subj = subj.substring(4);
                        }
                        for (int i = 0; i < recepientAddressArr.length; i++) {
                            CompositeKey ckey = new CompositeKey(subj, values.get(0).toString());
                            recepientAddr.set(recepientAddressArr[i]);
                            CompositeWritable out = new CompositeWritable(mailType, recepientAddr, values.get(18),
                                    values.get(0));
                            context.write(ckey, out);
//                          System.err.println(out);

                        }
                    }
                }
            }
        }

【问题讨论】：

是什么阻止你增加减速器的数量？
我得到一个类转换异常，好像我增加了减速器的数量。我猜这个问题来自分区器类。我已经用更多细节编辑了这个问题
key.getSubject 是文本吗？你能试试 return (key.getSubject.hashCode() & Integer.MAX_VALUE) % numReduceTasks;而不是创建哈希分区器实例。这是 HashPartitioner 实现的代码。
ExchgLogsMapper com.emaar.bigdata.exchg.logs.ExchgLogsMapper.map(ExchgLogsMapper.java:56) 的第 56 行是什么
@ Arun... 第 56 行正在将输出写入上下文。 context.write(ckey, out);

标签： performance hadoop mapreduce

【解决方案1】：

循环内的 reducer 代码中几乎没有 sysout，它正在写入大量日志，删除它们后，reducer 会在几分钟内完成。！

【讨论】：