【Question Title】: mapreduce matrix multiplication with hadoop
【Posted】: 2012-10-27 15:45:17
【Question】:

I am trying to run the matrix multiplication example (with source code) described at the following link:

http://www.norstad.org/matrix-multiply/index.html

I set up Hadoop in pseudo-distributed mode and configured it using this tutorial:

http://hadoop-tutorial.blogspot.com/2010/11/running-hadoop-in-pseudo-distributed.html?showComment=1321528406255#c3661776111033973764

When I run my jar file, I get the following error:

Identity test
11/11/30 10:37:34 INFO input.FileInputFormat: Total input paths to process : 2
11/11/30 10:37:34 INFO mapred.JobClient: Running job: job_201111291041_0010
11/11/30 10:37:35 INFO mapred.JobClient:  map 0% reduce 0%
11/11/30 10:37:44 INFO mapred.JobClient:  map 100% reduce 0%
11/11/30 10:37:56 INFO mapred.JobClient:  map 100% reduce 100%
11/11/30 10:37:58 INFO mapred.JobClient: Job complete: job_201111291041_0010
11/11/30 10:37:58 INFO mapred.JobClient: Counters: 17
11/11/30 10:37:58 INFO mapred.JobClient:   Job Counters
11/11/30 10:37:58 INFO mapred.JobClient:     Launched reduce tasks=1
11/11/30 10:37:58 INFO mapred.JobClient:     Launched map tasks=2
11/11/30 10:37:58 INFO mapred.JobClient:     Data-local map tasks=2
11/11/30 10:37:58 INFO mapred.JobClient:   FileSystemCounters
11/11/30 10:37:58 INFO mapred.JobClient:     FILE_BYTES_READ=114
11/11/30 10:37:58 INFO mapred.JobClient:     HDFS_BYTES_READ=248
11/11/30 10:37:58 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=298
11/11/30 10:37:58 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=124
11/11/30 10:37:58 INFO mapred.JobClient:   Map-Reduce Framework
11/11/30 10:37:58 INFO mapred.JobClient:     Reduce input groups=2
11/11/30 10:37:58 INFO mapred.JobClient:     Combine output records=0
11/11/30 10:37:58 INFO mapred.JobClient:     Map input records=4
11/11/30 10:37:58 INFO mapred.JobClient:     Reduce shuffle bytes=60
11/11/30 10:37:58 INFO mapred.JobClient:     Reduce output records=2
11/11/30 10:37:58 INFO mapred.JobClient:     Spilled Records=8
11/11/30 10:37:58 INFO mapred.JobClient:     Map output bytes=100
11/11/30 10:37:58 INFO mapred.JobClient:     Combine input records=0
11/11/30 10:37:58 INFO mapred.JobClient:     Map output records=4
11/11/30 10:37:58 INFO mapred.JobClient:     Reduce input records=4
11/11/30 10:37:58 INFO input.FileInputFormat: Total input paths to process : 1
11/11/30 10:37:59 INFO mapred.JobClient: Running job: job_201111291041_0011
11/11/30 10:38:00 INFO mapred.JobClient:  map 0% reduce 0%
11/11/30 10:38:09 INFO mapred.JobClient:  map 100% reduce 0%
11/11/30 10:38:21 INFO mapred.JobClient:  map 100% reduce 100%
11/11/30 10:38:23 INFO mapred.JobClient: Job complete: job_201111291041_0011
11/11/30 10:38:23 INFO mapred.JobClient: Counters: 17
11/11/30 10:38:23 INFO mapred.JobClient:   Job Counters
11/11/30 10:38:23 INFO mapred.JobClient:     Launched reduce tasks=1
11/11/30 10:38:23 INFO mapred.JobClient:     Launched map tasks=1
11/11/30 10:38:23 INFO mapred.JobClient:     Data-local map tasks=1
11/11/30 10:38:23 INFO mapred.JobClient:   FileSystemCounters
11/11/30 10:38:23 INFO mapred.JobClient:     FILE_BYTES_READ=34
11/11/30 10:38:23 INFO mapred.JobClient:     HDFS_BYTES_READ=124
11/11/30 10:38:23 INFO mapred.JobClient:     FILE_BYTES_WRITTEN=100
11/11/30 10:38:23 INFO mapred.JobClient:     HDFS_BYTES_WRITTEN=124
11/11/30 10:38:23 INFO mapred.JobClient:   Map-Reduce Framework
11/11/30 10:38:23 INFO mapred.JobClient:     Reduce input groups=2
11/11/30 10:38:23 INFO mapred.JobClient:     Combine output records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Map input records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Reduce shuffle bytes=0
11/11/30 10:38:23 INFO mapred.JobClient:     Reduce output records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Spilled Records=4
11/11/30 10:38:23 INFO mapred.JobClient:     Map output bytes=24
11/11/30 10:38:23 INFO mapred.JobClient:     Combine input records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Map output records=2
11/11/30 10:38:23 INFO mapred.JobClient:     Reduce input records=2
Exception in thread "main" java.io.IOException: Cannot open filename /tmp/Matrix Multiply/out/_logs
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:1497)
        at org.apache.hadoop.hdfs.DFSClient$DFSInputStream.<init>(DFSClient.java:1488)
        at org.apache.hadoop.hdfs.DFSClient.open(DFSClient.java:376)
        at org.apache.hadoop.hdfs.DistributedFileSystem.open(DistributedFileSystem.java:178)
        at org.apache.hadoop.io.SequenceFile$Reader.openFile(SequenceFile.java:1437)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1424)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1417)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1412)
        at TestMatrixMultiply.fillMatrix(TestMatrixMultiply.java:62)
        at TestMatrixMultiply.readMatrix(TestMatrixMultiply.java:84)
        at TestMatrixMultiply.checkAnswer(TestMatrixMultiply.java:108)
        at TestMatrixMultiply.runOneTest(TestMatrixMultiply.java:144)
        at TestMatrixMultiply.testIdentity(TestMatrixMultiply.java:156)
        at TestMatrixMultiply.main(TestMatrixMultiply.java:258)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:601)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:156)

Can someone tell me what I am doing wrong? Thanks.

【Discussion】:

Tags: hadoop mapreduce


【Solution 1】:

The test harness is trying to read the job output. When you submit the job to a cluster, Hadoop adds this _logs directory inside the output directory. Since a directory is not a sequence file, it cannot be read.

You have to change the code that reads the output so that it skips directories.

I have written the same thing in my own code:

    FileStatus[] stati = fs.listStatus(output);
    for (FileStatus status : stati) {
        if (!status.isDir()) {
            Path path = status.getPath();
            // HERE IS THE READ CODE FROM YOUR EXAMPLE
        }
    }
    

    http://code.google.com/p/hama-shortest-paths/source/browse/trunk/hama-gsoc/src/de/jungblut/clustering/mapreduce/KMeansClusteringJob.java#127
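    The skip-directories pattern can be tried out without a cluster. Below is a minimal self-contained sketch that uses plain java.io.File in place of Hadoop's FileSystem/FileStatus, just to illustrate the filtering; the part-r-00000/_logs layout is an assumed stand-in for a real job output directory:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.ArrayList;
import java.util.List;

public class SkipLogsDemo {
    // Collect only regular files from a job output directory,
    // skipping subdirectories such as _logs.
    static List<String> readableOutputs(File outputDir) {
        List<String> names = new ArrayList<>();
        for (File f : outputDir.listFiles()) {
            if (!f.isDirectory()) {   // same check as !status.isDir() above
                names.add(f.getName());
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a job output directory: one part file plus a _logs dir.
        File out = Files.createTempDirectory("out").toFile();
        new File(out, "part-r-00000").createNewFile();
        new File(out, "_logs").mkdir();

        System.out.println(readableOutputs(out)); // only the part file
    }
}
```

    On HDFS the same filter is expressed with `fs.listStatus(...)` and `!status.isDir()`, as in the snippet above.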

【Comments】:

    • Thanks Thomas. That was my guess too. Regarding the matrix multiplication example I mentioned in the question, do you have any idea why it works fine in Hadoop standalone mode but fails in distributed mode when checking the answer?
    • Because when running in distributed mode, the _logs directory is stored in HDFS, whereas in pseudo-distributed/standalone mode it is stored in the tasktracker logs.
【Solution 2】:

    This may be a naive suggestion, but you may want to change the log path /tmp/Matrix Multiply/out/_logs. The space in the directory name may not be handled automatically; I assume you are using Linux.
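    For what it's worth, spaces matter when a path is typed at a shell prompt, but Java's own file APIs handle them without escaping. A quick local check (plain java.io rather than HDFS, so this only illustrates the local-filesystem side):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class SpacePathDemo {
    // Create and resolve a path containing a space, like the
    // example's "/tmp/Matrix Multiply/out".
    static boolean spacePathWorks(File base) throws IOException {
        File dir = new File(base, "Matrix Multiply/out");
        dir.mkdirs();                        // no escaping needed in Java
        File part = new File(dir, "part-r-00000");
        part.createNewFile();
        return part.exists();                // true if the space caused no trouble
    }

    public static void main(String[] args) throws IOException {
        File tmp = Files.createTempDirectory("demo").toFile();
        System.out.println(spacePathWorks(tmp)); // prints "true"
    }
}
```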

【Comments】:

    • Thanks Amit. Yes, I am using Linux. The problem is that all these log files and paths are defined by the program itself, so do you still think it could be the same issue you describe? Since I am new to this area, please correct me if I am wrong.
【Solution 3】:

    There are two problems in TestMatrixMultiply.java:

    1. As Thomas Jungblut said, _logs should be excluded in the readMatrix() method. I changed the code like this:

      if (fs.isFile(path)) {
          fillMatrix(result, path);
      } else {
          FileStatus[] fileStatusArray = fs.listStatus(path);
          for (FileStatus fileStatus : fileStatusArray) {
              if (!fileStatus.isDir())  // this line is added by me
                  fillMatrix(result, fileStatus.getPath());
          }
      }
      
    2. At the end of the main() method, fs.delete should be commented out; otherwise the output directory is deleted immediately after each MapReduce job completes.

      finally {
          //fs.delete(new Path(DATA_DIR_PATH), true);
      }
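    As an alternative to commenting the cleanup out entirely, a common pattern is to delete stale output before the job runs rather than after, so the results stay on disk for inspection. Here is a plain-Java sketch of that ordering (on HDFS, FileSystem.delete(path, true) plays the same role; the runJob body is only a stand-in for an actual job):

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class CleanBeforeRun {
    // Recursively delete a directory, mirroring fs.delete(path, true).
    static void deleteRecursively(File f) {
        if (f.isDirectory()) {
            for (File child : f.listFiles()) deleteRecursively(child);
        }
        f.delete();
    }

    // Delete leftover output BEFORE the job runs, so results are
    // still on disk for inspection after it completes.
    static void runJob(File outputDir) throws IOException {
        deleteRecursively(outputDir);        // pre-run cleanup
        outputDir.mkdirs();
        // Stand-in for the job writing its output:
        new File(outputDir, "part-r-00000").createNewFile();
    }

    public static void main(String[] args) throws IOException {
        File out = new File(Files.createTempDirectory("demo").toFile(), "out");
        runJob(out);
        System.out.println(new File(out, "part-r-00000").exists()); // prints "true"
    }
}
```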
      

【Comments】:
