【Question Title】: Hadoop Streaming Python Trivial Example Not working
【Posted】: 2013-10-07 19:54:17
【Question】:

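I have an input file that looks like this, already uploaded to HDFS at /tmp/input. Fields are delimited by ^A, a non-printing character (shown here as it appears in vi):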

A^A10
A^A7
A^A10
A^A5
A^A10
A^A8
B^A1
A^A9
B^A1
A^A9
B^A1
A^A9
B^A1    
A^A9
B^A1
A^A9
B^A1
A^A9

The mapper I wrote looks like this:

import sys
for line in sys.stdin:
    name, score = line.strip().split(chr(1))
    print '\t'.join([name, str(int(score)+1)])

The reducer looks similar to this:

import sys
from datetime import datetime

def calc(inputList):
    return min(inputList)

def main():
    current_key = None
    value_list = []
    key = None
    value = None
    result = None
    for line in sys.stdin:
        try:
            line = line.strip()
            key, value = line.split('\t', 1)

            try:
                value = eval(value)
            except:
                continue
            if current_key == key:
                value_list.append(value)
            else:
                if current_key:
                    try:
                        result = str(calc(value_list))
                    except:
                        pass
                    print '%s\t%s' % (current_key, result )
                value_list = [value]
                current_key = key
        except:
            pass
    print '%s\t%s' % (current_key, str(calc(value_list)))

if __name__ == '__main__':
    main()

I tested the mapper and reducer in the shell, and they work for me:

$ cat input | python mapper.py | sort -t$'\t' -k1 | python reducer.py 
A   6
B   2

But I could not get it to work with Hadoop Streaming:

/usr/bin/hadoop \
jar /opt/cloudera/parcels/CDH-4.3.0-1.cdh4.3.0.p0.22/lib/hadoop-0.20-mapreduce/contrib/streaming/hadoop-streaming-2.0.0-mr1-cdh4.3.0.jar \
-file mapper.py \
-mapper mapper.py \
-file reducer.py \
-reducer reducer.py \
-input /tmp/input \
-output /tmp/output

The error output looks like this:

13/10/07 15:59:02 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/10/07 15:59:02 INFO mapred.FileInputFormat: Total input paths to process : 1
13/10/07 15:59:02 INFO streaming.StreamJob: getLocalDirs(): [/tmp/hadoop-a59347/mapred/local]
13/10/07 15:59:02 INFO streaming.StreamJob: Running job: job_201309301959_0089
13/10/07 15:59:02 INFO streaming.StreamJob: To kill this job, run:
13/10/07 15:59:02 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=url1:8021 -kill job_201309301959_0089
13/10/07 15:59:02 INFO streaming.StreamJob: Tracking URL: http://url1:50030/jobdetails.jsp?jobid=job_201309301959_0089
13/10/07 15:59:03 INFO streaming.StreamJob:  map 0%  reduce 0%
13/10/07 15:59:10 INFO streaming.StreamJob:  map 50%  reduce 0%
13/10/07 16:00:10 INFO streaming.StreamJob:  map 100%  reduce 0%
13/10/07 16:00:26 INFO streaming.StreamJob:  map 100%  reduce 1%
13/10/07 16:00:32 INFO streaming.StreamJob:  map 100%  reduce 2%
13/10/07 16:00:37 INFO streaming.StreamJob:  map 100%  reduce 100%
13/10/07 16:00:37 INFO streaming.StreamJob: To kill this job, run:
13/10/07 16:00:37 INFO streaming.StreamJob: UNDEF/bin/hadoop job  -Dmapred.job.tracker=url1:8021 -kill job_201309301959_0089
13/10/07 16:00:37 INFO streaming.StreamJob: Tracking URL: http://url1:50030/jobdetails.jsp?jobid=job_201309301959_0089
13/10/07 16:00:37 ERROR streaming.StreamJob: Job not successful. Error: NA
13/10/07 16:00:37 INFO streaming.StreamJob: killJob...
Streaming Command Failed!

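Any idea what I am doing wrong?

【Comments】:

  • How does it fail? Can you post the output printed on screen when you issue the /usr/bin/hadoop jar ... command?
  • @cabad Thanks for the reminder. Is this what you need?

Tags: python hadoop hdfs hadoop-streaming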


【Solution 1】:

The Hadoop framework does not know how to run your mapper and reducer. There are two possible fixes:

Fix 1: Invoke python explicitly.

-mapper "python mapper.py" -reducer "python reducer.py"

Fix 2: Tell Hadoop where to find the Python interpreter. To do so, state its location explicitly on the first line of each *.py file. For example:

#!/usr/bin/env python

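Note, however, that python is not always in /usr/bin, which is why the line above goes through env instead of hard-coding /usr/bin/python (see copumpkin's comment below).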
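
As a minimal sketch of Fix 2, the mapper from the question could start with the interpreter line as shown here. The helper name map_line and the blank-line check are assumptions added for illustration and are not part of the original script. The output is written with sys.stdout.write so the same file runs under Python 2 and Python 3.

```python
#!/usr/bin/env python
# Sketch of mapper.py with an interpreter line on the first line (Fix 2).
# map_line is a hypothetical helper name, not part of the original script.
import sys

SEP = chr(1)  # the ^A field separator used by the input file


def map_line(line):
    """Turn 'name^Ascore' into 'name<TAB>score+1', as the original mapper does."""
    name, score = line.strip().split(SEP)
    return '\t'.join([name, str(int(score) + 1)])


def main():
    for line in sys.stdin:
        if line.strip():  # assumption: skip blank lines instead of failing on them
            sys.stdout.write(map_line(line) + '\n')


if __name__ == '__main__':
    main()
```

For the interpreter line to take effect when a script is launched directly, the file generally also needs its executable bit set, for example with chmod +x mapper.py.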

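【Comments】:

  • Still the same result; it does not work. I did not see this on the Hadoop Streaming wiki page.
  • @cabad - What is the difference between #!/usr/bin/env python and #!/usr/bin/python? Why use env?
  • @CJBS python is not always in /usr/bin, so using env looks it up in $PATH and allows it to live elsewhere. For example, sometimes it is in /usr/local/bin/, or somewhere even more unusual such as /nix/store/12r38kqsdlgn9h1k49l43hzhjgrnkaxx-python-2.7.15/bin/! Why hard-code it when you can look it up?
  • @copumpkin Added a note about python's location.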
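
The $PATH lookup described in the comments can be approximated in Python with shutil.which (available since Python 3.3), which searches the directories on the search path in order. This only illustrates the idea and is not how env itself is implemented; the helper name resolve_interpreter is hypothetical.

```python
import os
import shutil
import sys


def resolve_interpreter(name="python", path=None):
    """Return the first executable called `name` on the search path, or None."""
    # With path=None, shutil.which searches the directories in os.environ["PATH"].
    return shutil.which(name, path=path)


if __name__ == "__main__":
    found = resolve_interpreter()
    print("python found on PATH at:", found)
    print("interpreter running this script:", sys.executable)
```

If the first line printed is None, no executable named python is on the search path, and a script relying on #!/usr/bin/env python would fail to launch in that environment as well.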