【Question title】: Hadoop streaming job fails in Python
【Posted】: 2011-04-22 19:53:03
【Question description】:

I am trying to implement an algorithm in Hadoop. I tried to run part of the code in Hadoop, but the streaming job fails.

$ /home/hadoop/hadoop/bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar -file /home/hadoop/hadoop/PR/mapper.py -mapper mapper.py -file /home/hadoop/hadoop/PR/reducer.py -reducer reducer.py -input pagerank/* -output PRoutput6

packageJobJar: [/home/hadoop/hadoop/PR/mapper.py, /home/hadoop/hadoop/PR/reducer.py, /home/hadoop/hadoop/tmp/dir/hadoop-hadoop/hadoop-unjar7101759175212283428/] [] /tmp/streamjob6286075675343269479.jar tmpDir=null

11/04/23 01:03:24 INFO mapred.FileInputFormat: Total input paths to process : 1

11/04/23 01:03:24 INFO streaming.StreamJob: getLocalDirs(): [/home/hadoop/hadoop/tmp/dir/hadoop-hadoop/mapred/local]

11/04/23 01:03:24 INFO streaming.StreamJob: Running job: job_201104222325_0021

11/04/23 01:03:24 INFO streaming.StreamJob: To kill this job, run:

11/04/23 01:03:24 INFO streaming.StreamJob: /home/hadoop/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201104222325_0021

11/04/23 01:03:24 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201104222325_0021

11/04/23 01:03:25 INFO streaming.StreamJob:  map 0%  reduce 0%

11/04/23 01:03:31 INFO streaming.StreamJob:  map 50%  reduce 0%

11/04/23 01:03:41 INFO streaming.StreamJob:  map 50%  reduce 17%

11/04/23 01:03:56 INFO streaming.StreamJob:  map 100%  reduce 100%

11/04/23 01:03:56 INFO streaming.StreamJob: To kill this job, run:

11/04/23 01:03:56 INFO streaming.StreamJob: /home/hadoop/hadoop/bin/../bin/hadoop job  -Dmapred.job.tracker=localhost:54311 -kill job_201104222325_0021

11/04/23 01:03:56 INFO streaming.StreamJob: Tracking URL: http://localhost:50030/jobdetails.jsp?jobid=job_201104222325_0021

11/04/23 01:03:56 ERROR streaming.StreamJob: Job not Successful!

11/04/23 01:03:56 INFO streaming.StreamJob: killJob...

Streaming Job Failed!

mapper.py

#!/usr/bin/env python
import sys
import itertools

def ipsum(input_key,input_value_list):
   return sum(input_value_list)

n= 20 # works up to about 1000000 pages
i = {}
for j in xrange(n): i[j] = [1.0/n,0,[]]
j=0
u=0
for line in sys.stdin:
  if j<n:
    i[j][1]=int(line)
  j=j+1

  if j > n: 
    if line != "-1\n":
      i[u][2] = line.split(',')
    else: 
      i[u][2]=[]
    u=u+1
for j in xrange(n):
  if i[j][1] != 0:
    i[j][2] = map(int,i[j][2])    

intermediate=[]
for (input_key,input_value) in i.items():
  if input_value[1] == 0: intermediate.extend([(1,input_value[0])])
  else: intermediate.extend([])
grp = {}
for key, group in itertools.groupby(sorted(intermediate),lambda x: x[0]):
  grp[key] = list([y for x, y in group])
iplist = [ipsum(intermediate_key,grp[intermediate_key]) for intermediate_key in grp]
inter=[]
for (input_key,input_value) in i.items():
  if input_value[1] == 0: inter.extend([(input_key,0.0)]+[(outlink,input_value[0]/input_value[1]) for outlink in input_value[2]])
  else: inter.extend([])

for value in inter:
  value1 = value[0]
  value2 = value[1]
  print '%s %s' % (value1,value2)

reducer.py

#!/usr/bin/env python
import sys
import itertools
for line in sys.stdin:
  input_key, input_value=line.split(' ',1)
  input_key = input_key.strip()
  input_value = input_value.strip()
  input_key = int(input_key)
  input_value = float(input_value)
  print str(input_key)+' '+str(input_value)

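I don't know whether the error is in my code or in the Hadoop configuration, because I am able to run the code locally with: $ cat /home/hadoop/hadoop/PR/pagerank/input.txt | python /home/hadoop/hadoop/PR/mapper.py | sort | python /home/hadoop/hadoop/PR/reducer.py

Any help would be appreciated. Thanks.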

【Question comments】:

    Tags: python hadoop


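    【Solution 1】:

    My guess is that your data is the key. Converting strings to floats, or something similar, may hit problems with your real data that do not show up in your local test data. You could perhaps address this with exception handling or assertions.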
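As a minimal sketch of the exception-handling idea, the parsing in reducer.py could be wrapped so that a malformed line is reported and skipped instead of aborting the task. The helper names `parse_record` and `reduce_stream` are hypothetical and not part of the original scripts, and the code assumes the same space-separated `key value` format the question's reducer expects.

```python
import sys


def parse_record(line):
    """Parse one 'key value' line; return (int, float), or None if malformed."""
    try:
        key_text, value_text = line.split(' ', 1)
        return int(key_text.strip()), float(value_text.strip())
    except ValueError as exc:
        # Covers a missing separator as well as non-numeric fields.
        # Anything written to stderr ends up in the task's log.
        sys.stderr.write('skipping malformed line %r: %s\n' % (line, exc))
        return None


def reduce_stream(lines, out=sys.stdout):
    """Echo every well-formed record, as the original reducer does."""
    for line in lines:
        record = parse_record(line)
        if record is not None:
            out.write('%d %s\n' % record)
```

In reducer.py this would be driven by calling `reduce_stream(sys.stdin)`. Skipping bad lines hides data problems, so raising after logging may be preferable while debugging.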

    【Comments】:

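      【Solution 2】:

      Look at the job tracking URL in the output. In your case: http://localhost:50030/jobdetails.jsp?jobid=job_201104222325_0021

      Click the number in the "Failed mappers" column, then the "Last 8KB" (or another) log link, and you will most likely see the Python exception you are hitting.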
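To make that log easier to read, the per-line work could be wrapped so that the failing input line is written to stderr next to the traceback. This is only a sketch: `run_with_diagnostics` and its `process_line` callback are hypothetical names, and it assumes the script processes its input one line at a time.

```python
import sys
import traceback


def run_with_diagnostics(process_line, lines, log=sys.stderr):
    """Apply process_line to each line; on failure, log context and re-raise."""
    for line_number, line in enumerate(lines, 1):
        try:
            process_line(line)
        except Exception:
            # Record which input line failed, then the full traceback.
            log.write('failed on input line %d: %r\n' % (line_number, line))
            log.write(traceback.format_exc())
            # Re-raise so the task still fails visibly.
            raise
```

A mapper would call it as `run_with_diagnostics(handle_line, sys.stdin)`, where `handle_line` stands for whatever the script does with each line.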

      【Comments】:
