【发布时间】:2012-12-16 03:30:46
【问题描述】:
我正在使用 Python 编写一个简单的流式映射缩减作业,以在 Amazon EMR 上运行。它基本上是用户记录的聚合器,将每个用户 ID 的条目分组在一起。
映射器
#!/usr/bin/env python
import sys
def main(argv):
line = sys.stdin.readline()
try:
while line:
line = line.rstrip()
elements = line.split()
print '%s\t%s' % (elements[0] , (elements[1],elements[2]) )
line = sys.stdin.readline()
except "end of file":
return None
if __name__ == '__main__':
main(sys.argv)
减速机:
#!/usr/bin/env python
import sys
def main(argv):
users=dict()
for line in sys.stdin:
elements=line.split('\t',1)
if elements[0] in users:
users[elements[0]].append(elements[1])
else:
users[elements[0]]=elements[1]
for user in users:
print '%s\t%s'% ( user, users[user] )
if __name__ == '__main__':
main(sys.argv)
此作业应在包含五个文本文件的目录上运行。 EMR作业的参数为:
输入:[桶名]/[输入文件夹名]
输出:[桶名]/输出
映射器:[桶名]/mapper.py
reducer:[桶名]/reducer.py
作业不断失败,原因是:步骤失败而关闭。这是日志的副本
2013-01-01 12:06:16,270 INFO org.apache.hadoop.mapred.JobClient (main): Default number of map tasks: null
2013-01-01 12:06:16,271 INFO org.apache.hadoop.mapred.JobClient (main): Setting default number of map tasks based on cluster size to : 8
2013-01-01 12:06:16,271 INFO org.apache.hadoop.mapred.JobClient (main): Default number of reduce tasks: 3
2013-01-01 12:06:18,392 INFO org.apache.hadoop.security.ShellBasedUnixGroupsMapping (main): add hadoop to shell userGroupsCache
2013-01-01 12:06:18,393 INFO org.apache.hadoop.mapred.JobClient (main): Setting group to hadoop
2013-01-01 12:06:18,647 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader (main): Loaded native gpl library
2013-01-01 12:06:18,670 WARN com.hadoop.compression.lzo.LzoCodec (main): Could not find build properties file with revision hash
2013-01-01 12:06:18,670 INFO com.hadoop.compression.lzo.LzoCodec (main): Successfully loaded & initialized native-lzo library [hadoop-lzo rev UNKNOWN]
2013-01-01 12:06:18,695 WARN org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library is available
2013-01-01 12:06:18,695 INFO org.apache.hadoop.io.compress.snappy.LoadSnappy (main): Snappy native library loaded
2013-01-01 12:06:19,050 INFO org.apache.hadoop.mapred.FileInputFormat (main): Total input paths to process : 5
2013-01-01 12:06:20,688 INFO org.apache.hadoop.streaming.StreamJob (main): getLocalDirs(): [/mnt/var/lib/hadoop/mapred]
2013-01-01 12:06:20,688 INFO org.apache.hadoop.streaming.StreamJob (main): Running job: job_201301011204_0001
2013-01-01 12:06:20,688 INFO org.apache.hadoop.streaming.StreamJob (main): To kill this job, run:
2013-01-01 12:06:20,688 INFO org.apache.hadoop.streaming.StreamJob (main): /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.255.131.225:9001 -kill job_201301011204_0001
2013-01-01 12:06:20,689 INFO org.apache.hadoop.streaming.StreamJob (main): Tracking URL: http://domU-12-31-39-01-7C-13.compute-1.internal:9100/jobdetails.jsp?jobid=job_201301011204_0001
2013-01-01 12:06:21,696 INFO org.apache.hadoop.streaming.StreamJob (main): map 0% reduce 0%
2013-01-01 12:08:02,238 INFO org.apache.hadoop.streaming.StreamJob (main): map 100% reduce 100%
2013-01-01 12:08:02,239 INFO org.apache.hadoop.streaming.StreamJob (main): To kill this job, run:
2013-01-01 12:08:02,240 INFO org.apache.hadoop.streaming.StreamJob (main): /home/hadoop/bin/hadoop job -Dmapred.job.tracker=10.255.131.225:9001 -kill job_201301011204_0001
2013-01-01 12:08:02,240 INFO org.apache.hadoop.streaming.StreamJob (main): Tracking URL: http://domU-12-31-39-01-7C-13.compute-1.internal:9100/jobdetails.jsp?jobid=job_201301011204_0001
2013-01-01 12:08:02,240 ERROR org.apache.hadoop.streaming.StreamJob (main): Job not successful. Error: # of failed Map Tasks exceeded allowed limit. FailedCount: 1. LastFailedTask: task_201301011204_0001_m_000002
2013-01-01 12:08:02,240 INFO org.apache.hadoop.streaming.StreamJob (main): killJob...
我做错了什么?
【问题讨论】:
-
您是否在输入中尝试过您的代码:cat input.txt | ./mapper.py |排序 | ./reducer.py > a.out
-
是的,我做到了。它在本地运行良好。
标签: python amazon-web-services mapreduce amazon-emr