【问题标题】:Hadoop Streaming can't run pythonHadoop Streaming 无法运行 python
【发布时间】:2020-11-29 16:33:03
【问题描述】:

我正在尝试使用 python 代码执行带有 mapreduce 的 hadoop 流,但是它总是给出相同的错误结果,

File: file:/C:/py-hadoop/map.py is not readable

Error: java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

我使用 hadoop 3.1.1python 3.8,以及 Windows 10 os

这是我的 map reduce 命令行

hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file C:/py-hadoop/map.py,C:/py-hadoop/reduce.py -mapper "python map.py" -reducer "python reduce.py" -input /user/input -output /user/python-output

map.py

import sys

for line in sys.stdin:
    line = line.strip()
    words = line.split()
    for word in words:
        print ("%s\t%s" % (word, 1))

reduce.py

from operator import itemgetter
import sys

current_word = None
current_count = 0
word = None
clean = '0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ '

for line in sys.stdin:
    line = line.strip()
    word, count = line.split('\t', 1)
    try:
        count = int(count)
    except ValueError:
        continue
    word = filter(lambda x: x in clean, word).lower()
    if current_word == word:
        current_count += count
    else:
        if current_word:
            print ("%s\t%s" % (current_word, current_count))
        current_count = count
        current_word = word

if current_word == word:
    print ("%s\t%s" % (current_word, current_count))

也已经尝试过使用不同的命令行,比如

hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file C:/py-hadoop/map.py -mapper "python map.py" -file C:/py-hadoop/reduce.py -reducer "python reduce.py" -input /user/input -output /user/python-output

hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file py-hadoop/map.py -mapper "python map.py" -file py-hadoop/reduce.py -reducer "python reduce.py" -input /user/input -output /user/python-output

但仍然给出完全相同的错误结果,

对不起,如果我的英语不好,我不是母语人士

【问题讨论】:

    标签: python python-3.x hadoop mapreduce hadoop-streaming


    【解决方案1】:

    已经修好了, 问题是由reduce.py引起的,这是我的新reduce.py

    import sys
    import collections
    
    counter = collections.Counter()
    
    for line in sys.stdin:
        word, count = line.strip().split("\t", 1)
    
        counter[word] += int(count)
    
    for x in counter.most_common(9999):
        print(x[0],"\t",x[1])
    

    这是我用来运行的命令行

    hadoop jar C:/hadoop/share/hadoop/tools/lib/hadoop-streaming-3.1.1.jar -file C:/py-hadoop/map.py -file C:/py-hadoop/reduce.py -mapper "python map.py" -reducer "python reduce.py" -input /user/input -output /user/python-output
    

    【讨论】:

      猜你喜欢
      • 1970-01-01
      • 2020-08-09
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      • 1970-01-01
      相关资源
      最近更新 更多