【Problem Title】: Unexpected arguments error appearing on the command line when running a MapReduce job (mrjob) using Python
【Posted】: 2020-08-13 15:51:46
【Problem Description】:

I am fairly new to this process. I am trying to run a simple map-reduce job with Python 3.8 on a CSV file against a local Hadoop cluster (Hadoop version 3.2.1), running on Windows 10 (64-bit). The goal is to process a CSV file and produce output with the top 10 salaries in the file, but it does not work.

When I enter this command:

$ python test2.py hdfs:///sample/salary.csv -r hadoop --hadoop-streaming-jar %HADOOP_HOME%/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar

the output reports an error:

No configs found; falling back on auto-configuration
No configs specified for hadoop runner
Looking for hadoop binary in C:\hdp\hadoop\hadoop-dist\target\hadoop-3.2.1\bin...
Found hadoop binary: C:\hdp\hadoop\hadoop-dist\target\hadoop-3.2.1\bin\hadoop.CMD
Using Hadoop version 3.2.1
Creating temp directory C:\Users\Name\AppData\Local\Temp\test2.Name.20200813.003240.345552
uploading working dir files to hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd...
Copying other local files to hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/
Running step 1 of 1...
  Found 2 unexpected arguments on the command line [hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/setup-wrapper.sh#setup-wrapper.sh, hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/test2.py#test2.py]
  Try -help for more information
  Streaming Command Failed!
Attempting to fetch counters from logs...
Can't fetch history log; missing job ID
No counters found
Scanning logs for probable cause of failure...
Can't fetch history log; missing job ID
Can't fetch task logs; missing application ID
Step 1 of 1 failed: Command '['C:\\hdp\\hadoop\\hadoop-dist\\target\\hadoop-3.2.1\\bin\\hadoop.CMD', 'jar', 'C:\\hdp\\hadoop\\hadoop-dist\\target\\hadoop-3.2.1/share/hadoop/tools/lib/hadoop-streaming-3.2.1.jar', '-files', 'hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/mrjob.zip#mrjob.zip,hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/setup-wrapper.sh#setup-wrapper.sh,hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/test2.py#test2.py', '-input', 'hdfs:///sample/salary.csv', '-output', 'hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/output', '-mapper', '/bin/sh -ex setup-wrapper.sh python3 test2.py --step-num=0 --mapper', '-combiner', '/bin/sh -ex setup-wrapper.sh python3 test2.py --step-num=0 --combiner', '-reducer', '/bin/sh -ex setup-wrapper.sh python3 test2.py --step-num=0 --reducer']' returned non-zero exit status 1.

This is the error from the output above:

Found 2 unexpected arguments on the command line [hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/setup-wrapper.sh#setup-wrapper.sh, hdfs:///user/Name/tmp/mrjob/test2.Name.20200813.003240.345552/files/wd/test2.py#test2.py]

Here is the Python file, test2.py:

from mrjob.job import MRJob
from mrjob.step import MRStep
import csv

cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

class salarymax(MRJob):

    def mapper(self, _, line):
        # Convert each line into a dictionary
        # (next() on the reader object; the .next() method only exists in Python 2)
        row = dict(zip(cols, [a.strip() for a in next(csv.reader([line]))]))

        # Yield the salary
        yield 'salary', (float(row['AnnualSalary'][1:]), line)

        # Yield the gross pay
        try:
            yield 'gross', (float(row['GrossPay'][1:]), line)
        except ValueError:
            self.increment_counter('warn', 'missing gross', 1)

    def reducer(self, key, values):
        topten = []

        # For 'salary' and 'gross' compute the top 10
        for p in values:
            topten.append(p)
            topten.sort()
            topten = topten[-10:]

        for p in topten:
            yield key, p

    combiner = reducer


if __name__ == '__main__':
    salarymax.run()
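As a quick sanity check of the mapper's parsing logic, independent of Hadoop, the line-to-dict conversion can be exercised on a single sample row. The data values below are made up for illustration; only the column layout comes from the job above:

```python
import csv

# Same column layout as in test2.py
cols = 'Name,JobTitle,AgencyID,Agency,HireDate,AnnualSalary,GrossPay'.split(',')

# A made-up sample line in the expected CSV format (salary fields prefixed with '$')
line = 'Doe John,Clerk,A01,Finance,01/02/2010,$55000.00,$53210.50'

# next() pulls the first (and only) parsed row from the reader;
# the .next() method used in the original code only exists in Python 2
row = dict(zip(cols, [a.strip() for a in next(csv.reader([line]))]))

print(row['AnnualSalary'])             # '$55000.00'
print(float(row['AnnualSalary'][1:]))  # 55000.0 ('[1:]' strips the '$')
```

If this snippet raises, the job would fail inside the mapper regardless of the Hadoop setup, so it is worth ruling out first.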

I looked at this StackOverflow question, How to run a MRJob in a local Hadoop Cluster with Hadoop Streaming?, but it does not solve my error.

I also looked at the setup-wrapper.sh file, since that is where the error points. As far as I can tell, there is nothing wrong with it.

I do not understand what the error is. Is there a way to fix it?

【Comments】:

  • Did you ever find a solution to this? I am having the same experience.
  • @Cassova Unfortunately not. I hope you find one yourself! Good luck!

Tags: python hadoop mapreduce hdfs hadoop-streaming


【Solution 1】:

I ran into the same problem, and reinstalling the Java JDK resolved it for me. I had originally installed it to C:\Program Files\Java, then moved it to C:\Java following some instructions. I assumed updating the environment variables would be enough, but apparently it was not. So I uninstalled Java and reinstalled it, this time to C:\Java, and that fixed my problem.
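A plausible reason moving the JDK helped is that Hadoop's Windows scripts are known to have trouble with installation paths containing spaces (such as C:\Program Files), since the generated commands are not always quoted. As a quick sketch, the relevant environment variables can be scanned for this; the helper function below is hypothetical and the example values are made up, but whether a space in the path is the actual root cause here is an assumption:

```python
import os

def check_path_vars(env, names=('JAVA_HOME', 'HADOOP_HOME')):
    """Return the names of variables whose paths contain a space
    (hypothetical helper for spotting problematic Windows install paths)."""
    problems = []
    for var in names:
        path = env.get(var)
        if path is not None and ' ' in path:
            problems.append(var)
    return problems

# Example with made-up values: C:\Program Files\Java would be flagged.
print(check_path_vars({'JAVA_HOME': r'C:\Program Files\Java\jdk1.8',
                       'HADOOP_HOME': r'C:\hdp\hadoop'}))  # ['JAVA_HOME']

# To check the real environment: check_path_vars(os.environ)
```

If any variable is flagged, reinstalling to a space-free path such as C:\Java (as in the answer above) and updating the variable is the usual workaround.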

【Discussion】:
