python subprocess module hangs for spark-submit command when writing STDOUT

【Question title】: python subprocess module hangs for spark-submit command when writing STDOUT
【Posted】: 2017-02-22 22:39:24
【Question】:

I have a python script that submits spark jobs using the spark-submit tool. I want to execute the command and write its output to both STDOUT and a log file in real time. I'm using python 2.7 on an ubuntu server.

This is what I have in my SubmitJob.py script:

#!/usr/bin/python
import subprocess

# Submit the command, echoing each output line to the screen and to the log file
def submitJob(cmd, log_file):
    with open(log_file, 'w') as fh:
        process = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
        while True:
            output = process.stdout.readline()
            # An empty read from a finished process means there is no more output
            if output == '' and process.poll() is not None:
                break
            if output:
                print output.strip()
                fh.write(output)
        rc = process.poll()
        return rc

if __name__ == "__main__":
    cmdList = ["dse", "spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]
    log_file = "/tmp/out.log"
    exit_status = submitJob(cmdList, log_file)
    print "job finished with status ", exit_status

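The strange thing is that when I run the same command directly in the shell, it works fine and produces output on the screen as the program progresses.

So it looks like something is wrong with the way I'm using subprocess.PIPE for stdout and writing to the file.

What is the currently recommended way to use the subprocess module to write to stdout and a log file line by line, in real time? I see a bunch of options on the internet but I'm not sure which one is correct or up to date.

Thanks

【Question comments】:

Your for loop could be a bit leaner; otherwise that should do it. I don't know spark or what it does with stdout, but that might be the better place to look. I think you should add a spark tag, and maybe remove the bash tag.

Tags: python linux python-2.7 apache-spark subprocess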

【Solution 1】:

To print the Spark logs, you can use the command list given by user330612:

  cmdList = ["spark-submit", "--spark-master", "spark://127.0.0.1:7077", "--class", "com.spark.myapp", "./myapp.jar"]

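Then you can print the output using subprocess. Remember to use communicate() to prevent a deadlock (https://docs.python.org/2/library/subprocess.html). The documentation warns that a deadlock occurs when using stdout=PIPE and/or stderr=PIPE and the child process generates enough output to a pipe that it blocks waiting for the OS pipe buffer to accept more data. Use communicate() to avoid that. Below is the code to print the logs.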

import subprocess
# Capture stdout and stderr in separate pipes (the duplicated stdout keyword was a syntax error)
p = subprocess.Popen(cmdList, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
# communicate() reads both pipes until the process exits, so neither can fill up and block
stdout, stderr = p.communicate()
stderr = stderr.splitlines()
stdout = stdout.splitlines()
for line in stderr:
    print line  # now it can be printed line by line to a file or something else, for the log
for line in stdout:
    print line  # for the output

For more information about subprocess and printing lines, see: https://pymotw.com/2/subprocess/
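Note that communicate() only returns once the process has exited, so the code above prints everything at the end rather than in real time. The following is a minimal sketch of a line-by-line alternative, not part of the original answer. It assumes that merging stderr into stdout is acceptable, and the function name run_and_tee is made up for illustration. It is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

import subprocess
import sys


def run_and_tee(cmd, log_path):
    """Run cmd, echoing each output line to the console and to log_path.

    Hypothetical helper for illustration; returns the child's exit status.
    """
    with open(log_path, "w") as log:
        process = subprocess.Popen(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # merge stderr into the same pipe
            universal_newlines=True,   # get text lines on Python 3 as well
        )
        # readline() returns '' only at end of file, i.e. once the child
        # has closed its output, so the loop ends without polling.
        for line in iter(process.stdout.readline, ""):
            sys.stdout.write(line)
            sys.stdout.flush()
            log.write(line)
            log.flush()
        process.stdout.close()
        return process.wait()
```

It could be called as run_and_tee(cmdList, "/tmp/out.log"). Because a single pipe is read continuously, the pipe buffer cannot fill up while the parent waits. The child may still buffer its own output when it is not attached to a terminal, in which case lines can arrive in bursts.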

【Comments】:

【Solution 2】:

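Figured out the problem. I was trying to redirect both stdout and stderr to the pipe so that they would show on the screen. This seems to block stdout when there is output on stderr. If I remove the stderr=subprocess.STDOUT argument from Popen, it works fine. So for spark-submit it looks like you don't need to redirect stderr explicitly, because it already does so implicitly.

【Comments】:

Does anyone know whether this is a bug in spark-submit or in Python's subprocess module?
I believe it's because spark-submit sends much of its output to stderr, so printing stdout won't give you the script's actual output.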
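
To see which stream a program actually writes to, the two can be captured separately. This is a small sketch, not taken from the thread: the child here is a stand-in Python one-liner (an assumption for illustration) that writes one line to stdout and one log-style line to stderr. Replacing the command with the spark-submit command list should show where its output goes.

```python
from __future__ import print_function

import subprocess
import sys

# Stand-in child (an assumption for illustration): one result line on
# stdout and one log-style line on stderr.
child_code = "import sys; print('result'); sys.stderr.write('INFO log line\\n')"

process = subprocess.Popen(
    [sys.executable, "-c", child_code],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE,
    universal_newlines=True,
)
# communicate() drains both pipes, so neither can fill up and block the child.
out, err = process.communicate()

print("stdout:", repr(out))
print("stderr:", repr(err))
```

If nearly everything shows up under stderr, a reader that only consumes stdout will look silent even though the job is running.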