【Title】: Error when deploying a Google Dataflow job on App Engine cron
【Posted】: 2017-10-18 09:43:21
【Question】:

(Follow-up to a previous question.)

I am trying to deploy a Google Dataflow job, following the approach described here, so that it runs as a cron job on Google App Engine.
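
For reference, the assumed project layout is roughly the following (the file names other than pipelines/script.py are inferred from the tracebacks below and from the App Engine cron setup that guide describes):

.
├── app.yaml          # App Engine service that triggers the pipeline (assumed)
├── cron.yaml         # cron schedule calling that service (assumed)
├── setup.py
├── requirements.txt
└── pipelines/
    ├── __init__.py
    └── script.py     # the pipeline; appears as pipelines.spanner_backup in the tracebacks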

I have a Dataflow script (written in Python) in a pipelines/script.py file. Running this script locally (with the Apache Beam DirectRunner) or on Google Cloud (with the DataflowRunner) works fine. But when the job is deployed to run periodically on App Engine, it raises the following error when it executes:

(4cb822d7f796239a): Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 582, in do_work
    work_executor.execute()
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py", line 166, in execute
    op.start()
  File "apache_beam/runners/worker/operations.py", line 294, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10607)
    def start(self):
  File "apache_beam/runners/worker/operations.py", line 295, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:10501)
    with self.scoped_start_state:
  File "apache_beam/runners/worker/operations.py", line 300, in apache_beam.runners.worker.operations.DoOperation.start (apache_beam/runners/worker/operations.c:9702)
    pickler.loads(self.spec.serialized_fn))
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 225, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named pipelines.spanner_backup

This is the stack trace shown when accessing the job directly in the Dataflow panel of the Google Cloud console. However, if I click on "Stack Traces" to see the error's stack trace in the Stackdriver Error Reporting panel, I see the following trace instead:

Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py", line 738, in run
    work, execution_context, env=self.environment)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workitem.py", line 130, in get_work_items
    work_item_proto.sourceOperationTask.split)
  File "/usr/local/lib/python2.7/dist-packages/dataflow_worker/workercustomsources.py", line 142, in __init__
    source_spec[names.SERIALIZED_SOURCE_KEY]['value'])
  File "/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py", line 225, in loads
    return dill.loads(s)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 277, in loads
    return load(file)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 266, in load
    obj = pik.load()
  File "/usr/lib/python2.7/pickle.py", line 858, in load
    dispatch[key](self)
  File "/usr/lib/python2.7/pickle.py", line 1090, in load_global
    klass = self.find_class(module, name)
  File "/usr/local/lib/python2.7/dist-packages/dill/dill.py", line 423, in find_class
    return StockUnpickler.find_class(self, module, name)
  File "/usr/lib/python2.7/pickle.py", line 1124, in find_class
    __import__(module)
ImportError: No module named spanner.client

This hints at some import error when things are shared between the workers? Google Spanner should be properly installed, though.
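
For context: Beam serializes the pipeline's functions with dill on the machine that submits the job and deserializes them on each worker. Functions and classes that live in a named module are pickled by reference (module name plus attribute name), so unpickling calls __import__(module) on the worker, and it fails exactly as above if the module is not installed there. A rough local illustration of the by-reference behavior, using the stock json module as a stand-in for pipelines.spanner_backup:

from apache_beam.internal import pickler

import json

# json.loads is a module-level function, so it is pickled as a reference
# ("json" + "loads") rather than by value...
payload = pickler.dumps(json.loads)

# ...and loading the payload re-imports the module; on a machine where the
# module were missing, this would raise "ImportError: No module named json".
fn = pickler.loads(payload)
assert fn is json.loads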

I am using:

Flask==0.12.2 
apache-beam[gcp]==2.1.1 
gunicorn==19.7.1 
gevent==1.2.1
google-cloud-dataflow==2.1.1 
google-cloud-spanner==0.26

Am I missing something?

Edit: my setup.py is below (as described here; the corresponding GitHub link, with comments, is here):

from distutils.command.build import build as _build
import subprocess
import setuptools

class build(_build):  # pylint: disable=invalid-name
  sub_commands = _build.sub_commands + [('CustomCommands', None)]

# Each entry is a shell command executed during the custom build step on the
# Dataflow workers; extra dependencies (e.g. apt-get installs) can go here.
CUSTOM_COMMANDS = [
    ['echo', 'Custom command worked!']]


class CustomCommands(setuptools.Command):
  """A setuptools Command class able to run arbitrary commands."""

  def initialize_options(self):
    pass

  def finalize_options(self):
    pass

  def RunCustomCommand(self, command_list):
    print 'Running command: %s' % command_list
    p = subprocess.Popen(
        command_list,
        stdin=subprocess.PIPE, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    # Can use communicate(input='y\n'.encode()) if the command run requires
    # some confirmation.
    stdout_data, _ = p.communicate()
    print 'Command output: %s' % stdout_data
    if p.returncode != 0:
      raise RuntimeError(
          'Command %s failed: exit code: %s' % (command_list, p.returncode))

  def run(self):
    for command in CUSTOM_COMMANDS:
      self.RunCustomCommand(command)

REQUIRED_PACKAGES = [
    "Flask==0.12.2",
    "apache-beam[gcp]==2.1.1",
    "gunicorn==19.7.1",
    "gevent==1.2.1",
    "google-cloud-dataflow==2.1.1",
    "google-cloud-spanner==0.26",
]

setuptools.setup(
    name='dataflow_python_pipeline',
    version='1.0.0',
    description='DataFlow Python Pipeline',
    install_requires=REQUIRED_PACKAGES,
    packages=setuptools.find_packages(),
    cmdclass={
        'build': build,
        'CustomCommands': CustomCommands,
        }
    )

【Comments】:

  • Do you have --save_main_session in your pipeline options? If so, try removing it.
  • Thanks for the reformatting. I had it; it is needed to run the job with the DataflowRunner when submitting it from my computer. However, removing it leads to the same error.
  • OK, please add the contents of your setup.py as well.
  • I added it. Should I add "pip install ***" to the "CUSTOM_COMMANDS" in the setup.py file for all the modules I need?
  • That, or you can try filling REQUIRED_PACKAGES with your modules, like this: REQUIRED_PACKAGES = ["google-cloud-spanner==0.26", "another-module==1.0"], and so on...

Tags: python google-app-engine google-cloud-dataflow apache-beam


【Solution 1】:

For the record, here is the solution to my problem. Thanks to Marcin Zabloki for helping me.

It seems I was not properly linking the setup file to the pipeline. By replacing

pipeline_options = PipelineOptions()
pipeline_options.view_as(SetupOptions).save_main_session = True
pipeline_options.view_as(SetupOptions).requirements_file = "requirements.txt"
google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'

with

pipeline_options = PipelineOptions()
pipeline_options.view_as(SetupOptions).setup_file = "./setup.py"
google_cloud_options = pipeline_options.view_as(GoogleCloudOptions)
google_cloud_options.project = PROJECT_ID
google_cloud_options.job_name = JOB_NAME
google_cloud_options.staging_location = '%s/staging' % BUCKET_URL
google_cloud_options.temp_location = '%s/tmp' % BUCKET_URL
pipeline_options.view_as(StandardOptions).runner = 'DataflowRunner'

(and adding the modules to install in the setup.py file rather than in requirements.txt), as well as loading the modules I use inside the ParDos rather than at the top of the file, I was able to deploy the script.
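
Equivalently, the same configuration can be given as standard Beam flags when constructing the options (a sketch; the project, job name, and bucket values are placeholders):

from apache_beam.options.pipeline_options import PipelineOptions

pipeline_options = PipelineOptions([
    '--runner=DataflowRunner',
    '--project=<PROJECT_ID>',                    # placeholder
    '--job_name=<JOB_NAME>',                     # placeholder
    '--staging_location=gs://<BUCKET>/staging',  # placeholder
    '--temp_location=gs://<BUCKET>/tmp',         # placeholder
    '--setup_file=./setup.py',
])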

Not doing so seems to lead to strange, undefined behavior (such as functions not finding classes defined in the same file) rather than a clear error message.
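
To illustrate the "load the modules inside the ParDos" part, here is a minimal sketch (the DoFn name and element handling are hypothetical; only the placement of the import mirrors the fix):

import apache_beam as beam

class SpannerBackupFn(beam.DoFn):
    """Hypothetical DoFn that talks to Cloud Spanner."""

    def process(self, element):
        # Imported here, on the worker, instead of at module level: by the
        # time process() runs, the packages listed in setup.py have been
        # installed on the Dataflow worker.
        from google.cloud import spanner
        # ... use the spanner client on `element` here ...
        yield element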

【Discussion】:
