【Question Title】: Packaging PySpark with PEX environment on Dataproc
【Posted】: 2022-01-25 05:46:12
【Question】:

I'm trying to package a PySpark job with PEX to run on Google Cloud Dataproc, but I'm getting a Permission Denied error.

I've packaged my third-party and local dependencies into env.pex, and an entry point that uses those dependencies into main.py. I then gsutil cp both files up to gs://<PATH> and run the script below.
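For context, env.pex was built and uploaded roughly like this (a sketch, not my exact commands; the sources directory and requirements file are placeholders for your project). Note that --include-tools matters later if you ever want `PEX_TOOLS=1 ... venv` extraction:

```shell
# Hypothetical build; substitute your own sources and requirements.
# --include-tools enables the `PEX_TOOLS=1 python env.pex venv ...` tooling.
pex \
  --requirement requirements.txt \
  --sources-directory src \
  --include-tools \
  --output-file env.pex

# Upload the pex and the entry point to the same bucket path.
gsutil cp env.pex main.py "gs://<PATH>/"
```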

from google.cloud import dataproc_v1 as dataproc

def submit_job(project_id: str, region: str, cluster_name: str):
    job_client = dataproc.JobControllerClient(
        client_options={"api_endpoint": f"{region}-dataproc.googleapis.com:443"}
    )
    operation = job_client.submit_job_as_operation(
        request={
            "project_id": project_id,
            "region": region,
            "job": {
                "placement": {"cluster_name": cluster_name},
                "pyspark_job": {
                    "main_python_file_uri": "gs://<PATH>/main.py",
                    "file_uris": ["gs://<PATH>/env.pex"],
                    "properties": {
                        "spark.pyspark.python": "./env.pex",
                        "spark.executorEnv.PEX_ROOT": "./.pex",
                    },
                },
            },
        }
    )
    # Wait for the job to finish; the Permission Denied error surfaces here.
    response = operation.result()

The error I get is:

Exception in thread "main" java.io.IOException: Cannot run program "./env.pex": error=13, Permission denied
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
    at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:951)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1039)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1048)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.IOException: error=13, Permission denied
    at java.lang.UNIXProcess.forkAndExec(Native Method)
    at java.lang.UNIXProcess.<init>(UNIXProcess.java:247)
    at java.lang.ProcessImpl.start(ProcessImpl.java:134)
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1029)
    ... 14 more

Should I expect to be able to package my environment like this? I can't see a way to change the permissions of files included as file_uris in the PySpark job config, and I haven't found any documentation on Google Cloud about packaging with PEX, but the PySpark official docs include this guide.
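For reference, the flow in that PySpark guide looks roughly like the following spark-submit invocation (paraphrased from the guide; Dataproc's job API normally hides this layer):

```shell
# Paraphrase of the PySpark "Python Package Management" PEX example:
# ship the pex with --files and point the worker interpreter at it.
export PYSPARK_DRIVER_PYTHON=python   # the driver uses a normal interpreter
export PYSPARK_PYTHON=./env.pex       # executors run the shipped pex
spark-submit --files env.pex main.py
```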

Any help is appreciated - thanks!

【Discussion】:

    Tags: google-cloud-platform pyspark google-cloud-dataproc dataproc python-pex


    【Solution 1】:

    You can always run a PEX file with a compatible interpreter. So rather than specifying ./env.pex as the program, you can try python env.pex. That does not require env.pex to be executable.
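    A quick way to see why this works (a standalone illustration, not Dataproc-specific): a PEX file is a Python zip application, so an interpreter can execute it directly even when the file itself lacks the executable bit. The sketch below builds a throwaway zipapp with the stdlib to demonstrate the same property:

    ```python
    import os
    import subprocess
    import sys
    import tempfile
    import zipapp

    # A zip application (like a .pex) does not need the executable bit
    # as long as you launch it via an interpreter.
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "src")
        os.makedirs(src)
        with open(os.path.join(src, "__main__.py"), "w") as f:
            f.write("print('hello from zipapp')\n")

        app = os.path.join(tmp, "app.pyz")
        zipapp.create_archive(src, target=app)
        os.chmod(app, 0o644)  # deliberately NOT executable, like the failing env.pex

        # `./app.pyz` would fail with EACCES, but `python app.pyz` works fine.
        out = subprocess.run([sys.executable, app], capture_output=True, text=True)
        print(out.stdout.strip())  # prints: hello from zipapp
    ```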

    【Discussion】:

      【Solution 2】:

      I wasn't able to run the pex directly in the end, but for now I did find a workaround, suggested by a user in the pants slack community (thanks!)...

      The workaround is to unpack the pex into a venv in a cluster initialization script.

      The init script, copied with gsutil to gs://<PATH TO INIT SCRIPT>:

      #!/bin/bash
      
      set -exo pipefail
      
      readonly PEX_ENV_FILE_URI=$(/usr/share/google/get_metadata_value attributes/PEX_ENV_FILE_URI || true)
      readonly PEX_FILES_DIR="/pexfiles"
      readonly PEX_ENV_DIR="/pexenvs"
      
      function err() {
          echo "[$(date +'%Y-%m-%dT%H:%M:%S%z')]: $*" >&2
          exit 1
      }
      
      function install_pex_into_venv() {
          local -r pex_name=${PEX_ENV_FILE_URI##*/}
          local -r pex_file="${PEX_FILES_DIR}/${pex_name}"
          local -r pex_venv="${PEX_ENV_DIR}/${pex_name}"
      
          echo "Installing pex from ${pex_file} into venv ${pex_venv}..."
          gsutil cp "${PEX_ENV_FILE_URI}" "${pex_file}"
          PEX_TOOLS=1 python "${pex_file}" venv --compile "${pex_venv}"
      }
      
      function main() {
          if [[ -z "${PEX_ENV_FILE_URI}" ]]; then
              err "ERROR: Must specify PEX_ENV_FILE_URI metadata key"
          fi
      
          install_pex_into_venv
      }
      
      main
      

      Start the cluster, running the init script to unpack the pex into a venv on the cluster:

      from google.cloud import dataproc_v1 as dataproc
      
      def start_cluster(project_id: str, region: str, cluster_name: str):
          cluster_client = dataproc.ClusterControllerClient(...)
          operation = cluster_client.create_cluster(
              request={
                  "project_id": project_id,
                  "region": region,
                  "cluster": {
                      "project_id": project_id,
                      "cluster_name": cluster_name,
                      "config": {
                          "master_config": <CONFIG>,
                          "worker_config": <CONFIG>,
                          "initialization_actions": [
                              {
                                  "executable_file": "gs://<PATH TO INIT SCRIPT>",
                              },
                          ],
                          "gce_cluster_config": {
                              "metadata": {"PEX_ENV_FILE_URI": "gs://<PATH>/env.pex"},
                          },
                      },
                  },
              }
          )
          # Block until the cluster (and its init action) is ready before submitting jobs.
          cluster = operation.result()


      Submit the job, running the PySpark job with the unpacked pex venv:

      def submit_job(project_id: str, region: str, cluster_name: str):
          job_client = dataproc.JobControllerClient(...)  # note: JobControllerClient, not ClusterControllerClient
          operation = job_client.submit_job_as_operation(
              request={
                  "project_id": project_id,
                  "region": region,
                  "job": {
                      "placement": {"cluster_name": cluster_name},
                      "pyspark_job": {
                          "main_python_file_uri": "gs://<PATH>/main.py",
                          "properties": {
                              "spark.pyspark.python": "/pexenvs/env.pex/bin/python",
                          },
                      },
                  },
              }
          )
          # Wait for the job to complete.
          response = operation.result()
      
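      To make the interpreter path in spark.pyspark.python less magic: it follows directly from the basename logic in the init script. This is just the bash `${PEX_ENV_FILE_URI##*/}` expansion mirrored in Python (the URI keeps the same placeholder as above):

      ```python
      # Mirror of the init script's path construction (PEX_ENV_DIR="/pexenvs").
      pex_env_file_uri = "gs://<PATH>/env.pex"        # same placeholder as above
      pex_name = pex_env_file_uri.rsplit("/", 1)[-1]  # bash: ${PEX_ENV_FILE_URI##*/}
      python_for_spark = f"/pexenvs/{pex_name}/bin/python"
      print(python_for_spark)  # prints: /pexenvs/env.pex/bin/python
      ```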

      【Discussion】:
