【Question Title】: How to include a jar URI in a submit job function on Dataproc
【Posted】: 2020-02-04 21:40:39
【Question】:

I'm trying to run a PySpark job through Jupyter, and I need to write a function that submits the job. I need to pass in a jar file, and I'm trying to figure out how to do that. I did find some documentation: https://cloud.google.com/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.SubmitJobRequest

https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/HadoopJob

but I can't figure out exactly how to add the URI to the function. My function currently looks like this:

from google.cloud import dataproc_v1

def submit_pyspark_job(dataproc_cluster_client, project, region, cluster_name, bucket_name,
                       filename):
    """Submit the PySpark job to the cluster (assumes `filename` was uploaded
    to `bucket_name`)."""
    job_details = {
        'placement': {
            'cluster_name': cluster_name
        },
        'pyspark_job': {
            'jar_file_uris':'gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar', #PROBLEM HERE!
            'main_python_file_uri': 'gs://{}/{}'.format(bucket_name, filename)
        }
    }

    result = dataproc_cluster_client.submit_job(
        project_id=project, region=region, job=job_details)
    job_id = result.reference.job_id
    print('Submitted job ID {}.'.format(job_id))
    return job_id

The problem is in the `jar_file_uris` part of the job details parameter. As it stands, I'm getting an error.

【Comments】:

    Tags: google-cloud-platform pyspark google-cloud-dataproc jupyterhub


    【Solution 1】:

    So I figured out the fix. The function should be declared as:

    def submit_pyspark_job(dataproc_cluster_client, project, region, cluster_name, bucket_name,
                           filename):
        """Submit the PySpark job to the cluster (assumes `filename` was uploaded
        to `bucket_name`)."""
        job_details = {
            'placement': {
                'cluster_name': cluster_name
            },
            'pyspark_job': {
                'main_python_file_uri': 'gs://{}/{}'.format(bucket_name, filename),
                'jar_file_uris':['gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar']
            }
        }
    
        result = dataproc_cluster_client.submit_job(
            project_id=project, region=region, job=job_details)
        job_id = result.reference.job_id
        print('Submitted job ID {}.'.format(job_id))
        return job_id
    

    The jar URIs need to be passed as an array (a Python list), not as a single string. That fixed the problem.
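
    A minimal sketch of the same idea, factored into a helper: `build_pyspark_job` is a hypothetical convenience function (not part of the Dataproc API) that builds the `pyspark_job` dict and guards against the string-vs-list mistake by always wrapping the jar URIs in a list.

        def build_pyspark_job(bucket_name, filename, jar_uris):
            """Build a `pyspark_job` dict; `jar_uris` may be one URI or a list of URIs."""
            if isinstance(jar_uris, str):
                # A bare string would be rejected; jar_file_uris must be a list.
                jar_uris = [jar_uris]
            return {
                'main_python_file_uri': 'gs://{}/{}'.format(bucket_name, filename),
                'jar_file_uris': list(jar_uris),
            }

        # Example (bucket/file names are placeholders):
        job = build_pyspark_job(
            'my-bucket', 'my_job.py',
            'gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar')

    The resulting dict can then be dropped into `job_details['pyspark_job']` before calling `submit_job`, and extra connector jars can be appended to the list as needed.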

    【Discussion】:
