【Posted】: 2020-02-04 21:40:39
【Question】:
I'm trying to run a PySpark job through Jupyter, and I need to write a function that submits the job. I have to pass in a jar file, and I'm trying to figure out how to do that. I did find some documentation:
https://cloud.google.com/dataproc/docs/reference/rpc/google.cloud.dataproc.v1#google.cloud.dataproc.v1.SubmitJobRequest
https://cloud.google.com/dataproc/docs/reference/rest/v1beta2/HadoopJob
But I can't figure out exactly how to add the URI in the function. My function currently looks like this:
from google.cloud import dataproc_v1

def submit_pyspark_job(dataproc_cluster_client, project, region, cluster_name,
                       bucket_name, filename):
    """Submit the PySpark job to the cluster (assumes `filename` was uploaded
    to `bucket_name`)."""
    job_details = {
        'placement': {
            'cluster_name': cluster_name
        },
        'pyspark_job': {
            'jar_file_uris': 'gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar',  # PROBLEM HERE!
            'main_python_file_uri': 'gs://{}/{}'.format(bucket_name, filename)
        }
    }
    result = dataproc_cluster_client.submit_job(
        project_id=project, region=region, job=job_details)
    job_id = result.reference.job_id
    print('Submitted job ID {}.'.format(job_id))
    return job_id
The problem is in the `jar_file_uris` part of the job-details argument; at the moment I'm getting an error there.
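A likely cause, based on the HadoopJob/PySparkJob reference linked above: `jar_file_uris` is a repeated field in the Dataproc job message, so it expects a list of URI strings rather than a single string. A minimal sketch of the corrected `job_details` dict, using hypothetical placeholder values for the cluster name, bucket, and filename:

```python
# jar_file_uris is a repeated (list-valued) field in the Dataproc
# PySparkJob message, so the single jar URI must be wrapped in a list.
job_details = {
    'placement': {
        'cluster_name': 'my-cluster'  # hypothetical cluster name
    },
    'pyspark_job': {
        'jar_file_uris': [
            'gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar'
        ],
        # hypothetical bucket/file; in the function this would be
        # 'gs://{}/{}'.format(bucket_name, filename)
        'main_python_file_uri': 'gs://my-bucket/my_job.py'
    }
}
```

With that change, the rest of the function (the `submit_job` call) can stay as it is.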
【Discussion】:
标签: google-cloud-platform pyspark google-cloud-dataproc jupyterhub