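【Posted at】: 2019-01-02 15:52:30
【Problem description】:
When I submit a Hadoop MapReduce job programmatically (from a Java application that uses the dataproc library), the job fails immediately. When I submit the exact same job through the UI, it works fine.
I have SSHed into the Dataproc cluster to confirm that the file exists, checked its permissions, and changed the jar reference. Nothing has worked so far.
The error I get: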
Exception in thread "main" java.lang.ClassNotFoundException: file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar
    at java.lang.Class.forName0(Native Method)
    at java.lang.Class.forName(Class.java:264)
    at com.google.cloud.hadoop.services.agent.job.shim.HadoopRunClassShim.main(HadoopRunClassShim.java:18)
Job output is complete
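A note on reading this trace: the exception message is the jar URI itself, and the throwing frame is Class.forName. That is the shape of error produced when a string that is not a class name is handed to Class.forName. The sketch below reproduces only that behaviour with the standard library. The class name ForNameRepro and the helper tryLoad are hypothetical, and the sketch does not establish how the Dataproc agent ended up with that string. One thing worth checking, as an assumption rather than a diagnosis, is whether the programmatic request places the jar path in a main-class field instead of mainJarFileUri.

```java
class ForNameRepro {
    // Tries to load `name` as a class.
    // Returns null if the class loads, otherwise the ClassNotFoundException message.
    static String tryLoad(String name) {
        try {
            Class.forName(name);
            return null;
        } catch (ClassNotFoundException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // Same string that appears in the failing job's error message.
        String jarUri = "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar";
        System.out.println("java.lang.ClassNotFoundException: " + tryLoad(jarUri));

        // A real class name loads without error, so the helper returns null.
        System.out.println(tryLoad("java.lang.String"));
    }
}
```

Comparing the first printed line with the error above shows whether the two failures have the same form, independent of whether the jar exists on disk.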
When I clone the failed job in the console and view the REST equivalent, this is what I see:
POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "reference": {
      "projectId": "project-id",
      "jobId": "jobDoesNotWork"
    },
    "placement": {
      "clusterName": "cluster-name",
      "clusterUuid": "uuid"
    },
    "submittedBy": "service-account@project.iam.gserviceaccount.com",
    "jobUuid": "uuid",
    "hadoopJob": {
      "args": [
        "-Dmapred.reduce.tasks=20",
        "-Dmapred.output.compress=true",
        "-Dmapred.compress.map.output=true",
        "-Dstream.map.output.field.separator=,",
        "-Dmapred.textoutputformat.separator=,",
        "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
        "-Dmapreduce.input.fileinputformat.split.minsize=268435456",
        "-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
        "-mapper",
        "/bin/cat",
        "-reducer",
        "/bin/cat",
        "-inputformat",
        "org.apache.hadoop.mapred.lib.CombineTextInputFormat",
        "-outputformat",
        "org.apache.hadoop.mapred.TextOutputFormat",
        "-input",
        "gs://input/path/",
        "-output",
        "gs://output/path/"
      ],
      "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
    }
  }
}
When I submit the job through the console, it works. The REST equivalent for that job:
POST /v1/projects/project-id/regions/us-east1/jobs:submit/
{
  "projectId": "project-id",
  "job": {
    "reference": {
      "projectId": "project-id",
      "jobId": "jobDoesWork"
    },
    "placement": {
      "clusterName": "cluster-name",
      "clusterUuid": ""
    },
    "submittedBy": "user_email_account@email.com",
    "jobUuid": "uuid",
    "hadoopJob": {
      "args": [
        "-Dmapred.reduce.tasks=20",
        "-Dmapred.output.compress=true",
        "-Dmapred.compress.map.output=true",
        "-Dstream.map.output.field.separator=,",
        "-Dmapred.textoutputformat.separator=,",
        "-Dmapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec",
        "-Dmapreduce.input.fileinputformat.split.minsize=268435456",
        "-Dmapreduce.input.fileinputformat.split.maxsize=268435456",
        "-mapper",
        "/bin/cat",
        "-reducer",
        "/bin/cat",
        "-inputformat",
        "org.apache.hadoop.mapred.lib.CombineTextInputFormat",
        "-outputformat",
        "org.apache.hadoop.mapred.TextOutputFormat",
        "-input",
        "gs://input/path/",
        "-output",
        "gs://output/path/"
      ],
      "mainJarFileUri": "file:///usr/lib/hadoop-mapreduce/hadoop-streaming-2.8.4.jar"
    }
  }
}
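I SSHed into the box and confirmed that the file really does exist. The only difference I can actually see is "submittedBy". One works and one does not. My guess is that this is a permissions issue, but I cannot tell where the permissions come from in each case. In both cases the Dataproc cluster was created with the same service account.
Looking at the permissions on that jar on the cluster, I see: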
-rw-r--r-- 1 root root 133856 Nov 27 20:17 hadoop-streaming-2.8.4.jar
lrwxrwxrwx 1 root root 26 Nov 27 20:17 hadoop-streaming.jar -> hadoop-streaming-2.8.4.jar
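The mode strings in this listing can be checked mechanically. Here is a minimal sketch, using only java.nio, that parses the nine permission characters from the ls output and tests whether "others" have read access, which is what a non-root user would need in order to open the jar. The class name JarPermCheck and the method othersCanRead are hypothetical helpers for illustration. Which user the Dataproc job actually runs as is not known from this post.

```java
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

class JarPermCheck {
    // `lsMode` is the first column of `ls -l`, for example "-rw-r--r--".
    static boolean othersCanRead(String lsMode) {
        // Drop the leading file-type character; fromString expects exactly nine characters.
        Set<PosixFilePermission> perms = PosixFilePermissions.fromString(lsMode.substring(1));
        return perms.contains(PosixFilePermission.OTHERS_READ);
    }

    public static void main(String[] args) {
        // Mode of hadoop-streaming-2.8.4.jar from the listing.
        System.out.println(othersCanRead("-rw-r--r--"));
        // Mode of the hadoop-streaming.jar symlink from the listing.
        System.out.println(othersCanRead("lrwxrwxrwx"));
    }
}
```

Both entries in the listing grant read access to others, so on the face of it the file mode alone does not explain why one submission fails.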
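I tried changing mainJarFileUri from pointing explicitly at the versioned jar to pointing at the symlink (since it has open permissions), without really expecting that to work. It did not.
Does anyone with more Dataproc experience know what is happening here and how I can fix it?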
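【Comments】:
- Could you add the output of gcloud dataproc jobs describe <jobid> for the failed job attempt and, if possible, a snippet of the code you use to build the job settings programmatically?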
Tags: hadoop-streaming google-cloud-dataproc