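【Posted】: 2019-06-19 18:31:00
【Question】:
I have an HDFS directory that holds 4 files totaling 13.2 GB. I am trying to read all of them with Spark's wholeTextFiles method, but I am running into some problems.
Here is my code.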
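val path = "/tmp/cnt/warehouse/"
// Pass the variable, not the literal string "path"; 32 is only a minimum-partitions hint
val whole = sc.wholeTextFiles(path, 32)
// Each record is (fileName, fileContent); split the content into lines
val data = whole.map(r => (r._1, r._2.split("\r\n")))
// Flatten to one record per line (flatMap over r._1 would yield the characters of the file name)
val x = data.flatMap(r => r._2)
x.take(1000).foreach(println)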
Below is my spark-submit command.
spark2-submit \
--class SparkTest \
--master yarn \
--deploy-mode cluster \
--num-executors 32 \
--executor-memory 15G \
--driver-memory 25G \
--conf spark.yarn.maxAppAttempts=1 \
--conf spark.port.maxRetries=100 \
--conf spark.kryoserializer.buffer.max=1g \
--conf spark.yarn.queue=xyz \
SparkTest-1.0-SNAPSHOT.jar
- Even though I pass a minimum of 32 partitions, the data ends up in only 4 partitions.
- Is my spark-submit command correct?
The error is below.
Job aborted due to stage failure: Task 0 in stage 32.0 failed 4 times, most recent failure: Lost task 0.3 in stage 32.0 (TID 113, , executor 37): ExecutorLostFailure (executor 37 exited caused by one of the running tasks) Reason: Container from a bad node: container_e599_1560551438641_35180_01_000057 on host: . Exit status: 52. Diagnostics: Exception from container-launch.
Container id: container_e599_1560551438641_35180_01_000057
Exit code: 52
Stack trace: ExitCodeException exitCode=52:
at org.apache.hadoop.util.Shell.runCommand(Shell.java:604)
at org.apache.hadoop.util.Shell.run(Shell.java:507)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:789)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.__launchContainer__(LinuxContainerExecutor.java:399)
at org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor.launchContainer(LinuxContainerExecutor.java)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Container exited with a non-zero exit code 52
.
Driver stacktrace:
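A note on the log above: exit code 52 is, as far as the Spark source shows (`SparkExitCode.OOM`), the code an executor uses when it exits after an OutOfMemoryError. The rough size check below is only a sketch. It assumes that 1 GB is 10^9 bytes and that the four files are about equally sized (the question gives only the total), and the object name `WholeFileSizeCheck` is hypothetical.

```scala
object WholeFileSizeCheck {
  // Figures from the question; GB = 10^9 bytes and equal file sizes are assumptions
  val totalBytes: Long = 13200L * 1000L * 1000L
  val fileCount: Int = 4
  val avgFileBytes: Long = totalBytes / fileCount

  // A JVM array cannot hold more than Int.MaxValue elements,
  // which also bounds a single String or byte[] record
  val jvmArrayLimit: Long = Int.MaxValue.toLong

  // wholeTextFiles loads each file as ONE (fileName, content) record
  val fitsInOneRecord: Boolean = avgFileBytes <= jvmArrayLimit

  def main(args: Array[String]): Unit = {
    println(s"average file size:  $avgFileBytes bytes")
    println(s"JVM array limit:    $jvmArrayLimit elements")
    println(s"fits in one record: $fitsInOneRecord")
  }
}
```

Under those assumptions the average file is about 3.3 GB, above the cap of roughly 2.1 billion elements, which would be consistent with the out-of-memory exit regardless of how much executor memory is configured.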
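【Comments】:
- Possible duplicate of "Spark textFile vs wholeTextFiles". That should explain why wholeTextFiles fails when the files are large and why you only get 4 partitions. If possible, try using textFile instead.
- Could you provide the error you get when reading the files from HDFS?
- The error message has been added to the question, @Nikk.
Tags: scala apache-spark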
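The textFile suggestion in the first comment could look roughly like the sketch below. wholeTextFiles produces one record per file and never splits a file across partitions, so 4 files give at most 4 partitions. textFile produces one record per line and can split a large file into several partitions. The sketch assumes the files are in a splittable format (for example, not gzip-compressed) and that the file name of each line is not needed. The object name `TextFileSketch` and the helper `readLines` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object TextFileSketch {
  // Hypothetical helper: one record per line; minPartitions is a lower-bound hint
  def readLines(sc: SparkContext, path: String, minPartitions: Int): RDD[String] =
    sc.textFile(path, minPartitions)

  def main(args: Array[String]): Unit = {
    // Master and resources are expected to come from spark2-submit, as in the question
    val sc = new SparkContext(new SparkConf().setAppName("TextFileSketch"))
    try {
      val lines = readLines(sc, "/tmp/cnt/warehouse/", 32)
      // Check how many partitions Spark actually created
      println(s"partitions: ${lines.getNumPartitions}")
      lines.take(1000).foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```

Because no single record has to hold a whole file, each task only needs memory for the lines it is currently processing.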