【Title】: org.apache.spark.SparkException: Job aborted due to stage failure: Task from application
【Posted】: 2015-01-09 14:59:52
【Question】:

I have a problem running a Spark application on a standalone cluster (I am using Spark version 1.1.0). I successfully start the master with the command:

bash start-master.sh 

Then I start one worker with the command:

bash spark-class org.apache.spark.deploy.worker.Worker spark://fujitsu11:7077

On the master's web UI:

http://localhost:8080  

I can see that the master and the worker are running.

Then I run my application from Eclipse Luna. I successfully connect to the cluster with:

JavaSparkContext sc = new JavaSparkContext("spark://fujitsu11:7077", "myapplication");

After that the application works, but when the program reaches the following code:

 JavaRDD<Document> collectionRdd = sc.parallelize(list);

it crashes with the following error message:

 org.apache.spark.SparkException: Job aborted due to stage failure: Task 7 in stage 0.0 failed 4 times, most recent failure: Lost task 7.3 in stage 0.0 (TID 11, fujitsu11.inevm.ru):java.lang.ClassNotFoundException: maven.maven1.Document
    java.net.URLClassLoader$1.run(URLClassLoader.java:366)
    java.net.URLClassLoader$1.run(URLClassLoader.java:355)
    java.security.AccessController.doPrivileged(Native Method)
    java.net.URLClassLoader.findClass(URLClassLoader.java:354)
    java.lang.ClassLoader.loadClass(ClassLoader.java:425)
    java.lang.ClassLoader.loadClass(ClassLoader.java:358)
    java.lang.Class.forName0(Native Method)
    java.lang.Class.forName(Class.java:270)
    org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:59)
    java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
    java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.readArray(ObjectInputStream.java:1706)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1344)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.defaultReadObject(ObjectInputStream.java:500)
    org.apache.spark.rdd.ParallelCollectionPartition.readObject(ParallelCollectionRDD.scala:74)
    sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    java.lang.reflect.Method.invoke(Method.java:606)
    java.io.ObjectStreamClass.invokeReadObject(ObjectStreamClass.java:1017)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1893)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
    java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
    java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
    java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
    java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
    org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:62)
    org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:87)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:159)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:744)
 Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
at scala.Option.foreach(Option.scala:236)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1391)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

In the worker's shell output I found:

14/11/12 18:46:06 INFO ExecutorRunner: Launch command: "C:\PROGRA~1\Java\jdk1.7.0_51/bin/java" "-cp" ";;D:\spark\bin\..\conf;D:\spark\bin\..\lib\spark-assembly-1.1.0-hadoop1.0.4.jar;;D:\spark\bin\..\lib\datanucleus-api-jdo-3.2.1.jar;D:\spark\bin\..\lib\datanucleus-core-3.2.2.jar;D:\spark\bin\..\lib\datanucleus-rdbms-3.2.1.jar" "-XX:MaxPermSize=128m" "-Dspark.driver.port=50913" "-Xms512M" "-Xmx512M" "org.apache.spark.executor.CoarseGrainedExecutorBackend" "akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913/user/CoarseGrainedScheduler" "0" "fujitsu11.inevm.ru" "8" "akka.tcp://sparkWorker@fujitsu11.inevm.ru:50892/user/Worker" "app-20141112184605-0000"
14/11/12 18:46:40 INFO Worker: Asked to kill executor app-20141112184605-0000/0
14/11/12 18:46:40 INFO ExecutorRunner: Runner thread for executor app-20141112184605-0000/0 interrupted
14/11/12 18:46:40 INFO ExecutorRunner: Killing process!
14/11/12 18:46:40 INFO Worker: Executor app-20141112184605-0000/0 finished with state KILLED exitStatus 1
14/11/12 18:46:40 INFO LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40192.168.3.5%3A50955-2#1066511138] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/11/12 18:46:40 INFO LocalActorRef: Message [akka.remote.transport.AssociationHandle$Disassociated] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40192.168.3.5%3A50955-2#1066511138] was not delivered. [2] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
14/11/12 18:46:41 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@fujitsu11.inevm.ru:50892] -> [akka.tcp://sparkExecutor@fujitsu11.inevm.ru:50954]: Error [Association failed with [akka.tcp://sparkExecutor@fujitsu11.inevm.ru:50954]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@fujitsu11.inevm.ru:50954]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: no further information: fujitsu11.inevm.ru/192.168.3.5:50954
]
14/11/12 18:46:42 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@fujitsu11.inevm.ru:50892] -> [akka.tcp://sparkExecutor@fujitsu11.inevm.ru:50954]: Error [Association failed with [akka.tcp://sparkExecutor@fujitsu11.inevm.ru:50954]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@fujitsu11.inevm.ru:50954]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: no further information: fujitsu11.inevm.ru/192.168.3.5:50954
]
14/11/12 18:46:43 ERROR EndpointWriter: AssociationError [akka.tcp://sparkWorker@fujitsu11.inevm.ru:50892] -> [akka.tcp://sparkExecutor@fujitsu11.inevm.ru:50954]: Error [Association failed with [akka.tcp://sparkExecutor@fujitsu11.inevm.ru:50954]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkExecutor@fujitsu11.inevm.ru:50954]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: no further information: fujitsu11.inevm.ru/192.168.3.5:50954
]

And in the master's log:

14/11/12 18:46:41 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@fujitsu11:7077] -> [akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913]: Error [Association failed with [akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: no further information: fujitsu11.inevm.ru/192.168.3.5:50913
]
14/11/12 18:46:42 INFO Master: akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913 got disassociated, removing it.
14/11/12 18:46:42 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@fujitsu11:7077] -> [akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913]: Error [Association failed with [akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: no further information: fujitsu11.inevm.ru/192.168.3.5:50913
]
14/11/12 18:46:43 ERROR EndpointWriter: AssociationError [akka.tcp://sparkMaster@fujitsu11:7077] -> [akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913]: Error [Association failed with [akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkDriver@fujitsu11.inevm.ru:50913]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: no further information: fujitsu11.inevm.ru/192.168.3.5:50913
]

I googled a lot, but I have no idea what is wrong... I found a similar discussion here:

https://github.com/datastax/spark-cassandra-connector/issues/187

But it doesn't solve my problem...

Does anybody know what is wrong?

Thanks.

【Comments】:

  • I ran into the same error today. When I try the same thing from spark-shell it works and gives no error; running it through Maven throws the ClassNotFound exception. Have you tried this: stackoverflow.com/questions/24855368/… (running with spark-submit)?
  • Thanks for the quick reply! No, I will try your approach, thank you for the suggestion! Unfortunately, running from the IDE still throws the exception...
  • Got it. The problem is the classpath. Build a fat jar using the first answer here: stackoverflow.com/questions/574594/…. Then run the application with spark-submit. For me it looked like: ./spark-submit --class "sandbox.Mllib.MllibTest" --master "spark://JPawade.local:7077" /Users/aditya.pawade/Projects/IntelliJ/Sandbox/target/sandbox-1.0-SNAPSHOT-jar-with-dependencies.jar and then it ran. There may be other solutions, but this one definitely works.
  • Thank you, you are awesome! I will try it and report the result!

Tags: apache-spark


【Solution 1】:

For the benefit of others who run into this problem:

I had the same issue because of a version mismatch between Spark and the Spark connector in use. Spark was 1.3.1 and the connector was 1.3.0, and it produced the same error message:

org.apache.spark.SparkException: Job aborted due to stage failure:
  Task 2 in stage 0.0 failed 4 times, most recent failure: Lost 
  task 2.3 in stage 0.0

Updating the dependencies in SBT solved the problem, as in the sketch below.
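
Since the comments below ask what SBT is and how to update dependencies in it: the fix amounts to keeping the connector's version line in your build file matched to the Spark version on the cluster. A minimal build.sbt sketch follows, assuming the DataStax Cassandra connector from the issue linked in the question; the artifact names and version numbers here are illustrative and depend on which connector you actually use:

// build.sbt (illustrative versions; keep the connector's line matched to Spark's)
scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // Spark itself, "provided" because the standalone cluster supplies it at runtime
  "org.apache.spark" %% "spark-core" % "1.3.1" % "provided",
  // the connector built against the same Spark line as the cluster
  "com.datastax.spark" %% "spark-cassandra-connector" % "1.3.1"
)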

【Comments】:

  • How do I update the dependencies in SBT, and what is SBT?
  • @l--''''''------''''''''''''' SBT is the Scala Build Tool
【Solution 2】:

I found a way to run it from the IDE / Maven:

  1. Create a fat jar (one that includes all your dependencies). Use the Shade plugin for this. Example pom:
<plugin>
    <groupId>org.apache.maven.plugins</groupId>
    <artifactId>maven-shade-plugin</artifactId>
    <version>2.2</version>
    <configuration>
        <filters>
            <filter>
                <artifact>*:*</artifact>
                <excludes>
                    <exclude>META-INF/*.SF</exclude>
                    <exclude>META-INF/*.DSA</exclude>
                    <exclude>META-INF/*.RSA</exclude>
                </excludes>
            </filter>
        </filters>
    </configuration>
    <executions>
        <execution>
            <id>job-driver-jar</id>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <shadedArtifactAttached>true</shadedArtifactAttached>
                <shadedClassifierName>driver</shadedClassifierName>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                    <!--
                    Some care is required:
                    http://doc.akka.io/docs/akka/snapshot/general/configuration.html
                    -->
                    <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                        <resource>reference.conf</resource>
                    </transformer>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                        <mainClass>mainClass</mainClass>
                    </transformer>
                </transformers>
            </configuration>
        </execution>
        <execution>
            <id>worker-library-jar</id>
            <phase>package</phase>
            <goals>
                <goal>shade</goal>
            </goals>
            <configuration>
                <shadedArtifactAttached>true</shadedArtifactAttached>
                <shadedClassifierName>worker</shadedClassifierName>
                <transformers>
                    <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                </transformers>
            </configuration>
        </execution>
    </executions>
</plugin>
  2. Now we have to ship the compiled jar file to the cluster. To do that, specify the jar file in the Spark configuration like this:

SparkConf conf = new SparkConf()
        .setAppName("appName")
        .setMaster("spark://machineName:7077")
        .setJars(new String[] {"target/appName-1.0-SNAPSHOT-driver.jar"});

  3. Run mvn clean package to create the jar file. It will be created in your target folder.

  4. Run it from your IDE, or with the Maven command:

mvn exec:java -Dexec.mainClass="className"

This does not require spark-submit. Remember to package the project before running.

If you don't want to hardcode the jar path, you can do this instead:

  1. In the configuration, write:

SparkConf conf = new SparkConf()
        .setAppName("appName")
        .setMaster("spark://machineName:7077")
        .setJars(JavaSparkContext.jarOfClass(this.getClass()));

  2. Create the fat jar (as above) and, after running the package command, run it with:

java -jar target/application-1.0-SNAPSHOT-driver.jar

This picks up the jar path from whichever jar the class was loaded from, so no path needs to be hardcoded. A self-contained sketch follows.
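
For completeness, a minimal self-contained sketch of this second approach. The this.getClass() in the answer assumes instance context; in a static main you pass the class literal instead. The class name Main and the master URL are illustrative:

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaSparkContext;

public class Main {
    public static void main(String[] args) {
        // jarOfClass returns the path of the jar this class was loaded from,
        // so the fat jar you execute is also the one shipped to the workers.
        SparkConf conf = new SparkConf()
                .setAppName("appName")
                .setMaster("spark://machineName:7077")
                .setJars(JavaSparkContext.jarOfClass(Main.class));

        JavaSparkContext sc = new JavaSparkContext(conf);
        // ... build and use your RDDs here ...
        sc.stop();
    }
}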

【Comments】:

  • Man! It works like a charm on the local machine! (I used your second suggestion: built the fat jar, updated the SparkConf, and the worker started.) But when I try to connect to the master's machine from a remote machine, I get the same classpath error... What do you think: should I do something on the remote machine, or is some other trick needed? Thanks!
  • I don't know exactly how it works internally, but from what I have read, I think the driver creates a server from which all the other worker nodes in the cluster fetch the application jar. So you have two options: 1. the jar files of your dependencies should be available on the worker nodes' classpath (configured in SCALA_CLASSPATH), or 2. they should be available on the driver, and the workers connect to the driver node. Can you explain your architecture in more detail? I can't quite picture what the remote machine (worker/application) and the network look like.
  • My main machine is the one where I run the master and launch my application. The remote machine is the one where I only run bash spark-class org.apache.spark.deploy.worker.Worker spark://mastermachineIP:7077. Both machines are on the same local network, and the remote machine connects to the master successfully (I can see it on the master's localhost:8080). Maybe I should set something in SparkConf? When I built the fat jar, all I had was JavaSparkContext sc = new JavaSparkContext("spark://mastermachineIP:7077", "myapplication");
  • How are you running the application? From the master machine? Using the jar? Also, the fat jar must include the jar files of your dependencies. When creating the fat jar, you have to specify the jars and then package.
  • But how do you intend to deploy it in the end? Eventually you have to compile it, package it into a fat jar, and run it with java -jar, right? Can you try it that way? If you run from the IDE, you need to compile and package the application first, then specify the jar's path in SparkConf, and then run. I was able to run it this way on my system.
【Solution 3】:

I ran into the same error message; in my case the RDD was empty and an aggregation task was being run against it.

Listing this case here for the benefit of others who hit this error message: Job aborted due to stage failure: Task 9 in stage 24.0 failed 4 times

The advice at the link below was helpful: ".. the rdd turned out to be empty. A null pointer exception indicates an attempt to perform an aggregation task against a null value. Check your data for nulls where there should not be any, especially on those columns that are the subject of the aggregation." https://community.cloudera.com/t5/Support-Questions/PySpark-failuer-spark-SparkException-Job-aborted-due-to/td-p/171147
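
To make that failure mode concrete, here is a minimal sketch (Java, Spark 1.x API; all names and data are illustrative, and JavaRDD.isEmpty() is available from Spark 1.3 onward) of guarding an aggregation against nulls and an empty RDD:

import java.util.Arrays;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class EmptyRddGuard {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("emptyRddGuard").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical input: a column that may contain nulls.
        JavaRDD<Integer> column = sc.parallelize(Arrays.asList(1, null, 3));

        // Drop nulls before aggregating; a null inside reduce throws a
        // NullPointerException in a task, which the driver reports as
        // "Job aborted due to stage failure".
        JavaRDD<Integer> clean = column.filter(v -> v != null);

        if (clean.isEmpty()) {
            System.out.println("Nothing to aggregate");
        } else {
            System.out.println("Sum: " + clean.reduce((a, b) -> a + b));
        }
        sc.stop();
    }
}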

【Comments】:
