【Question Title】: Error ExecutorLostFailure when running a task in Spark
【Posted】: 2015-10-10 08:20:56
【Problem Description】:

Hello, I am a beginner with Spark. I am trying to run a job on Spark 1.4.1 with 8 slave nodes, each with 11.7 GB of memory and 3.2 GB of disk. I am running the Spark task from one of the slave nodes (out of the 8), so with a storage fraction of 0.7 only about 4.8 GB is available on each node, and I am using Mesos as the cluster manager. I am using this configuration:

spark.master mesos://uc1f-bioinfocloud-vamp-m-1:5050
spark.eventLog.enabled true
spark.driver.memory 6g
spark.storage.memoryFraction 0.7
spark.core.connection.ack.wait.timeout 800
spark.akka.frameSize 50
spark.rdd.compress true

I am trying to run the Spark MLlib Naive Bayes algorithm on a folder of about 14 GB of data. (There is no problem when I run the task on a 6 GB folder.) I am reading this folder from Google Storage as an RDD and passing 32 as the partitions argument (I have also tried increasing the partitions). I then create feature vectors using TF and predict on top of them. But whenever I try to run it on this folder, it throws an ExecutorLostFailure every time. I have tried different configurations, but nothing has helped. Maybe I am missing something very basic, but I cannot figure it out. Any help or suggestion would be very valuable.
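For context, the pipeline described above looks roughly like this in PySpark (a minimal, non-runnable sketch: it needs a Spark cluster, the input path is taken from the event log below, and `label_of` is a hypothetical placeholder for however the labels are derived):

```python
from pyspark import SparkContext
from pyspark.mllib.feature import HashingTF
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

sc = SparkContext(appName="V2ProcessRecords")

# Read the ~14 GB folder from Google Storage as (path, content) pairs,
# asking for 32 partitions (the value the question mentions trying).
docs = sc.wholeTextFiles(
    "gs://uc1f-bioinfocloud-vamp-m/literature/xml/P*/*.nxml", 32)

# Build term-frequency feature vectors and train Naive Bayes.
# label_of() is a placeholder, not part of the original question.
tf = HashingTF()
points = docs.map(lambda kv: LabeledPoint(label_of(kv[0]),
                                          tf.transform(kv[1].split())))
model = NaiveBayes.train(points)
```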

The log is:

   15/07/21 01:18:20 ERROR TaskSetManager: Task 3 in stage 2.0 failed 4 times; aborting job    
15/07/21 01:18:20 INFO TaskSchedulerImpl: Cancelling stage 2    
15/07/21 01:18:20 INFO TaskSchedulerImpl: Stage 2 was cancelled    
15/07/21 01:18:20 INFO DAGScheduler: ResultStage 2 (collect at /opt/work/V2ProcessRecords.py:213) failed in 28.966 s    
15/07/21 01:18:20 INFO DAGScheduler: Executor lost: 20150526-135628-3255597322-5050-1304-S8 (epoch 3)    
15/07/21 01:18:20 INFO BlockManagerMasterEndpoint: Trying to remove executor 20150526-135628-3255597322-5050-1304-S8 from BlockManagerMaster.    
15/07/21 01:18:20 INFO DAGScheduler: Job 2 failed: collect at /opt/work/V2ProcessRecords.py:213, took 29.013646 s    
Traceback (most recent call last):    
  File "/opt/work/V2ProcessRecords.py", line 213, in <module>
    secondPassRDD = firstPassRDD.map(lambda ( name, title,  idval, pmcId, pubDate, article, tags , author, ifSigmaCust, wclass): ( str(name), title,  idval, pmcId, pubDate, article, tags , author, ifSigmaCust , "Yes" if ("PMC" + pmcId) in rddNIHGrant else ("No") , wclass)).collect()    
  File "/usr/local/spark/python/lib/pyspark.zip/pyspark/rdd.py", line 745, in collect    
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__    
  File "/usr/local/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 3 in stage 2.0 failed 4 times, most recent failure: Lost task 3.3 in stage 2.0 (TID 12, vamp-m-2.c.quantum-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)    
Driver stacktrace:    
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

15/07/21 01:18:20 INFO BlockManagerMaster: Removed 20150526-135628-3255597322-5050-1304-S8 successfully in removeExecutor
15/07/21 01:18:20 INFO DAGScheduler: Host added was in lost list earlier:vamp-m-2.c.quantum-854.internal
Jul 21, 2015 1:01:15 AM INFO: parquet.hadoop.ParquetFileReader: Initiating action with parallelism: 5
15/07/21 01:18:20 INFO SparkContext: Invoking stop() from shutdown hook



{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}

{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616389696,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":11,"Index":6,"Attempt":2,"Launch Time":1437616381852,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616389697,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerExecutorAdded","Timestamp":1437616389707,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Executor Info":{"Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Total Cores":1,"Log Urls":{}}}
{"Event":"SparkListenerTaskStart","Stage ID":2,"Stage Attempt ID":0,"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":0,"Failed":false,"Accumulables":[]}}
{"Event":"SparkListenerExecutorRemoved","Timestamp":1437616397743,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Removed Reason":"Lost executor"}
{"Event":"SparkListenerTaskEnd","Stage ID":2,"Stage Attempt ID":0,"Task Type":"ResultTask","Task End Reason":{"Reason":"ExecutorLostFailure","Executor ID":"20150526-135628-3255597322-5050-1304-S8"},"Task Info":{"Task ID":12,"Index":6,"Attempt":3,"Launch Time":1437616389702,"Executor ID":"20150526-135628-3255597322-5050-1304-S8","Host":"uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal","Locality":"PROCESS_LOCAL","Speculative":false,"Getting Result Time":0,"Finish Time":1437616397743,"Failed":true,"Accumulables":[]}}
{"Event":"SparkListenerStageCompleted","Stage Info":{"Stage ID":2,"Stage Attempt ID":0,"Stage Name":"collect at /opt/work/V2ProcessRecords.py:215","Number of Tasks":72,"RDD Info":[{"RDD ID":6,"Name":"PythonRDD","Parent IDs":[0],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0},{"RDD ID":0,"Name":"gs://uc1f-bioinfocloud-vamp-m/literature/xml/P*/*.nxml","Scope":"{\"id\":\"0\",\"name\":\"wholeTextFiles\"}","Parent IDs":[],"Storage Level":{"Use Disk":false,"Use Memory":false,"Use ExternalBlockStore":false,"Deserialized":false,"Replication":1},"Number of Partitions":72,"Number of Cached Partitions":0,"Memory Size":0,"ExternalBlockStore Size":0,"Disk Size":0}],"Parent IDs":[],"Details":"","Submission Time":1437616365566,"Completion Time":1437616397753,"Failure Reason":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Accumulables":[]}}
{"Event":"SparkListenerJobEnd","Job ID":2,"Completion Time":1437616397755,"Job Result":{"Result":"JobFailed","Exception":{"Message":"Job aborted due to stage failure: Task 6 in stage 2.0 failed 4 times, most recent failure: Lost task 6.3 in stage 2.0 (TID 12, uc1f-bioinfocloud-vamp-m-2.c.quantum-device-854.internal): ExecutorLostFailure (executor 20150526-135628-3255597322-5050-1304-S8 lost)\nDriver stacktrace:","Stack Trace":[{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages","File Name":"DAGScheduler.scala","Line Number":1266},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1257},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"scala.collection.mutable.ResizableArray$class","Method Name":"foreach","File Name":"ResizableArray.scala","Line Number":59},{"Declaring Class":"scala.collection.mutable.ArrayBuffer","Method Name":"foreach","File Name":"ArrayBuffer.scala","Line Number":47},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"abortStage","File Name":"DAGScheduler.scala","Line Number":1256},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1","Method Name":"apply","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"scala.Option","Method Name":"foreach","File Name":"Option.scala","Line Number":236},{"Declaring Class":"org.apache.spark.scheduler.DAGScheduler","Method Name":"handleTaskSetFailed","File Name":"DAGScheduler.scala","Line Number":730},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1450},{"Declaring Class":"org.apache.spark.scheduler.DAGSchedulerEventProcessLoop","Method Name":"onReceive","File Name":"DAGScheduler.scala","Line Number":1411},{"Declaring Class":"org.apache.spark.util.EventLoop$$anon$1","Method Name":"run","File Name":"EventLoop.scala","Line Number":48}]}}}

【Question Discussion】:

  • Did you find a solution?

Tags: apache-spark pyspark apache-spark-mllib collect


【Solution 1】:

It is hard to say what the problem is without the logs of the failed executor (rather than the driver's), but most likely it is a memory issue. Try significantly increasing the number of partitions (if you currently use 32, try 200).
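As a back-of-the-envelope check in plain Python (the 14 GB input size comes from the question; the partition counts are just examples), more partitions means each task has to hold less data at once:

```python
def mb_per_partition(total_gb, partitions):
    """Approximate MB of input each task must handle."""
    return total_gb * 1024 / partitions

# ~448 MB per task at 32 partitions vs. ~72 MB at 200 and ~18 MB at 800.
for parts in (32, 200, 800):
    print(f"{parts:>4} partitions -> ~{mb_per_partition(14, parts):.0f} MB per task")
```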

【Discussion】:

  • I tried 200 partitions, but it still failed, even with 800 partitions and some other configuration settings.
  • I still hit the same problem: Task Lost 4 times followed by ExecutorLostFailure, and sometimes I get a connection timeout. Also, since I am on a Google Cloud Mesos cluster, I tried to find the logs as you suggested and looked at /var/log/mesos (by default both master and slave logs are in /var/log/mesos, as recommended in the Spark-on-Mesos docs), but I did not find anything useful there. Are there any other logs I can check or post here? Is that what you meant by the executor logs?
【Solution 2】:

This error occurs because a task failed more than four times. Try increasing the parallelism in your cluster using the following parameter:

--conf "spark.default.parallelism=100" 

Set the parallelism value to 2 to 3 times the number of cores available on your cluster. If that does not work, try increasing the parallelism exponentially: if your current parallelism does not work, multiply it by 2, and so on. I have also observed that it helps if your parallelism is a prime number, especially if you are using groupByKey.
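As a submit-time configuration fragment, that flag would look like this (a hypothetical invocation: only the `--conf` value comes from this answer, and the script path is the one from the question's traceback):

```shell
# Tune the value to 2-3x your total core count.
spark-submit \
  --conf "spark.default.parallelism=100" \
  /opt/work/V2ProcessRecords.py
```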

【Discussion】:

  • Increasing the partition number did not help in my case, but setting the parallelism did.
【Solution 3】:

I ran into this problem, and for me the cause was a very high occurrence of one key in a reduceByKey task. This was (I think) causing a massive list to be collected on one of the executors, which would then throw an OOM error.

The solution for me was to filter out heavily populated keys before performing the reduceByKey, though I appreciate that this may or may not be feasible depending on your application. I did not need all of my data anyway.
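The idea can be illustrated in plain Python with toy data (key names and the threshold are made up; on a real cluster you would estimate frequencies from a sample rather than a full pass, then apply the same filter before the reduce-by-key step):

```python
from collections import Counter

# Toy data: one pathologically frequent key plus many normal keys.
pairs = [("hot", 1)] * 50_000 + [(f"k{i}", 1) for i in range(100)]

# Step 1: count key frequencies (on a cluster, use a sampled count).
freq = Counter(k for k, _ in pairs)

# Step 2: drop over-represented keys so no single reducer has to
# materialize a huge list of values for one key.
THRESHOLD = 10_000  # application-specific cutoff
kept = [(k, v) for k, v in pairs if freq[k] <= THRESHOLD]
```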

【Discussion】:

【Solution 4】:

As far as I understand, the most common cause of ExecutorLostFailure is an OOM in the executor.

To resolve an OOM issue, one needs to figure out what exactly is causing it. Simply increasing the default parallelism or the executor memory is not a strategic solution.

If you look at what increasing parallelism does, it tries to create more tasks so that each one processes less and less data. But if your data is skewed such that the key on which partitioning happens (for parallelism) carries most of the data, simply increasing parallelism will have no effect.

Similarly, just increasing executor memory is a very inefficient way of handling such a scenario: if only one executor is failing with ExecutorLostFailure, requesting more memory for all executors will make your application need far more memory than it actually requires.
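The point about skew can be demonstrated with a small hash-partitioning simulation in plain Python (the data is synthetic): every record of the dominant key hashes to the same partition, so the largest task never shrinks no matter how many partitions you add.

```python
from collections import Counter

# Synthetic skewed dataset: one key carries 90% of the records.
records = ["hot"] * 9_000 + [f"key{i}" for i in range(1_000)]

for n_partitions in (8, 64, 512):
    # Hash partitioning sends all records of a given key to one partition,
    # so the hottest partition stays at >= 9,000 records regardless.
    sizes = Counter(hash(k) % n_partitions for k in records)
    print(n_partitions, "partitions -> largest partition:", max(sizes.values()))
```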

【Discussion】:

  • But how do you find out what exactly is causing it? And where is the solution?
  • I have the same problem. I even tried repartitioning my dataset to fix the skew, but it still fails.