Spark job fails due to java.io.NotSerializableException: org.apache.spark.SparkContext

【Question title】: Spark job is failed due to java.io.NotSerializableException: org.apache.spark.SparkContext
【Posted】: 2014-06-29 14:37:52
【Question description】:

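I get the exception above when I try to apply a method (computeDwt) to an input of type RDD[(Int,ArrayBuffer[(Int,Double)])]. I am even using the extends Serialization option to make the objects serializable in Spark. Here is the code snippet.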

Input: series: RDD[(Int,ArrayBuffer[(Int,Double)])]
DWTsample extends Serialization is a class that has a computeDwt function.
sc: SparkContext

val  kk:RDD[(Int,List[Double])]=series.map(t=>(t._1,new DWTsample().computeDwt(sc,t._2)))

Error:
org.apache.spark.SparkException: Job failed: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:760)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:758)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:758)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:556)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:503)
at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:361)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$run(DAGScheduler.scala:441)
at org.apache.spark.scheduler.DAGScheduler$$anon$1.run(DAGScheduler.scala:149)

Can anyone suggest what the problem might be and what I should do to overcome it?

【Comments on the question】:

Possible duplicate: stackoverflow.com/questions/21071152/…

Tags: java scala hadoop apache-spark

【Solution 1】:

The line

series.map(t=>(t._1,new DWTsample().computeDwt(sc,t._2)))

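references the SparkContext (sc), but SparkContext is not serializable. Spark serializes the function passed to map in order to ship it to the workers, so everything that function captures must be serializable too. SparkContext is designed to expose operations that run on the driver; it cannot be referenced or used by code that runs on a worker.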

You have to restructure your code so that sc is not referenced inside the closure of the map function.
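
A minimal sketch of one such restructuring, assuming computeDwt does not actually need the SparkContext to process a single series. The question does not show the body of computeDwt, so the body below (a single pairwise-averaging step) is only a hypothetical placeholder, and the DwtJob object, the app name, the local master setting, and the sample data are likewise made up for illustration. The point is only that sc is dropped from the method signature and therefore from the closure.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import scala.collection.mutable.ArrayBuffer

// Stand-in for the asker's class. computeDwt no longer takes a SparkContext,
// so nothing driver-only is pulled into the map closure.
class DWTsample extends Serializable {
  // Placeholder body (averages neighbouring pairs); NOT the asker's real algorithm.
  def computeDwt(points: ArrayBuffer[(Int, Double)]): List[Double] =
    points.map(_._2).grouped(2).map(pair => pair.sum / pair.size).toList
}

// Hypothetical driver program.
object DwtJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dwt-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // sc is used here, on the driver, to build the input RDD (sample data).
      val series: RDD[(Int, ArrayBuffer[(Int, Double)])] = sc.parallelize(Seq(
        1 -> ArrayBuffer(0 -> 1.0, 1 -> 3.0, 2 -> 5.0, 3 -> 7.0)
      ))

      // The closure below refers only to its own argument t, never to sc.
      val kk: RDD[(Int, List[Double])] =
        series.map(t => (t._1, new DWTsample().computeDwt(t._2)))

      kk.collect().foreach(println)
    } finally sc.stop()
  }
}
```

If computeDwt needs some read-only data that lives on the driver, one common option is to wrap that data with sc.broadcast(...) on the driver and read its .value inside the closure, rather than passing sc itself.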
