[Posted]: 2015-06-24 23:56:01
[Question]:
The following simple JSON DataFrame test works fine when running Spark in local mode. Here is the Scala snippet, but I have also successfully implemented the same thing in Java and Python:
sparkContext.addFile(jsonPath)
val sqlContext = new org.apache.spark.sql.SQLContext(sparkContext)
val dataFrame = sqlContext.jsonFile(jsonPath)
dataFrame.show()
I have verified that jsonPath is valid on both the driver and worker sides, and I am calling addFile as well. The JSON file is trivial:
[{"age":21,"name":"abc"},{"age":30,"name":"def"},{"age":45,"name":"ghi"}]
When I switch out of local mode and use a separate Spark server with a single master/worker, the exact same code fails. I have tried the same test in Scala, Java, and Python, trying to find some combination that works. They all produce essentially the same error. The following error is from the Scala driver, but the Java/Python error messages are nearly identical:
15/04/17 18:05:26 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.2.15): java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)
at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)
This is frustrating. I am basically just trying to get the code snippets from the official documentation to work.
Update: thanks to Paul for the thorough response. I hit the same error when following the same steps. FYI, I was previously using a standalone driver program, hence the name sparkContext rather than the shell default sc. Here is an abbreviated snippet with the redundant log output removed:
➜ spark-1.3.0 ./bin/spark-shell --master spark://172.28.128.3:7077
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/___/ .__/\_,_/_/ /_/\_\ version 1.3.0
/_/
Using Scala version 2.11.2 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_40)
Type in expressions to have them evaluated.
Type :help for more information.
Spark context available as sc.
SQL context available as sqlContext.
scala> val dataFrame = sqlContext.jsonFile("/private/var/userspark/test.json")
15/04/20 18:01:06 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, 10.0.2.15): java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)
at java.io.ObjectInputStream.readFully(ObjectInputStream.java:1033)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:63)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
at org.apache.hadoop.io.UTF8.readChars(UTF8.java:216)
at org.apache.hadoop.io.UTF8.readString(UTF8.java:208)
at org.apache.hadoop.mapred.FileSplit.readFields(FileSplit.java:87)
at org.apache.hadoop.io.ObjectWritable.readObject(ObjectWritable.java:237)
(...)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, 10.0.2.15): java.io.EOFException
at java.io.ObjectInputStream$BlockDataInputStream.readFully(ObjectInputStream.java:2747)
[Discussion]:
-
I remember reading that it can't really parse full JSON, i.e. it doesn't want the opening/closing square brackets, and it needs one object per line so that it can split the file and process it in parallel.
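If that comment is right, the test file above would need to be rewritten in the line-delimited form (often called JSON Lines) that jsonFile expects: the surrounding array brackets dropped and each object on its own line, for example:

```json
{"age":21,"name":"abc"}
{"age":30,"name":"def"}
{"age":45,"name":"ghi"}
```

This is only a sketch of the expected input format based on the comment above; it would not by itself explain the java.io.EOFException in the question, which occurs during task deserialization rather than JSON parsing.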