【Question Title】: Spark SQL query execution fails with org.apache.parquet.io.ParquetDecodingException
【Posted】: 2020-08-03 21:22:43
【Question】:

I am running a simple CREATE TABLE query in Spark SQL via spark-submit (cluster mode), and it fails with org.apache.parquet.io.ParquetDecodingException. I found some discussion of this problem online; one suggestion was to add the configuration spark.sql.parquet.writeLegacyFormat=true, but the problem persists even after adding that setting.
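For reference, a sketch of how such a setting is typically passed at submit time (the master, main class, and jar name below are placeholders, not taken from the original job):

```shell
# Hypothetical spark-submit invocation; class and jar names are placeholders.
spark-submit \
  --master yarn \
  --deploy-mode cluster \
  --conf spark.sql.parquet.writeLegacyFormat=true \
  --class com.example.CreateTableJob \
  my-spark-job.jar
```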

Here is the query:

spark.sql("""
  CREATE TABLE TestTable
  STORED AS PARQUET
  AS
  SELECT Col1,
         Col2,
         Col3
  FROM Stable""")

Error description:

Caused by: org.apache.parquet.io.ParquetDecodingException: Can not read value at 1 in block 0 in file maprfs:///path/disputer/1545555-r-00000.snappy.parquet
at org.apache.parquet.hadoop.InternalParquetRecordReader.nextKeyValue(InternalParquetRecordReader.java:461)
at org.apache.parquet.hadoop.ParquetRecordReader.nextKeyValue(ParquetRecordReader.java:219)
at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:109)
at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:186)
... 13 more
Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt
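The ClassCastException (MutableLong cannot be cast to MutableInt) generally points to a type mismatch: the Parquet files store a column as a 64-bit integer (BIGINT) while the table schema declares it as a 32-bit INT. One way to confirm, assuming parquet-tools is available on the cluster (the file path is the one from the stack trace):

```shell
# Print the physical schema stored in the Parquet file itself
# (path copied from the stack trace above).
parquet-tools schema maprfs:///path/disputer/1545555-r-00000.snappy.parquet
```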

Spark configuration file:

spark.driver.memory=10G
spark.executor.memory=23G 
spark.executor.cores=3
spark.executor.instances=100  
spark.dynamicAllocation.enabled=false      
spark.yarn.preserve.staging.files=false  
spark.yarn.executor.extraJavaOptions=-XX:MaxDirectMemorySize=6144m    
spark.sql.shuffle.partitions=1000
spark.shuffle.service=true  
spark.yarn.maxAppAttempts=1  
spark.broadcastTimeout=36000  
spark.debug.maxToStringFields=100  
spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2  
spark.network.timeout=600s  
spark.sql.parquet.enableVectorizedReader=false
spark.scheduler.listenerbus.eventqueue.capacity=200000  
spark.driver.memoryOverhead=1024  
spark.yarn.executor.memoryOverhead=5120  
spark.executor.extraJavaOptions=-XX:+UseG1GC  
spark.driver.extraJavaOptions=-XX:+UseG1GC

【Discussion】:

    Tags: scala apache-spark apache-spark-sql


    【Solution 1】:

    The problem was caused by disabling spark.sql.parquet.enableVectorizedReader. Setting spark.sql.parquet.enableVectorizedReader=true resolved the issue.

    For more details, see https://jaceklaskowski.gitbooks.io/mastering-spark-sql/spark-sql-vectorized-parquet-reader.html
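As a sketch, the flag can be enabled either when building the session or on an already-running one (the app name below is a placeholder):

```scala
import org.apache.spark.sql.SparkSession

// Build a session with the vectorized Parquet reader enabled
// ("CtasParquet" is a placeholder app name, not from the original job).
val spark = SparkSession.builder()
  .appName("CtasParquet")
  .config("spark.sql.parquet.enableVectorizedReader", "true")
  .getOrCreate()

// Or toggle it on an existing session:
spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")
```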

    【Discussion】:
