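【Posted】: 2017-02-11 19:33:35
【Question】:
I have a schema with several Int8 and String columns, which I wrote out in Parquet format to an S3A bucket for later use.
When I try to read this Parquet file with SqlContext.read.option("mergeSchema","false").parquet("s3a://...."), I get the exception below.
I also tried reading the Parquet file with parquet-tools (using the schema and meta options), but got an unknown-command error.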
*Exception in thread "main" org.apache.spark.sql.AnalysisException: Duplicate column(s) : "Int8", "String" found, cannot save to parquet format;
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.checkConstraints(ParquetRelation.scala:190)
at org.apache.spark.sql.execution.datasources.parquet.ParquetRelation.dataSchema(ParquetRelation.scala:199)
at org.apache.spark.sql.sources.HadoopFsRelation.schema$lzycompute(interfaces.scala:561)
at org.apache.spark.sql.sources.HadoopFsRelation.schema(interfaces.scala:560)
at org.apache.spark.sql.execution.datasources.LogicalRelation.<init>(LogicalRelation.scala:37)
at org.apache.spark.sql.SQLContext.baseRelationToDataFrame(SQLContext.scala:395)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:267)
at org.apache.spark.sql.SQLContext.parquetFile(SQLContext.scala:1052)
:
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:674)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:120)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)*
How can I make sure the Parquet file is written correctly? Does anyone know how to resolve this duplicate column error?
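For context, the exception names "Int8" and "String" as the duplicated columns, which suggests the column names themselves repeat in the schema. A minimal sketch of a pre-write check follows, in plain Python with no Spark dependency. The helper name `find_duplicate_columns` and the sample column list are hypothetical and not taken from the job above; the case-insensitive default is a conservative assumption, not a statement about how Spark compares names here.

```python
from collections import Counter


def find_duplicate_columns(names, case_sensitive=False):
    """Return the column names that occur more than once, in first-seen order.

    Hypothetical helper: comparison is case-insensitive by default, which is
    a conservative assumption rather than Spark's documented behavior.
    """
    keys = [n if case_sensitive else n.lower() for n in names]
    counts = Counter(keys)
    seen = set()
    duplicates = []
    for name, key in zip(names, keys):
        # Report each duplicated name once, using its first spelling.
        if counts[key] > 1 and key not in seen:
            seen.add(key)
            duplicates.append(name)
    return duplicates


# Hypothetical column list that would reproduce the names in the error message.
columns = ["Int8", "String", "Int8", "String", "id"]
print(find_duplicate_columns(columns))  # prints ['Int8', 'String']
```

In a PySpark job, the same function could be called on a DataFrame's `columns` list before writing, so that duplicated names are caught and renamed before the Parquet files are produced.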
【Discussion】:
Tags: apache-spark amazon-s3 parquet