How to read different partition formats in Avro from S3 to Spark? (Answers)

【Question title】: How to read different partition formats in Avro from S3 to Spark?
【Posted】: 2019-04-14 20:56:08
【Question description】:

I have an S3 bucket that contains two partition layouts:

S3://bucketname/tablename/year/month/day
S3://bucketname/tablename/device/year/month/day

The files are in Avro format.

I tried to read them with val df = spark.read.format("com.databricks.spark.avro").load("s3://bucketname/tablename").

The error message is

java.lang.AssertionError: assertion failed: Conflicting partition column names detected:

    Partition column name list #0: xx, yy
    Partition column name list #1: xx

For partitioned table directories, data files should only live in leaf directories.
And directories at the same level should have the same partition column name.
Please check the following directories for unexpected files or inconsistent partition column names:

【Question comments】:

Tags: apache-spark amazon-s3 apache-spark-sql avro

【Solution 1】:

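You cannot read both layouts in a single load. As the error itself states,

directories at the same level should have the same partition column name.

Read them separately (using two S3 paths that go all the way down to the leaf directories), and then, if the schemas match, you can union the resulting DataFrames.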
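
A minimal sketch of that approach, under assumptions the question does not confirm: the directories are Hive-style (key=value) and the keys are named year, month, day and device (the error message only shows placeholder names), and Spark 2.3+ is available for unionByName. Each layout is loaded through its own glob, with the basePath option set so that the partition columns are still discovered. The object name ReadBothLayouts and the helper readLayout are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.lit

object ReadBothLayouts {
  // ASSUMPTION: Hive-style directories with these key names; replace with the real ones.
  val root = "s3://bucketname/tablename"
  val dateOnlyGlob = s"$root/year=*/month=*/day=*"
  val deviceGlob = s"$root/device=*/year=*/month=*/day=*"

  // Load one layout only. basePath tells partition discovery where the
  // table starts, so the key=value directories become columns.
  def readLayout(spark: SparkSession, glob: String): DataFrame =
    spark.read
      .format("com.databricks.spark.avro")
      .option("basePath", root)
      .load(glob)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-both-layouts").getOrCreate()

    val withDevice = readLayout(spark, deviceGlob)

    // The date-only layout has no device column; add it as NULL with the
    // same type as in the other DataFrame so the two schemas line up.
    val deviceType = withDevice.schema("device").dataType
    val dateOnly = readLayout(spark, dateOnlyGlob)
      .withColumn("device", lit(null).cast(deviceType))

    // unionByName matches columns by name rather than by position.
    val all = withDevice.unionByName(dateOnly)
    all.printSchema()

    spark.stop()
  }
}
```

If the data columns of the two layouts differ beyond the partition columns, the union will still fail, so it may be worth comparing the two schemas with printSchema() before combining them.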

【Comments】: