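【Posted】: 2018-07-28 19:24:24
【Question】:
I have a set of prefixed Parquet files (prefixed as the S3 performance guidance recommends) that I want to load in Spark (using Amazon EMR 5.11.1), but:
- Listing the set of files that match the glob is much slower than listing files without prefixes. Can this be improved?
- How can I avoid the following error?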
val df = spark.read.parquet("s3://bucket/????/analytics")
java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
s3://bucket/4a73/analytics
s3://bucket/8163/analytics
If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
at scala.Predef$.assert(Predef.scala:170)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:132)
at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:97)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:153)
at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:70)
at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:134)
at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
... 48 elided
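The error message itself suggests loading each root directory separately and then unioning the results. A minimal sketch of that idea follows. It assumes every prefixed directory holds Parquet data with the same schema and column order, since `union` matches columns by position. The object name `LoadPrefixedParquet` and the application name are hypothetical, and this has not been run against the bucket in the question.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical driver object; only the bucket layout comes from the question.
object LoadPrefixedParquet {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("load-prefixed-parquet") // assumed name, for illustration only
      .getOrCreate()

    // Expand the glob once on the driver using the Hadoop FileSystem API.
    val pattern = new Path("s3://bucket/????/analytics")
    val fs = pattern.getFileSystem(spark.sparkContext.hadoopConfiguration)

    // globStatus may return null when nothing matches, so guard with Option.
    val roots: Seq[String] = Option(fs.globStatus(pattern))
      .map(_.toSeq)
      .getOrElse(Seq.empty)
      .map(_.getPath.toString)
    require(roots.nonEmpty, s"No paths matched $pattern")

    // Read each root on its own, so partition discovery sees a single base
    // path per read, then combine the DataFrames by column position.
    val df: DataFrame = roots
      .map(root => spark.read.parquet(root))
      .reduce(_ union _)

    df.printSchema()
    spark.stop()
  }
}
```

Each `spark.read.parquet(root)` call performs its own listing and schema inference, so this sketch addresses the assertion error only and is not expected to make listing faster.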
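【Comments】:
-
I believe hashed prefixes in S3 object keys are recommended only if you expect load above 100 queries per second. A single path from which to read all the Parquet files would indeed be much easier, for example
s3://bucket/analytics. If your QPS does not really exceed 100, you may want to reconsider your directory structure, unless there are other requirements?
Tags: apache-spark amazon-s3 emr amazon-emr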