Loading prefixed parquet files from s3 - Suspicious paths

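【Question title】: Loading prefixed parquet files from s3 - Suspicious paths
【Posted】: 2018-07-28 19:24:24
【Question description】:

I have a set of prefixed (as per the S3 performance recommendations) parquet files that I would like to load in Spark (using Amazon EMR 5.11.1), but:

Listing the set of files that match the glob is much slower than with non-prefixed files. Can this be improved?
How can I avoid the following error?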

 val df = spark.read.parquet("s3://bucket/????/analytics")

java.lang.AssertionError: assertion failed: Conflicting directory structures detected. Suspicious paths:
        s3://bucket/4a73/analytics
        s3://bucket/8163/analytics

If provided paths are partition directories, please set "basePath" in the options of the data source to specify the root directory of the table. If there are multiple root directories, please load them separately and then union them.
  at scala.Predef$.assert(Predef.scala:170)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:132)
  at org.apache.spark.sql.execution.datasources.PartitioningUtils$.parsePartitions(PartitioningUtils.scala:97)
  at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.inferPartitioning(PartitioningAwareFileIndex.scala:153)
  at org.apache.spark.sql.execution.datasources.InMemoryFileIndex.partitionSpec(InMemoryFileIndex.scala:70)
  at org.apache.spark.sql.execution.datasources.PartitioningAwareFileIndex.partitionSchema(PartitioningAwareFileIndex.scala:50)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:134)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:353)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:559)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:543)
  ... 48 elided

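【Comments】:

I think hashed prefixes in S3 object keys are recommended if you expect a load in excess of 100 queries/sec. It would indeed be much easier to have a single path from which to read all the parquet files, e.g. s3://bucket/analytics. If your QPS does not really exceed 100, you may want to reconsider your directory structure, unless there are other requirements?

Tags: apache-spark amazon-s3 emr amazon-emr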

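【Solution 1】:

You can use s3a instead of s3. That may work for you.

1. You also need the hadoop-aws 2.7.1 JAR on the classpath. This JAR contains the class org.apache.hadoop.fs.s3a.S3AFileSystem.

2. In spark.properties you can set the credentials like this: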

spark.hadoop.fs.s3a.access.key=ACCESSKEY  
spark.hadoop.fs.s3a.secret.key=SECRETKEY
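
As a rough illustration, the same two properties can also be set in code when the session is built. This is only a sketch under assumptions that are not in the answer: it runs as a standalone application that creates its own SparkSession, the hadoop-aws JAR is already on the classpath, and the credentials come from the environment variables AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. The bucket path reuses one of the paths from the question with the s3a scheme.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: credentials are supplied through these environment variables;
// sys.env(...) throws NoSuchElementException if one of them is missing.
val accessKey = sys.env("AWS_ACCESS_KEY_ID")
val secretKey = sys.env("AWS_SECRET_ACCESS_KEY")

// Properties prefixed with "spark.hadoop." are copied into the Hadoop
// configuration when the Spark context is created.
val spark = SparkSession.builder()
  .appName("s3a-example") // hypothetical application name
  .config("spark.hadoop.fs.s3a.access.key", accessKey)
  .config("spark.hadoop.fs.s3a.secret.key", secretKey)
  .getOrCreate()

// Same data as in the question, addressed through the s3a connector.
val df = spark.read.parquet("s3a://bucket/4a73/analytics")
df.printSchema()
```

If a session already exists (for example in spark-shell), getOrCreate returns it and these Hadoop settings may not take effect, so the properties-file form above is likely the safer choice there.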

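【Comments】:

On Amazon EMR, s3:// is bound to Amazon's own (closed source) connector. I believe it is the only one supported.
This might help with performance, though I doubt it helps with the error message above.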
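
The error message itself suggests a way around the assertion: load each root directory separately and then union the results. The following is a minimal sketch of that suggestion, not a tested fix for this exact setup. It assumes every prefixed directory holds parquet data with the same columns in the same order, and it expands the glob on the driver with Hadoop's FileSystem.globStatus. It can be pasted into spark-shell, where getOrCreate returns the existing session.

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.{DataFrame, SparkSession}

val spark = SparkSession.builder().getOrCreate()

// Expand the glob from the question into concrete root directories.
val pattern = new Path("s3://bucket/????/analytics")
val fs = pattern.getFileSystem(spark.sparkContext.hadoopConfiguration)

// globStatus can return null when nothing matches, so wrap it in Option.
val roots: Seq[String] = Option(fs.globStatus(pattern))
  .map(_.toSeq)
  .getOrElse(Seq.empty)
  .map(_.getPath.toString)

require(roots.nonEmpty, s"No directories matched $pattern")

// Read every root on its own, so partition discovery sees a single
// base path per read, then combine the results.
// Note: union matches columns by position, not by name.
val df: DataFrame = roots
  .map(root => spark.read.parquet(root))
  .reduce(_ union _)

df.printSchema()
```

Each read still lists its own directory, so this addresses the assertion rather than the listing time asked about in the first part of the question.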