Answers: Loading an Amazon S3 file with colons in the filename through pyspark

【Question title】: Load an Amazon S3 file which has colons within the filename through pyspark
【Posted】: 2015-12-04 16:36:33
【Question description】:

I have an S3 bucket containing several files with colons in their filenames.

Example:

s3://my_bucket/my_data/en/2015120/batch:222:111:00000.jl.gz

I am trying to load it into a Spark RDD and access the first line as follows.

my_data = sc.textFile("s3://my_bucket/my_data/en/2015120/batch:222:111:00000.jl.gz")
my_data.take(1)

But this throws:

IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI:

Any suggestions for loading these files individually or, preferably, as a whole folder?

【Comments】:

You could try using * in the filename, e.g. 's3://path/*.gz'. I am using the same setup as you have above, and it works for me.

Tags: python amazon-s3 apache-spark pyspark

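【Solution 1】:

I got it working by replacing the colons with their URL-encoded form.

That is,

: is replaced with %3A

To double-check, click one of the objects in S3 and look at its "Link".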
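
A minimal sketch of this replacement in Python. The helper name `encode_colons` is hypothetical, and it rewrites only the object key, leaving the scheme and bucket untouched. Whether the encoded path then resolves may depend on the Hadoop and S3 connector versions in use.

```python
def encode_colons(s3_url):
    """Replace ':' with '%3A' in the key part of an S3 URL (hypothetical helper)."""
    # Split off the scheme so the ':' in 's3://' is left alone.
    scheme, sep, rest = s3_url.partition("://")
    # The first path segment is the bucket; everything after it is the key.
    bucket, slash, key = rest.partition("/")
    return scheme + sep + bucket + slash + key.replace(":", "%3A")


encoded = encode_colons(
    "s3://my_bucket/my_data/en/2015120/batch:222:111:00000.jl.gz"
)
# With an existing SparkContext `sc`, the encoded path would be used as:
# my_data = sc.textFile(encoded)
# my_data.take(1)
```

Only the colon is rewritten here, so any other characters in the key are passed through unchanged.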

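【Discussion】:

【Solution 2】:

One solution is to use a custom FileSystem implementation, as Totango Labs did.

The gist is that you bypass the internal globStatus function, which tries to interpret the filename as a path, and use listStatus instead. The downside is that while this lets you use S3 URLs containing colons, it does not let you specify wildcards in the URL.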

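// Register the custom file system for URLs that use the "custom" scheme.
final Configuration hadoopConf = sparkContext.hadoopConfiguration();
hadoopConf.set("fs." + CustomS3FileSystem.SCHEMA + ".impl",
  CustomS3FileSystem.class.getName());

// CustomS3FileSystem.java
import java.io.IOException;
import java.util.List;

import com.google.common.collect.Lists;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.PathFilter;
import org.apache.hadoop.fs.s3native.NativeS3FileSystem;

public class CustomS3FileSystem extends NativeS3FileSystem {
  public static final String SCHEMA = "custom";

  // Skip glob expansion: list the path literally, then apply the filter.
  @Override
  public FileStatus[] globStatus(final Path pathPattern, final PathFilter filter)
      throws IOException {
    final FileStatus[] statusList = super.listStatus(pathPattern);
    final List<FileStatus> result = Lists.newLinkedList();
    for (FileStatus fileStatus : statusList) {
      if (filter.accept(fileStatus.getPath())) {
        result.add(fileStatus);
      }
    }
    return result.toArray(new FileStatus[0]);
  }
}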

【Discussion】:

【Solution 3】:

Note that to access S3 you need to use the s3n scheme rather than plain s3, as described in the Spark FAQ; otherwise the Hadoop parser fails.
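
A minimal sketch of switching the scheme in Python. The helper name `to_s3n` is hypothetical, and whether s3n is the right scheme depends on the Hadoop build and cluster environment in use.

```python
def to_s3n(s3_url):
    """Rewrite an 's3://' URL to use the 's3n://' scheme (hypothetical helper)."""
    prefix = "s3://"
    if s3_url.startswith(prefix):
        return "s3n://" + s3_url[len(prefix):]
    # Leave URLs with any other scheme unchanged.
    return s3_url


path = to_s3n("s3://my_bucket/my_data/en/2015120/")
# With an existing SparkContext `sc`, the folder would then be loaded as:
# my_data = sc.textFile(path)
```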

【Discussion】: