Posted: 2020-08-16 12:11:23
Question:
I'm trying to write to a local S3 bucket over s3a, so my Spark writeStream() call uses the path s3a://test-bucket/. To make sure Spark can resolve this, I added hadoop-aws-2.7.4.jar and aws-java-sdk-1.7.4.jar in build.sbt and configured Hadoop in code as follows:
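For reference, the corresponding build.sbt entries would look roughly like this (Maven coordinates inferred from the jar names above, so treat them as an assumption):

```scala
// build.sbt — Hadoop 2.7.x line, coordinates assumed from the jar names above
libraryDependencies ++= Seq(
  "org.apache.hadoop" % "hadoop-aws"   % "2.7.4",
  "com.amazonaws"     % "aws-java-sdk" % "1.7.4"
)
```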
spark.sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint", ENDPOINT)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key", ACCESS_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", SECRET_KEY)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.path.style.access", "true")
Now I try to write data to my custom S3 endpoint like this:
val dataStreamWriter: DataStreamWriter[Row] = PM25quality.select(
    dayofmonth(current_date()) as "day",
    month(current_date()) as "month",
    year(current_date()) as "year",
    column("time"),
    column("quality"),
    column("PM25"))
  .writeStream
  .partitionBy("year", "month", "day")
  .format("csv")
  .outputMode("append")
  .option("path", "s3a://test-bucket/")

val streamingQuery: StreamingQuery = dataStreamWriter.start()
But enabling path-style access doesn't seem to take effect; the bucket name is still being prepended to the endpoint URL:
20/05/01 15:39:02 INFO AmazonHttpClient: Unable to execute HTTP request: test-bucket.s3-region0.cloudian.com
java.net.UnknownHostException: test-bucket.s3-region0.cloudian.com
Can someone tell me if I'm missing something here?
Comments:
-
It looks like this was only implemented in 2.8.0 and later: issues.apache.org/jira/browse/HADOOP-12963
-
Thanks @mazaneicha, that worked for me.
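Following the HADOOP-12963 pointer in the comments, a minimal sketch of the fix is bumping hadoop-aws to 2.8.0 or later in build.sbt, since fs.s3a.path.style.access is only honored from that release onward (the exact version number here is illustrative):

```scala
// build.sbt — assumed fix: path-style access lands in Hadoop 2.8.0 (HADOOP-12963)
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "2.8.0"
// hadoop-aws declares its own AWS SDK dependency, so the hand-picked
// aws-java-sdk-1.7.4 jar should be dropped to avoid version conflicts.
```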
Tags: scala apache-spark amazon-s3 apache-spark-sql spark-structured-streaming