[Posted]: 2017-06-01 16:46:32
[Question]:
I have read the answer posted at how to configure s3 access keys for dataproc, but I don't find it satisfying. When I follow its steps and set the Hadoop conf via spark.hadoop.fs.s3, s3://... paths still fail with access problems, while s3a://... paths work. A test spark-shell session is shown below.
s3 vs s3n vs s3a is a topic of its own (and I think we can ignore s3n here). What strikes me as odd is that a configuration applied for s3 apparently shows up under s3a.
Here are my questions:

Is this a Dataproc issue or a Spark issue? I assume Spark, given that spark-shell exhibits the problem.
Is there a way to configure `s3` from spark-submit `--conf` flags without changing code?
Is this a bug, or should we now prefer `s3a` over `s3`?

Thanks,
***@!!!:~$ spark-shell --conf spark.hadoop.fs.s3.awsAccessKeyId=CORRECT_ACCESS_KEY \
> --conf spark.hadoop.fs.s3.awsSecretAccessKey=SECRET_KEY
// Try to read existing path, which breaks...
scala> spark.read.parquet("s3://bucket/path/to/folder")
17/06/01 16:19:58 WARN org.apache.spark.sql.execution.datasources.DataSource: Error while looking for metadata directory.
java.io.IOException: /path/to/folder doesn't exist
at org.apache.hadoop.fs.s3.Jets3tFileSystemStore.get(Jets3tFileSystemStore.java:170)
...
// notice `s3` not `s3a`
scala> spark.conf.getAll("spark.hadoop.fs.s3.awsAccessKeyId")
res3: String = CORRECT_ACCESS_KEY
scala> spark.conf.getAll("fs.s3.awsAccessKeyId")
java.util.NoSuchElementException: key not found: fs.s3.awsAccessKeyId
at scala.collection.MapLike$class.default(MapLike.scala:228)
at scala.collection.AbstractMap.default(Map.scala:59)
at scala.collection.MapLike$class.apply(MapLike.scala:141)
at scala.collection.AbstractMap.apply(Map.scala:59)
... 48 elided
scala> sc
res5: org.apache.spark.SparkContext = org.apache.spark.SparkContext@426bf2f2
scala> sc.hadoopConfiguration
res6: org.apache.hadoop.conf.Configuration = Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, file:/etc/hive/conf.dist/hive-site.xml
scala> sc.hadoopConfiguration.get("fs.s3.access.key")
res7: String = null <--- ugh... wtf?
scala> sc.hadoopConfiguration.get("fs.s3n.access.key")
res10: String = null <--- I understand this...
scala> sc.hadoopConfiguration.get("fs.s3a.access.key")
res8: String = CORRECT_ACCESS_KEY <--- But what is this???
// Successful file read
scala> spark.read.parquet("s3a://bucket/path/to/folder")
ivysettings.xml file not found in HIVE_HOME or HIVE_CONF_DIR,/etc/hive/conf.dist/ivysettings.xml will be used
res9: org.apache.spark.sql.DataFrame = [whatev... ... 22 more fields]
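For the second question, one workaround sketch (not verified on Dataproc; it assumes the hadoop-aws / S3A connector jars are already on the cluster classpath, and `your_job.jar` is a placeholder name) is to remap the `s3://` scheme to the S3A filesystem entirely from spark-submit flags, so existing `s3://` paths in the code resolve through the connector that is known to work:

```shell
# Sketch: point the s3:// scheme at the S3A connector and supply s3a credentials.
# Assumes the hadoop-aws jars are on the classpath; your_job.jar is a placeholder.
spark-submit \
  --conf spark.hadoop.fs.s3.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.access.key=CORRECT_ACCESS_KEY \
  --conf spark.hadoop.fs.s3a.secret.key=SECRET_KEY \
  your_job.jar
```

The `spark.hadoop.` prefix forwards each property into the job's Hadoop `Configuration`, which is why `sc.hadoopConfiguration.get("fs.s3a.access.key")` in the transcript above can see values set this way.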
[Discussion]:
Tags: apache-spark amazon-s3 google-cloud-dataproc