This is a Scala setup that works with Spark 3.2.1 (pre-built for Hadoop 3.3.1) for accessing an S3 bucket from a non-AWS machine (typically a local developer setup).
sbt
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "3.2.1" % "provided",
  "org.apache.spark" %% "spark-streaming" % "3.2.1" % "provided",
  "org.apache.spark" %% "spark-sql" % "3.2.1" % "provided",
  // hadoop-aws pulls in a matching aws-java-sdk-bundle transitively
  "org.apache.hadoop" % "hadoop-aws" % "3.3.1",
  "org.apache.hadoop" % "hadoop-common" % "3.3.1" % "provided"
)
Spark program
import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .master("local")
  .appName("Process parquet file")
  .config("spark.hadoop.fs.s3a.path.style.access", true)
  .config("spark.hadoop.fs.s3a.access.key", ACCESS_KEY)
  .config("spark.hadoop.fs.s3a.secret.key", SECRET_KEY)
  .config("spark.hadoop.fs.s3a.endpoint", ENDPOINT)
  .config(
    "spark.hadoop.fs.s3a.impl",
    "org.apache.hadoop.fs.s3a.S3AFileSystem"
  )
  // Enabling V4 signing does not seem necessary for the eu-west-3 region,
  // see @stevel's comment below
  // .config("com.amazonaws.services.s3.enableV4", true)
  // .config(
  //   "spark.driver.extraJavaOptions",
  //   "-Dcom.amazonaws.services.s3.enableV4=true"
  // )
  .config("spark.executor.instances", "4")
  .getOrCreate()

spark.sparkContext.setLogLevel("ERROR")

val df = spark.read.parquet("s3a://[BUCKET NAME]/.../???.parquet")
df.show()
Note: the endpoint format is s3.[REGION].amazonaws.com, e.g. s3.eu-west-3.amazonaws.com
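The endpoint string can be derived from the region name following that format; a minimal sketch (the helper name `s3Endpoint` is an assumption for illustration, not part of the original setup):

```scala
// Hypothetical helper: build the S3 endpoint host for a region,
// following the s3.[REGION].amazonaws.com format noted above.
def s3Endpoint(region: String): String = s"s3.$region.amazonaws.com"
```

For example, `s3Endpoint("eu-west-3")` yields `"s3.eu-west-3.amazonaws.com"`, which can be passed as ENDPOINT above.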
S3 configuration
To make the bucket reachable from outside AWS, add a bucket policy of the following form:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "Statement1",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::[ACCOUNT ID]:user/[IAM USERNAME]"
      },
      "Action": [
        "s3:Delete*",
        "s3:Get*",
        "s3:List*",
        "s3:PutObject"
      ],
      "Resource": "arn:aws:s3:::[BUCKET NAME]/*"
    }
  ]
}
The ACCESS_KEY and SECRET_KEY supplied to the Spark config must belong to the IAM user granted access in the bucket policy.
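To avoid hardcoding those keys in source, one option is to read them from the environment before building the session; a minimal sketch (the helper `requiredSetting` and the AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY names follow the usual AWS convention but are assumptions here, not part of the original answer):

```scala
// Hypothetical helper: look up a required credential in a settings map
// (e.g. sys.env) and fail fast with a clear message if it is absent.
def requiredSetting(settings: Map[String, String], key: String): String =
  settings.getOrElse(key, sys.error(s"Missing required setting: $key"))

// Usage with real environment variables:
// val ACCESS_KEY = requiredSetting(sys.env, "AWS_ACCESS_KEY_ID")
// val SECRET_KEY = requiredSetting(sys.env, "AWS_SECRET_ACCESS_KEY")
```

Failing fast here gives a clearer error than the AWS 403 you would otherwise get deep inside the S3A filesystem at read time.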