Reading and Saving Image Files in PySpark: Answers

【Question title】: Reading and Saving Image File in Pyspark
【Posted】: 2020-12-26 19:19:58
【Question】:

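I need to read an image from an S3 bucket and convert it to base64-encoded form.

I can read the image file from S3, but when I pass the S3 file path to my base64 method, it does not recognize the path.

So I thought I would save the image DataFrame (as the same image) to a temporary path on the cluster and then pass that path to the base64 method.

However, saving the image DataFrame fails with the error below. (I first tried saving it with the "image" format, but a Google search suggested that this format has a bug, and someone recommended the format used below.)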

java.lang.ClassNotFoundException: Failed to find data source: org.apache.spark.ml.source.image.PatchedImageFileFormat.

Please see my sample code below, and let me know where I can find the dependency package.

spark._jsc.hadoopConfiguration().set('fs.s3a.access.key', '************')
spark._jsc.hadoopConfiguration().set('fs.s3a.secret.key', '************')
spark._jsc.hadoopConfiguration().set('fs.s3a.endpoint', '************')

import base64

def getImageStr(img):
    # open() only understands local file paths, not s3a:// URIs
    with open(img, "rb") as imageFile:
        str1 = base64.b64encode(imageFile.read())
        str2 = str(str1, 'utf-8')
    return str2

img_df = spark.read\
  .format("image")\
  .load("s3a://xxx/yyy/zzz/hello.jpg")

img_df.printSchema()


img_df.write\
    .format("org.apache.spark.ml.source.image.PatchedImageFileFormat")\
    .save("/tmp/sample.jpg")

img_str = getImageStr("/tmp/sample.jpg")

print(img_str)

If there is any other way to download an image file from S3 in Spark (without using the boto3 package), please let me know.

【Question comments】:

Tags: python apache-spark pyspark databricks

【Solution 1】:

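When you use the image data source, you get a DataFrame with an image column whose binary payload, image.data, contains the actual image. You can then encode that column with the built-in base64 function and write the encoded representation to a file. Something like this (untested):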

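from pyspark.sql.functions import base64, col
# Load the image; the binary payload is in the nested field image.data
img_df = spark.read.format("image").load("s3a://xxx/yyy/zzz/hello.jpg")
# Base64-encode the payload column
proc_df = img_df.select(base64(col("image.data")).alias('encoded'))
# Write the encoded text as one part file; Spark creates the target path as a directory
proc_df.coalesce(1).write.format("text").save('/tmp/sample.jpg')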
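
One caveat: the image data source decodes the picture, so image.data holds pixel data rather than the bytes of the original JPEG file. If the goal is the base64 of the file itself, the binaryFile data source (Spark 3.0 and later) returns the raw bytes in a content column. The following is only a sketch under that assumption, not tested against S3. The helper names encode_bytes and image_file_to_base64 are made up for illustration, and it assumes a single, small image that fits comfortably on the driver.

```python
import base64


def encode_bytes(content) -> str:
    """Base64-encode raw bytes already in memory (no file path involved)."""
    return base64.b64encode(content).decode("utf-8")


def image_file_to_base64(spark, path: str) -> str:
    """Read one file's raw bytes with the binaryFile source and encode them.

    `spark` is an existing SparkSession; `path` is any location the cluster
    can read, such as the s3a:// URI from the question.
    """
    row = (
        spark.read.format("binaryFile")  # columns: path, modificationTime, length, content
        .load(path)
        .select("content")
        .first()  # brings the bytes of the first matching file to the driver
    )
    if row is None:
        raise ValueError(f"no file found at {path}")
    return encode_bytes(row["content"])
```

With the session from the question, this would be called as image_file_to_base64(spark, "s3a://xxx/yyy/zzz/hello.jpg"). Because the bytes never touch the local file system, no temporary copy under /tmp is needed.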

【Comments】: