Specify pyspark dataframe schema with string longer than 256

【Question title】: Specify pyspark dataframe schema with string longer than 256
【Posted】: 2023-03-30 08:32:01
【Question description】:

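I am reading from a source whose descriptions are longer than 256 characters, and I want to write them to Redshift.

According to https://github.com/databricks/spark-redshift#configuring-the-maximum-size-of-string-columns this is only possible in Scala.

According to https://github.com/databricks/spark-redshift/issues/137#issuecomment-165904691 specifying the schema when creating the dataframe should be a workaround, but I can't get it to work.

How can I specify the schema with varchar(max)?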

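from pyspark.sql.types import StructType, StructField, StringType

df = ...  # DataFrame read from the source

# Both columns are plain nullable strings; no length information is attached yet
schema = StructType([
    StructField('field1', StringType(), True),
    StructField('description', StringType(), True)
])

# Rebuild the dataframe with the explicit schema
df = sqlContext.createDataFrame(df.rdd, schema)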

【Question discussion】:

Tags: apache-spark pyspark apache-spark-sql amazon-redshift

【Solution 1】:

The Redshift maxlength annotation is passed as column metadata in the format

{"maxlength":2048}

So this is the structure you should pass to the StructField constructor:

from pyspark.sql.types import StructField, StringType

StructField("description", StringType(), metadata={"maxlength":2048})

Or use the alias method:

from pyspark.sql.functions import col

col("description").alias("description", metadata={"maxlength":2048})

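If you use PySpark 2.2 or earlier, see "How to change column metadata in pyspark?" for a workaround.

【Discussion】:

Marking this as the correct answer: even though I haven't gotten it to work yet, it answers my question. According to docs.databricks.com/spark/latest/data-sources/aws/… (Databricks recently closed the source of the spark-redshift project), it should now work in Python as well.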