Pass date string into withColumn: answers

【Question title】: Pass date string into withColumn
【Posted】: 2020-12-22 10:41:23
【Question description】:

I am using PySpark and want to add a yyyy_mm_dd string as a column to my DataFrame. This is what I tried:

import pyspark.sql.functions as sf
from pyspark.sql.functions import when

end_date = '2020-01-20'
final = (
    df1
    .join(df, on=['id', 'product'], how='left_outer')
    .where(sf.col('id').isNotNull())
    .withColumn(
        'status',
        when(sf.col('count') >= 10, 3)
        .when((sf.col('count') <= 9) & (sf.col('count') >= 1), 2)
        .when(sf.col('count').isNull(), 1)
    )
    .withColumn('yyyy_mm_dd', end_date)  # this call raises the error below
)
final.fillna(0, subset=['count']).orderBy('id', 'product').show(500, False)

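This works without the last .withColumn, but when I include it I get the following error:

AssertionError: col should be Column

From the docs, it looks like I should pass a Column as the second argument to withColumn. However, I am not sure how to convert my date string into a Column. I saw this solution in another post, but I don't want to use current_date(), because my end_date variable will be read in from an orchestrator script.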

【Question comments】:

Tags: python apache-spark pyspark apache-spark-sql

【Solution 1】:

Use lit:

.withColumn('yyyy_mm_dd', sf.lit(end_date))

If you want a date type rather than a string, cast the literal accordingly:

.withColumn('yyyy_mm_dd', sf.lit(end_date).cast("date"))

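【Comments】:

【Solution 2】:

Have a look at the withColumn documentation. It takes the column name as its first argument and a Column as its second. You can use lit() to turn the string into a Column holding a constant value.

pyspark.sql.functions.lit(col) creates a Column of literal value.

df.select(lit(5).alias('height')).withColumn('spark_user', lit(True)).take(1)
[Row(height=5, spark_user=True)]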

【Comments】: