PySpark equivalent of the Scala API function "typedLit"

【Question title】: PySpark equivalent of function "typedLit" from Scala API
【Posted】: 2020-09-18 17:54:33
【Question description】:

The Scala API for Spark has a function, typedLit, that can add an Array or a Map as a column value.

import org.apache.spark.sql.functions.typedLit
val df1 = Seq((1, 0), (2, 3)).toDF("a", "b")

df1.withColumn("seq", typedLit(Seq(1,2,3)))
    .show(truncate=false)

+---+---+---------+
|a  |b  |seq      |
+---+---+---------+
|1  |0  |[1, 2, 3]|
|2  |3  |[1, 2, 3]|
+---+---+---------+

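I cannot find an equivalent in PySpark. How can we create a column whose value is an Array in PySpark?

【Question comments】:

There is no equivalent of typedLit in PySpark, but you can combine array and lit: df1.withColumn("seq", array([lit(x) for x in [1,2,3]]))

Tags: scala apache-spark pyspark apache-spark-sql

【Solution 1】:

PySpark does not have an equivalent function yet, but you can build an array column like this: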

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, lit

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])
df.withColumn('seq', array([lit(i) for i in [1, 2, 3]])).show()

Output:

+---+---+---------+
|  a|  b|      seq|
+---+---+---------+
|  1|  2|[1, 2, 3]|
|  3|  4|[1, 2, 3]|
+---+---+---------+

【Solution 2】:

Using expr with array looks the most elegant to me:

df = df.withColumn('seq', F.expr('array(1,2,3)'))

Tested:

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1,0), (2,3)], ['a', 'b'])
df = df.withColumn('seq', F.expr('array(1,2,3)'))
df.show()
#  +---+---+---------+
#  |  a|  b|      seq|
#  +---+---+---------+
#  |  1|  0|[1, 2, 3]|
#  |  2|  3|[1, 2, 3]|
#  +---+---+---------+

If the array should hold consecutive numbers, use F.expr('sequence(1,3)') instead.
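
When the values live in a Python list or dict rather than in a hand-written string, the SQL text passed to expr can be generated. The following is a rough sketch under stated assumptions: sql_literal is a hypothetical helper (not part of PySpark), it handles only None, bool, int, str, list/tuple, and dict, and it assumes Spark's default string-literal parsing, in which a backslash escapes a single quote.

```python
def sql_literal(value):
    """Hypothetical helper: render a Python value as a Spark SQL literal expression."""
    if value is None:
        return "NULL"
    if isinstance(value, bool):  # check bool before int: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    if isinstance(value, str):
        # Assumption: default parser settings, where backslash is the escape character
        escaped = value.replace("\\", "\\\\").replace("'", "\\'")
        return f"'{escaped}'"
    if isinstance(value, (list, tuple)):
        return "array(" + ", ".join(sql_literal(v) for v in value) + ")"
    if isinstance(value, dict):
        parts = []
        for k, v in value.items():
            parts += [sql_literal(k), sql_literal(v)]
        return "map(" + ", ".join(parts) + ")"
    raise TypeError(f"unsupported literal type: {type(value).__name__}")


seq_expr = sql_literal([1, 2, 3])
map_expr = sql_literal({"x": 1, "y": 2})
print(seq_expr)  # array(1, 2, 3)
print(map_expr)  # map('x', 1, 'y', 2)
```

The resulting string can then be used as in the answer above, for example df.withColumn('seq', F.expr(sql_literal([1, 2, 3]))). Only do this with values you control; building SQL text from untrusted input is unsafe.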

【Solution 3】:

You can call .cast() directly on the result of lit() to give the Column an explicit type:

import pyspark.sql.functions as sf
from pyspark.sql.types import LongType

df1.withColumn("long", sf.lit(1).cast(LongType()))

The same works for array(), either by casting each element or by casting the whole array:

import pyspark.sql.functions as sf
from pyspark.sql.types import LongType, ArrayType
df1.withColumn("pirate", sf.array([sf.lit(x).cast(LongType()) for x in [1, 2, 3]]))
df1.withColumn("pirate", sf.array([sf.lit(x) for x in [1, 2, 3]]).cast(ArrayType(LongType())))

If you really like literals and types but hate typing on a keyboard, you can pass the type as a string:

df1.withColumn("pirate", sf.array(sf.lit("1"), sf.lit("2")).cast("array<int>"))

;)

PS: Also consider using map with sf.lit instead of the for comprehension.
