【Posted】: 2023-03-25 09:48:01
【Question】:
I am trying to create a new column with a PySpark UDF, but every value comes back null.
Create the DF:
data_list = [['a', [1, 2, 3]], ['b', [4, 5, 6]],['c', [2, 4, 6, 8]],['d', [4, 1]],['e', [1,2]]]
all_cols = ['COL1','COL2']
df = sqlContext.createDataFrame(data_list, all_cols)
df.show()
+----+------------+
|COL1|        COL2|
+----+------------+
|   a|   [1, 2, 3]|
|   b|   [4, 5, 6]|
|   c|[2, 4, 6, 8]|
|   d|      [4, 1]|
|   e|      [1, 2]|
+----+------------+
df.printSchema()
root
 |-- COL1: string (nullable = true)
 |-- COL2: array (nullable = true)
 |    |-- element: long (containsNull = true)
Create the function:
def cr_pair(idx_src, idx_dest):
    # Rotate idx_dest left by one position, in place
    idx_dest.append(idx_dest.pop(0))
    return idx_src, idx_dest
lst1 = [1,2,3]
lst2 = [1,2,3]
cr_pair(lst1, lst2)
([1, 2, 3], [2, 3, 1])
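One detail worth checking before wiring this into a UDF: `cr_pair` mutates its second argument, so a call such as `cr_pair(x, x)` passes the same list object twice and both returned elements end up rotated. The sketch below reproduces that in plain Python; the variable names `same`, `fresh`, `aliased`, and `separate` are illustrative only, and passing `list(fresh)` as a copy is just one possible way around it.

```python
def cr_pair(idx_src, idx_dest):
    # Rotate idx_dest left by one position, in place
    idx_dest.append(idx_dest.pop(0))
    return idx_src, idx_dest


# Same object for both arguments: the rotation is visible through both names.
same = [1, 2, 3]
aliased = cr_pair(same, same)
print(aliased)  # ([2, 3, 1], [2, 3, 1])

# Passing a copy as the second argument keeps the source list untouched.
fresh = [1, 2, 3]
separate = cr_pair(fresh, list(fresh))
print(separate)  # ([1, 2, 3], [2, 3, 1])
```

The test call above does not show this because `lst1` and `lst2` are two separate lists.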
Create and register the UDF:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType
from pyspark.sql.types import ArrayType
get_idx_pairs = udf(lambda x: cr_pair(x, x), ArrayType(IntegerType()))
Add the new column to the DF:
df = df.select('COL1', 'COL2', get_idx_pairs('COL2').alias('COL3'))
df.printSchema()
root
 |-- COL1: string (nullable = true)
 |-- COL2: array (nullable = true)
 |    |-- element: long (containsNull = true)
 |-- COL3: array (nullable = true)
 |    |-- element: integer (containsNull = true)
df.show()
+----+------------+------------+
|COL1|        COL2|        COL3|
+----+------------+------------+
|   a|   [1, 2, 3]|[null, null]|
|   b|   [4, 5, 6]|[null, null]|
|   c|[2, 4, 6, 8]|[null, null]|
|   d|      [4, 1]|[null, null]|
|   e|      [1, 2]|[null, null]|
+----+------------+------------+
What is going wrong here? Every value in COL3 comes out as null. The expected result is:
+----+------------+----------------------------+
|COL1|        COL2|                        COL3|
+----+------------+----------------------------+
|   a|   [1, 2, 3]|[[1, 2, 3], [2, 3, 1]]      |
|   b|   [4, 5, 6]|[[4, 5, 6], [5, 6, 4]]      |
|   c|[2, 4, 6, 8]|[[2, 4, 6, 8], [4, 6, 8, 2]]|
|   d|      [4, 1]|[[4, 1], [1, 4]]            |
|   e|      [1, 2]|[[1, 2], [2, 1]]            |
+----+------------+----------------------------+
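A likely cause, offered as a sketch and not a confirmed diagnosis: the UDF returns two lists, but its declared return type is `ArrayType(IntegerType())`, a flat array of integers. Spark cannot convert each inner list to an integer, which would explain the two nulls per row. Declaring a nested array type should line the schema up with the expected table. The names `rotate_pair` and `add_pair_column` below are hypothetical helpers, not part of the original code, and `LongType` is chosen only to match the `long` elements of COL2.

```python
def rotate_pair(values):
    """Return [values, values rotated left by one] without mutating the input."""
    values = list(values)  # work on a copy so the source column data is untouched
    return [values, values[1:] + values[:1]]


def add_pair_column(df, src="COL2", dest="COL3"):
    """Append column `dest` holding rotate_pair(src) as array<array<bigint>>.

    pyspark is imported lazily so rotate_pair can be tried without Spark.
    """
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, LongType

    # The declared type mirrors the returned shape: a list of lists of integers.
    pair_udf = udf(rotate_pair, ArrayType(ArrayType(LongType())))
    return df.withColumn(dest, pair_udf(src))
```

Under these assumptions, `add_pair_column(df).show(truncate=False)` on the original two-column DF should show nested arrays in COL3 instead of nulls.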
【Discussion】:
Tags: apache-spark pyspark apache-spark-sql user-defined-functions