Pyspark SQL: Transform table with array of struct to columns

[Question title]: Pyspark SQL: Transform table with array of struct to columns
[Posted]: 2020-10-16 12:27:38
[Question description]:

I have a HIVE table with 2 columns (string, array of struct) that looks like this:
|| id || params ||
|| id1 || [{type=A, cnt=4}, {type=B, cnt=2}] ||
|| id2 || [{type=A, cnt=3}, {type=C, cnt=1}, {type=D, cnt=0}] ||
|| id3 || [{type=E, cnt=1}] ||

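I need to convert it into a table with separate int columns, where the column names are the "type" values and the cell values are the matching cnt: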

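|| id || A || B || C || D || E ||
|| id1 || 4 || 2 || null || null || null ||
|| id2 || 3 || null || 1 || 0 || null ||
|| id3 || null || null || null || null || 1 ||

What is the best and most efficient way to transform the table? Both Spark SQL and PySpark styles are welcome. Thanks.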

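[Comments]:

Explode the array, extract the values (one new column per value: the value if it exists, 0 if it does not), then group by id and aggregate with the sum function.
Please accept answers rather than leaving them open.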

Tags: sql apache-spark pyspark

[Solution 1]:

Try this. I'm not sure whether the sum is needed, but it is a safe assumption:

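from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import IntegerType

# Reuse the active session (already available as `spark` in the pyspark shell)
spark = SparkSession.builder.getOrCreate()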

# Sample data with some variation from yours; each struct is modelled as a plain string
df = spark.createDataFrame([(1, ["type=AA, cnt=4", "type=B, cnt=2222"]),
                            (2, ["type=AA, cnt=3", "type=C, cnt=1", "type=D, cnt=0"]),
                            (3, ["type=E, cnt=1"])],["id", "params"])
# Explode: one row per array element, in a column named 'col'
df = df.select(df.id, F.explode(df.params))

# Make separate cols, strip the leading "type=" / " cnt=" prefixes & convert count to Int
split_col = F.split(df['col'], ',')
df = df.withColumn('type', split_col.getItem(0)).withColumn('count', split_col.getItem(1)).drop('col')
df = df.withColumn('type',F.expr("substring(type, 6, length(type))")).withColumn('count',F.expr("substring(count, 6, length(count))").cast(IntegerType()))

# Pivot to your format
df.groupBy("id").pivot("type").agg(F.sum("count")).sort(F.col("id").asc()).show()

+---+----+----+----+----+----+
| id|  AA|   B|   C|   D|   E|
+---+----+----+----+----+----+
|  1|   4|2222|null|null|null|
|  2|   3|null|   1|   0|null|
|  3|null|null|null|null|   1|
+---+----+----+----+----+----+

[Comments]: