Explode array of structs to columns in Spark: answers

【Question title】: Explode array of structs to columns in Spark
【Posted】: 2018-05-13 04:54:18
【Question description】:

I would like to explode an array of structs into columns, one per struct field. For example,

root
 |-- arr: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- id: long (nullable = false)
 |    |    |-- name: string (nullable = true)

should be transformed to

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)

I can achieve this with two selects:

df
  .select(explode($"arr").as("tmp"))
  .select($"tmp.*")

How can I do this in a single select statement?

I thought this would work, but unfortunately it does not:

df.select(explode($"arr")(".*"))

Exception in thread "main" org.apache.spark.sql.AnalysisException: No such struct field .* in col;

【Question comments】:

Tags: scala apache-spark dataframe

【Solution 1】:

A single-step solution is available only for MapType columns:

val df = Seq(Tuple1(Map((1L, "bar"), (2L, "foo")))).toDF

df.select(explode($"_1") as Seq("foo", "bar")).show

+---+---+
|foo|bar|
+---+---+
|  1|bar|
|  2|foo|
+---+---+

With arrays you can use flatMap:

val df = Seq(Tuple1(Array((1L, "bar"), (2L, "foo")))).toDF
df.as[Seq[(Long, String)]].flatMap(identity)
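
To tie the flatMap approach back to the schema in the question, here is a minimal sketch using a typed Dataset. The case classes `Item` and `Record`, the object `FlatMapExample`, and the helper `explodeItems` are hypothetical names introduced for illustration; the local SparkSession setup is likewise an assumption.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical case classes mirroring the schema in the question:
// a column `arr` holding an array of (id, name) structs.
case class Item(id: Long, name: String)
case class Record(arr: Seq[Item])

object FlatMapExample {
  // Emits one Item per array element; a null `arr` yields no rows.
  def explodeItems(spark: SparkSession, ds: Dataset[Record]): Dataset[Item] = {
    import spark.implicits._
    ds.flatMap(r => Option(r.arr).getOrElse(Seq.empty[Item]))
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatmap-example")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(Record(Seq(Item(1L, "bar"), Item(2L, "foo")))).toDS()

    // The result has the columns `id` and `name`.
    explodeItems(spark, ds).printSchema()
    explodeItems(spark, ds).show()

    spark.stop()
  }
}
```

Note that this goes through the typed API, so every row is deserialized into a JVM object. That may be slower than a pure column expression on large data.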

A single SELECT statement can be written in SQL:

df.createOrReplaceTempView("df")

spark.sql("SELECT x._1, x._2 FROM df LATERAL VIEW explode(_1) t AS x")
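
As a further option that is not part of the answer above, Spark SQL's built-in `inline` generator expands an array of structs into one row per element and one column per struct field, so it can be called from a single `selectExpr`. The sketch below assumes a column named `arr` as in the question; the case class `Elem`, the object `InlineExample`, and the helper `explodeToColumns` are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical element type mirroring the struct in the question.
case class Elem(id: Long, name: String)

object InlineExample {
  // One select: `inline` yields a row per array element and
  // a column per struct field (here `id` and `name`).
  def explodeToColumns(df: DataFrame): DataFrame =
    df.selectExpr("inline(arr)")

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("inline-example")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(Tuple1(Seq(Elem(1L, "bar"), Elem(2L, "foo")))).toDF("arr")

    explodeToColumns(df).printSchema()
    explodeToColumns(df).show()

    spark.stop()
  }
}
```

Like `explode`, `inline` produces no rows for a null or empty array. If such rows must be kept, `inline_outer` is the variant to look at.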

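【Comments】:

The first solution, using a Map, does not match the OP's schema, and the second is similar to the two selects the OP has already implemented. Isn't that so?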