Change schema of Spark dataframe column: answers

【Question title】: Change schema of spark dataframe column
【Posted】: 2020-06-07 17:24:50
【Question】:

I have a pyspark dataframe with a "Student" column.

One entry looks like this:

{
   "Student" : {
       "m" : {
           "name" : {"s" : "john"},
           "score": {"s" : "165"}
       }
   }
}

I want to change the schema of this column so that an entry looks like this:

{
    "Student" : 
    {
        "m" : 
        {
            "StudentDetails" : 
            {
                "m" : 
                {
                    "name" : {"s" : "john"},
                    "score": {"s" : "165"}
                }
            }
        }
    } 
}

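The catch is that the Student field can also be null in the dataframe, so I want to keep the null values and change the schema only for the non-null values. I used a udf for this: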

        def Helper_ChangeSchema(row):
            #null check
            if row is None:
                return None
            #change schema
            data = row.asDict(True)
            return {"m":{"StudentDetails":data}}

But a udf is a black box to Spark. Is there a way to do the same thing with built-in Spark functions or a SQL query?

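【Comments】:

How is this different from this question?
{Student:null} can also be in the data.
What's wrong with this answer? Why doesn't it work with null? By the way: why don't you accept it?
Yes, cool solution :). But I'm having a hard time applying it here.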

Tags: python dataframe apache-spark pyspark apache-spark-sql

【Solution 1】:

It works exactly the same way as in this answer. Just add another nesting level to the struct:

As a SQL expression:

from pyspark.sql import functions as F
processedDf = df.withColumn("student", F.expr("named_struct('m', named_struct('student_details', student))"))

Or in Python code, using the struct function:

processedDf = df.withColumn("student", F.struct(F.struct(F.col("student").alias("student_details")).alias('m')))

Both versions produce the same schema:

root
 |-- student: struct (nullable = false)
 |    |-- m: struct (nullable = false)
 |    |    |-- student_details: struct (nullable = true)
 |    |    |    |-- m: struct (nullable = true)
 |    |    |    |    |-- name: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)
 |    |    |    |    |-- score: struct (nullable = true)
 |    |    |    |    |    |-- s: string (nullable = true)

Both approaches also work for null rows. Using this input data

data ='{"student" : {"m" : {"name" : {"s" : "john"},"score": {"s" : "165"}}}}'
data2='{"student": null }'
df = spark.read.json(sc.parallelize([data, data2]))

processedDf.show(truncate=False) prints

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|[[]]                 |
+---------------------+

Edit: if the whole column value should be set to null rather than only the field inside the struct, you can add a when:

processedDf = df.withColumn("student", F.when(F.col("student").isNull(), F.lit(None)).otherwise(F.struct(F.struct(F.col("student").alias("student_details")).alias('m'))))

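This results in the same schema (except that the top-level student column is now nullable), but a different output for the null row: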

+---------------------+
|student              |
+---------------------+
|[[[[[john], [165]]]]]|
|null                 |
+---------------------+

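【Comments】:

A row can be empty, and it can also be null. This solution handles empty rows, but how would null rows be handled?
I think null rows work fine. I have added my test data.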