【Question Title】: How to flatten a struct in a Spark dataframe?
【Posted】: 2016-12-09 18:48:08
【Question】:

I have a dataframe with the following structure:

 |-- data: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- keyNote: struct (nullable = true)
 |    |    |-- key: string (nullable = true)
 |    |    |-- note: string (nullable = true)
 |    |-- details: map (nullable = true)
 |    |    |-- key: string
 |    |    |-- value: string (valueContainsNull = true)

How can I flatten the struct and create a new dataframe like this:

     |-- id: long (nullable = true)
     |-- keyNote: struct (nullable = true)
     |    |-- key: string (nullable = true)
     |    |-- note: string (nullable = true)
     |-- details: map (nullable = true)
     |    |-- key: string
     |    |-- value: string (valueContainsNull = true)

Is there something like explode, but for structs?

【Question Comments】:

Tags: java apache-spark pyspark apache-spark-sql


【Solution 1】:

Based on https://stackoverflow.com/a/49532496/17250408, here is a solution for struct and array fields with multi-level nesting:

from pyspark.sql.functions import col, explode


def type_cols(df_dtypes, filter_type):
    # Return the names of columns whose type string starts with filter_type.
    cols = []
    for col_name, col_type in df_dtypes:
        if col_type.startswith(filter_type):
            cols.append(col_name)
    return cols


def flatten_df(nested_df, sep='_'):
    # Expand one level of struct columns, prefixing children with the parent name.
    nested_cols = type_cols(nested_df.dtypes, "struct")
    flatten_cols = [fc for fc, _ in nested_df.dtypes if fc not in nested_cols]
    for nc in nested_cols:
        for cc in nested_df.select(f"{nc}.*").columns:
            if sep is None:
                flatten_cols.append(col(f"{nc}.{cc}").alias(f"{cc}"))
            else:
                flatten_cols.append(col(f"{nc}.{cc}").alias(f"{nc}{sep}{cc}"))
    return nested_df.select(flatten_cols)


def explode_df(nested_df):
    # Explode every array column into one row per element.
    nested_cols = type_cols(nested_df.dtypes, "array")
    exploded_df = nested_df
    for nc in nested_cols:
        exploded_df = exploded_df.withColumn(nc, explode(col(nc)))
    return exploded_df


def flatten_explode_df(nested_df):
    # Recurse until no struct or array columns remain.
    df = nested_df
    struct_cols = type_cols(nested_df.dtypes, "struct")
    array_cols = type_cols(nested_df.dtypes, "array")
    if struct_cols:
        df = flatten_df(df)
        return flatten_explode_df(df)
    if array_cols:
        df = explode_df(df)
        return flatten_explode_df(df)
    return df


df = flatten_explode_df(nested_df)
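
A quick usage sketch on a schema like the one in the question (the nested_df built here is illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical input matching the question's schema (one struct column).
nested_df = spark.createDataFrame(
    [((1, ("k", "n"), {"a": "b"}),)],
    "data struct<id: bigint, keyNote: struct<key: string, note: string>, details: map<string, string>>",
)

flatten_explode_df(nested_df).printSchema()
# Expect flat columns such as data_id, data_keyNote_key,
# data_keyNote_note and data_details (map columns are left as-is).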

【Discussion】:

    【Solution 2】:

    A more compact and efficient implementation:

    There is no need to build per-level column lists and re-iterate over them; you act on each field according to its type (struct or not).

    You keep a stack of (parent path, projected dataframe) pairs and iterate on it: if a column is nested (a struct), you flatten it (.*); otherwise you reference it with dot notation (parent.child) and alias it with _ instead (parent_child).

    from pyspark.sql.functions import col


    def flatten_df(nested_df):
        stack = [((), nested_df)]
        columns = []
        while len(stack) > 0:
            parents, df = stack.pop()
            for column_name, column_type in df.dtypes:
                if column_type[:6] == "struct":
                    projected_df = df.select(column_name + ".*")
                    stack.append((parents + (column_name,), projected_df))
                else:
                    columns.append(col(".".join(parents + (column_name,))).alias("_".join(parents + (column_name,))))
        return nested_df.select(columns)
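
    A quick usage sketch (assuming df has the question's top-level data struct; a map column is not a struct, so it stays as a single column):

    flatten_df(df).printSchema()
    # Expect roughly: data_id, data_details, data_keyNote_key, data_keyNote_note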
    

    【Discussion】:

    • Welcome to Stack Overflow. Code-only answers are discouraged on Stack Overflow because they don't explain how the problem is solved. Please edit your answer to explain what this code does and how it is more efficient than the other answers, so that it is useful to other users with similar questions who can learn from it.
    • Works perfectly for me. It would be even better with a bit more explanation.
    【Solution 3】:

    We use https://github.com/lvhuyen/SparkAid; it works for any level of nesting:

    from sparkaid import flatten

    flatten(df_nested_B).printSchema()

    【Discussion】:

      【Solution 4】:

      A PySpark solution to flatten a nested df with struct and array types at any depth level, improved from https://stackoverflow.com/a/56533459/7131019:

      from pyspark.sql.types import *
      from pyspark.sql import functions as f
      
      def flatten_structs(nested_df):
          stack = [((), nested_df)]
          columns = []
      
          while len(stack) > 0:
              
              parents, df = stack.pop()
              
              array_cols = [
                  c[0]
                  for c in df.dtypes
                  if c[1][:5] == "array"
              ]
              
              flat_cols = [
                  f.col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
                  for c in df.dtypes
                  if c[1][:6] != "struct"
              ]
      
              nested_cols = [
                  c[0]
                  for c in df.dtypes
                  if c[1][:6] == "struct"
              ]
              
              columns.extend(flat_cols)
      
              for nested_col in nested_cols:
                  projected_df = df.select(nested_col + ".*")
                  stack.append((parents + (nested_col,), projected_df))
              
          return nested_df.select(columns)
      
      def flatten_array_struct_df(df):
          
          array_cols = [
                  c[0]
                  for c in df.dtypes
                  if c[1][:5] == "array"
              ]
          
          while len(array_cols) > 0:
              
              for array_col in array_cols:
                  
                  # Explode each array column into one row per element (this
                  # can multiply rows when several arrays are present).
                  df = df.withColumn(array_col, f.explode(f.col(array_col)))
                  
              df = flatten_structs(df)
              
              array_cols = [
                  c[0]
                  for c in df.dtypes
                  if c[1][:5] == "array"
              ]
          return df
      
      flat_df = flatten_array_struct_df(df)
      

      【Discussion】:

      • This works great. I just take on the "responsibility" of using it carefully, since flattening an array of structs can produce duplicate rows, as others have said.
      【Solution 5】:

      For Spark 2.4.5,

      df.select(df.col("data.*")) will give you an org.apache.spark.sql.AnalysisException: No such struct field * in exception

      This will work:

      df.select($"data.*")
      

      【Discussion】:

      • When I select with df.select("data.*") it gives me n*n rows (each row repeated n times). My dataframe has n-2 distinct rows, so applying distinct gives me n-2 results, but I want the n rows that actually exist in my data. How can I achieve that with the select command above?
      【Solution 6】:

      This is for Scala Spark.

      val totalMainArrayBuffer=collection.mutable.ArrayBuffer[String]()
      def flatten_df_Struct(dfTemp:org.apache.spark.sql.DataFrame,dfTotalOuter:org.apache.spark.sql.DataFrame):org.apache.spark.sql.DataFrame=
      {
      //dfTemp.printSchema
      val totalStructCols=dfTemp.dtypes.map(x => x.toString.substring(1,x.toString.size-1)).filter(_.split(",",2)(1).contains("Struct")) // in case the column names come with the word Struct embedded in them
      val mainArrayBuffer=collection.mutable.ArrayBuffer[String]()
      for(totalStructCol <- totalStructCols)
      {
      val tempArrayBuffer=collection.mutable.ArrayBuffer[String]()
      tempArrayBuffer+=s"${totalStructCol.split(",")(0)}.*"
      //tempArrayBuffer.toSeq.toDF.show(false)
      val columnsInside=dfTemp.selectExpr(tempArrayBuffer:_*).columns
      for(column <- columnsInside)
      mainArrayBuffer+=s"${totalStructCol.split(",")(0)}.${column} as ${totalStructCol.split(",")(0)}_${column}"
      //mainArrayBuffer.toSeq.toDF.show(false)
      }
      //dfTemp.selectExpr(mainArrayBuffer:_*).printSchema
      val nonStructCols=dfTemp.selectExpr(mainArrayBuffer:_*).dtypes.map(x => x.toString.substring(1,x.toString.size-1)).filter(!_.split(",",2)(1).contains("Struct")) // in case the column names come with the word Struct embedded in them
      for (nonStructCol <- nonStructCols)
      totalMainArrayBuffer+=s"${nonStructCol.split(",")(0).replace("_",".")} as ${nonStructCol.split(",")(0)}" // replacing _ by . in the original select clause if it's an already-nested column 
      dfTemp.selectExpr(mainArrayBuffer:_*).dtypes.map(x => x.toString.substring(1,x.toString.size-1)).filter(_.split(",",2)(1).contains("Struct")).size 
      match {
      case value if value ==0 => dfTotalOuter.selectExpr(totalMainArrayBuffer:_*)
      case _ => flatten_df_Struct(dfTemp.selectExpr(mainArrayBuffer:_*),dfTotalOuter)
      }
      }
      
      
      def flatten_df(dfTemp:org.apache.spark.sql.DataFrame):org.apache.spark.sql.DataFrame=
      {
      var totalArrayBuffer=collection.mutable.ArrayBuffer[String]()
      val totalNonStructCols=dfTemp.dtypes.map(x => x.toString.substring(1,x.toString.size-1)).filter(!_.split(",",2)(1).contains("Struct")) // in case the column names come with the word Struct embedded in them
      for (totalNonStructCol <- totalNonStructCols)
      totalArrayBuffer+=s"${totalNonStructCol.split(",")(0)}"
      totalMainArrayBuffer.clear
      flatten_df_Struct(dfTemp,dfTemp) // flattened schema is now in totalMainArrayBuffer 
      totalArrayBuffer=totalArrayBuffer++totalMainArrayBuffer
      dfTemp.selectExpr(totalArrayBuffer:_*)
      }
      
      
      flatten_df(dfTotal.withColumn("tempStruct",lit(5))).printSchema
      
      
      
      

      Input file:

      {"num1":1,"num2":2,"bool1":true,"bool2":false,"double1":4.5,"double2":5.6,"str1":"a","str2":"b","arr1":[3,4,5],"map1":{"cool":1,"okay":2,"normal":3},"carInfo":{"Engine":{"Make":"sa","Power":{"IC":"900","battery":"165"},"Redline":"11500"} ,"Tyres":{"Make":"Pirelli","Compound":"c1","Life":"120"}}}
      {"num1":3,"num2":4,"bool1":false,"bool2":false,"double1":4.2,"double2":5.5,"str1":"u","str2":"n","arr1":[6,7,9],"map1":{"fast":1,"medium":2,"agressive":3},"carInfo":{"Engine":{"Make":"na","Power":{"IC":"800","battery":"150"},"Redline":"10000"} ,"Tyres":{"Make":"Pirelli","Compound":"c2","Life":"100"}}}
      {"num1":8,"num2":4,"bool1":true,"bool2":true,"double1":5.7,"double2":7.5,"str1":"t","str2":"k","arr1":[11,12,23],"map1":{"preserve":1,"medium":2,"fast":3},"carInfo":{"Engine":{"Make":"ta","Power":{"IC":"950","battery":"170"},"Redline":"12500"} ,"Tyres":{"Make":"Pirelli","Compound":"c3","Life":"80"}}}
      {"num1":7,"num2":9,"bool1":false,"bool2":true,"double1":33.2,"double2":7.5,"str1":"b","str2":"u","arr1":[12,14,5],"map1":{"normal":1,"preserve":2,"agressive":3},"carInfo":{"Engine":{"Make":"pa","Power":{"IC":"920","battery":"160"},"Redline":"11800"} ,"Tyres":{"Make":"Pirelli","Compound":"c4","Life":"70"}}}
      

      Before:

      root
       |-- arr1: array (nullable = true)
       |    |-- element: long (containsNull = true)
       |-- bool1: boolean (nullable = true)
       |-- bool2: boolean (nullable = true)
       |-- carInfo: struct (nullable = true)
       |    |-- Engine: struct (nullable = true)
       |    |    |-- Make: string (nullable = true)
       |    |    |-- Power: struct (nullable = true)
       |    |    |    |-- IC: string (nullable = true)
       |    |    |    |-- battery: string (nullable = true)
       |    |    |-- Redline: string (nullable = true)
       |    |-- Tyres: struct (nullable = true)
       |    |    |-- Compound: string (nullable = true)
       |    |    |-- Life: string (nullable = true)
       |    |    |-- Make: string (nullable = true)
       |-- double1: double (nullable = true)
       |-- double2: double (nullable = true)
       |-- map1: struct (nullable = true)
       |    |-- agressive: long (nullable = true)
       |    |-- cool: long (nullable = true)
       |    |-- fast: long (nullable = true)
       |    |-- medium: long (nullable = true)
       |    |-- normal: long (nullable = true)
       |    |-- okay: long (nullable = true)
       |    |-- preserve: long (nullable = true)
       |-- num1: long (nullable = true)
       |-- num2: long (nullable = true)
       |-- str1: string (nullable = true)
       |-- str2: string (nullable = true)
      

      After:

      root
       |-- arr1: array (nullable = true)
       |    |-- element: long (containsNull = true)
       |-- bool1: boolean (nullable = true)
       |-- bool2: boolean (nullable = true)
       |-- double1: double (nullable = true)
       |-- double2: double (nullable = true)
       |-- num1: long (nullable = true)
       |-- num2: long (nullable = true)
       |-- str1: string (nullable = true)
       |-- str2: string (nullable = true)
       |-- map1_agressive: long (nullable = true)
       |-- map1_cool: long (nullable = true)
       |-- map1_fast: long (nullable = true)
       |-- map1_medium: long (nullable = true)
       |-- map1_normal: long (nullable = true)
       |-- map1_okay: long (nullable = true)
       |-- map1_preserve: long (nullable = true)
       |-- carInfo_Engine_Make: string (nullable = true)
       |-- carInfo_Engine_Redline: string (nullable = true)
       |-- carInfo_Tyres_Compound: string (nullable = true)
       |-- carInfo_Tyres_Life: string (nullable = true)
       |-- carInfo_Tyres_Make: string (nullable = true)
       |-- carInfo_Engine_Power_IC: string (nullable = true)
       |-- carInfo_Engine_Power_battery: string (nullable = true)
      
      

      Tried it with 2 levels of nesting and it worked.

      【Discussion】:

        【Solution 7】:

        If you need to flatten only struct types, you can use this approach. I don't recommend flattening arrays, as that can lead to duplicate records.

        from pyspark.sql.functions import col
        from pyspark.sql.types import StructType
        
        
        def flatten_schema(schema, prefix=""):
            return_schema = []
            for field in schema.fields:
                if isinstance(field.dataType, StructType):
                    if prefix:
                        return_schema = return_schema + flatten_schema(field.dataType, "{}.{}".format(prefix, field.name))
                    else:
                        return_schema = return_schema + flatten_schema(field.dataType, field.name)
                else:
                    if prefix:
                        field_path = "{}.{}".format(prefix, field.name)
                        return_schema.append(col(field_path).alias(field_path.replace(".", "_")))
                    else:
                        return_schema.append(field.name)
            return return_schema
        

        You can use it as:

        new_schema = flatten_schema(df.schema)
        df1 = df.select(new_schema)
        df1.show()
        

        【Discussion】:

          【Solution 8】:

          Here is a function that does what you want, and that can deal with multiple nested columns containing columns with the same name:

          import pyspark.sql.functions as F
          
          def flatten_df(nested_df):
              flat_cols = [c[0] for c in nested_df.dtypes if c[1][:6] != 'struct']
              nested_cols = [c[0] for c in nested_df.dtypes if c[1][:6] == 'struct']
          
              flat_df = nested_df.select(flat_cols +
                                         [F.col(nc+'.'+c).alias(nc+'_'+c)
                                          for nc in nested_cols
                                          for c in nested_df.select(nc+'.*').columns])
              return flat_df
          

          Before:

          root
           |-- x: string (nullable = true)
           |-- y: string (nullable = true)
           |-- foo: struct (nullable = true)
           |    |-- a: float (nullable = true)
           |    |-- b: float (nullable = true)
           |    |-- c: integer (nullable = true)
           |-- bar: struct (nullable = true)
           |    |-- a: float (nullable = true)
           |    |-- b: float (nullable = true)
           |    |-- c: integer (nullable = true)
          

          After:

          root
           |-- x: string (nullable = true)
           |-- y: string (nullable = true)
           |-- foo_a: float (nullable = true)
           |-- foo_b: float (nullable = true)
           |-- foo_c: integer (nullable = true)
           |-- bar_a: float (nullable = true)
           |-- bar_b: float (nullable = true)
           |-- bar_c: integer (nullable = true)
          

          【Discussion】:

            【Solution 9】:

            The following worked for me in Spark SQL:

            import org.apache.spark.sql.SparkSession
            import org.apache.log4j.{Level, Logger}
            
            object StackOverFlowQuestion {
              def main(args: Array[String]): Unit = {
            
                val logger = Logger.getLogger("FlattenTest")
                Logger.getLogger("org").setLevel(Level.WARN)
                Logger.getLogger("akka").setLevel(Level.WARN)
            
                val spark = SparkSession.builder()
                  .appName("FlattenTest")
                  .config("spark.sql.warehouse.dir", "C:\\Temp\\hive")
                  .master("local[2]")
                  //.enableHiveSupport()
                  .getOrCreate()
                import spark.implicits._
            
                val stringTest =
                  """{
                                           "total_count": 123,
                                           "page_size": 20,
                                           "another_id": "gdbfdbfdbd",
                                           "sen": [{
                                            "id": 123,
                                            "ses_id": 12424343,
                                            "columns": {
                                                "blah": "blah",
                                                "count": 1234
                                            },
                                            "class": {},
                                            "class_timestamps": {},
                                            "sentence": "spark is good"
                                           }]
                                        }
                                         """
                val result = List(stringTest)
                val githubRdd=spark.sparkContext.makeRDD(result)
                val gitHubDF=spark.read.json(githubRdd)
                gitHubDF.show()
                gitHubDF.printSchema()
            
                gitHubDF.registerTempTable("JsonTable")
            
               spark.sql("with cte as" +
                  "(" +
                  "select explode(sen) as senArray  from JsonTable" +
                  "), cte_2 as" +
                  "(" +
                  "select senArray.ses_id,senArray.ses_id,senArray.columns.* from cte" +
                  ")" +
                  "select * from cte_2"
                ).show()
            
                spark.stop()
            }
            
            }
            
            

            Output:

            +----------+---------+--------------------+-----------+
            |another_id|page_size|                 sen|total_count|
            +----------+---------+--------------------+-----------+
            |gdbfdbfdbd|       20|[[[blah, 1234], 1...|        123|
            +----------+---------+--------------------+-----------+
            
            root
             |-- another_id: string (nullable = true)
             |-- page_size: long (nullable = true)
             |-- sen: array (nullable = true)
             |    |-- element: struct (containsNull = true)
             |    |    |-- columns: struct (nullable = true)
             |    |    |    |-- blah: string (nullable = true)
             |    |    |    |-- count: long (nullable = true)
             |    |    |-- id: long (nullable = true)
             |    |    |-- sentence: string (nullable = true)
             |    |    |-- ses_id: long (nullable = true)
             |-- total_count: long (nullable = true)
            
            +--------+--------+----+-----+
            |  ses_id|  ses_id|blah|count|
            +--------+--------+----+-----+
            |12424343|12424343|blah| 1234|
            +--------+--------+----+-----+
            

            【Discussion】:

              【Solution 10】:

              This version of flatten_df flattens the dataframe at every layer level, using a stack to avoid recursive calls:

              from pyspark.sql.functions import col
              
              
              def flatten_df(nested_df):
                  stack = [((), nested_df)]
                  columns = []
              
                  while len(stack) > 0:
                      parents, df = stack.pop()
              
                      flat_cols = [
                          col(".".join(parents + (c[0],))).alias("_".join(parents + (c[0],)))
                          for c in df.dtypes
                          if c[1][:6] != "struct"
                      ]
              
                      nested_cols = [
                          c[0]
                          for c in df.dtypes
                          if c[1][:6] == "struct"
                      ]
              
                      columns.extend(flat_cols)
              
                      for nested_col in nested_cols:
                          projected_df = df.select(nested_col + ".*")
                          stack.append((parents + (nested_col,), projected_df))
              
                  return nested_df.select(columns)
              

              Example:

              from pyspark.sql.types import StringType, StructField, StructType
              
              
              schema = StructType([
                  StructField("some", StringType()),
              
                  StructField("nested", StructType([
                      StructField("nestedchild1", StringType()),
                      StructField("nestedchild2", StringType())
                  ])),
              
                  StructField("renested", StructType([
                      StructField("nested", StructType([
                          StructField("nestedchild1", StringType()),
                          StructField("nestedchild2", StringType())
                      ]))
                  ]))
              ])
              
              data = [
                  {
                      "some": "value1",
                      "nested": {
                          "nestedchild1": "value2",
                          "nestedchild2": "value3",
                      },
                      "renested": {
                          "nested": {
                              "nestedchild1": "value4",
                              "nestedchild2": "value5",
                          }
                      }
                  }
              ]
              
              df = spark.createDataFrame(data, schema)
              flat_df = flatten_df(df)
              print(flat_df.collect())
              

              It prints:

              [Row(some=u'value1', renested_nested_nestedchild1=u'value4', renested_nested_nestedchild2=u'value5', nested_nestedchild1=u'value2', nested_nestedchild2=u'value3')]
              

              【Discussion】:

              • This doesn't seem to recurse into nested structs inside an array.
              • @malthe It won't. Actually, I don't think that's feasible: unless you used array indices as column names (e.g. array.0.field, array.1.field, ...), you would have to know the length of the array in advance. All of these solutions iterate over the dataframe structure, which is known to the driver.
              • I eventually figured out how to do it and posted a script here: stackoverflow.com/a/66482320/647151
              • Oh, so the idea is to keep the array but transform the structs it contains. Nice! (A sketch of that idea follows.)
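
              For reference, that idea can be sketched with the transform higher-order function (in the Python API since Spark 3.1); the array and field names below are hypothetical:

              import pyspark.sql.functions as F

              # Keep the array, but rebuild each struct element with flat fields.
              df2 = df.withColumn(
                  "arr",
                  F.transform("arr", lambda s: F.struct(
                      s["inner"]["a"].alias("inner_a"),
                      s["inner"]["b"].alias("inner_b"),
                  )),
              )
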
              【Solution 11】:

              An easy way is to use SQL: you can build a SQL query string to alias the nested columns as flat ones.

              • Retrieve the dataframe schema (df.schema())
              • Transform the schema into SQL (for (field : schema().fields()) ...)
              • Query:

                val newDF = sqlContext.sql("SELECT " + sqlGenerated + " FROM source")
                

              Here is an example in Java.

              (I prefer the SQL way, so you can easily test it in the spark-shell, and it is cross-language.)
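
              A rough PySpark sketch of the same idea (the helper name is hypothetical; plain column names are assumed, since names with special characters would need backticks):

              from pyspark.sql.types import StructType

              def sql_select_list(schema, prefix=""):
                  # Walk the schema; emit "parent.child AS parent_child" per leaf.
                  parts = []
                  for field in schema.fields:
                      path = prefix + "." + field.name if prefix else field.name
                      if isinstance(field.dataType, StructType):
                          parts += sql_select_list(field.dataType, path)
                      else:
                          parts.append(path + " AS " + path.replace(".", "_"))
                  return parts

              df.createOrReplaceTempView("source")
              sql_generated = ", ".join(sql_select_list(df.schema))
              new_df = spark.sql("SELECT " + sql_generated + " FROM source")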

              【Discussion】:

                【Solution 12】:

                I generalized stecos' solution a bit more, so the flattening can be done for struct layers deeper than two levels:

                from pyspark.sql.functions import col


                def flatten_df(nested_df, layers):
                    flat_cols = []
                    nested_cols = []
                    flat_df = []
                
                    flat_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] != 'struct'])
                    nested_cols.append([c[0] for c in nested_df.dtypes if c[1][:6] == 'struct'])
                
                    flat_df.append(nested_df.select(flat_cols[0] +
                                               [col(nc+'.'+c).alias(nc+'_'+c)
                                                for nc in nested_cols[0]
                                                for c in nested_df.select(nc+'.*').columns])
                                  )
                    for i in range(1, layers):
                        print (flat_cols[i-1])
                        flat_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] != 'struct'])
                        nested_cols.append([c[0] for c in flat_df[i-1].dtypes if c[1][:6] == 'struct'])
                
                        flat_df.append(flat_df[i-1].select(flat_cols[i] +
                                                [col(nc+'.'+c).alias(nc+'_'+c)
                                                    for nc in nested_cols[i]
                                                    for c in flat_df[i-1].select(nc+'.*').columns])
                        )
                
                    return flat_df[-1]
                

                Just call it with:

                my_flattened_df = flatten_df(my_df_having_nested_structs, 3)
                

                (the second parameter is the number of layers to be flattened; in my case it was 3)

                【Discussion】:

                  【Solution 13】:

                  This should work in Spark 1.6 or later:

                  df.select(df.col("data.*"))
                  

                  df.select(df.col("data.id"), df.col("data.keyNote"), df.col("data.details"))
                  
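                  To grab an individual leaf such as key or note under keyNote, dot paths reach any depth. A small PySpark sketch (assuming the question's schema):

                  from pyspark.sql.functions import col

                  # col("a.b.c") follows struct paths down to the leaf field.
                  df.select(col("data.keyNote.key"), col("data.keyNote.note"))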

                  【Discussion】:

                  • Exception in thread "main" org.apache.spark.sql.AnalysisException: No such struct field *
                  • But using select on all the columns like df.select(df.col1, df.col2, df.col3) works, so I'll accept this answer
                  • I was just editing; that's odd. I can use * just fine. Maybe a version issue?
                  • Yes, maybe. I'm using Spark 1.6.1 and Scala 2.10
                  • How do you select key or note under the nested struct keyNote?