[Question title]: How to add a new field to a two-level nested struct column
[Posted]: 2022-08-13 20:46:48
[Question]:

I have a DataFrame with the following schema:

 root
     |-- ts: timestamp (nullable = true)
     |-- address_list: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- id: string (nullable = true)
     |    |    |-- active: integer (nullable = true)
     |    |    |-- address: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- street: string (nullable = true)
     |    |    |    |    |-- city: long (nullable = true)
     |    |    |    |    |-- state: integer (nullable = true)

I want to add a new field, street_2, to one of its nested columns, address_list.address, positioned between street and city.

This is the expected schema:

 root
     |-- ts: timestamp (nullable = true)
     |-- address_list: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- id: string (nullable = true)
     |    |    |-- active: integer (nullable = true)
     |    |    |-- address: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- street: string (nullable = true)
     |    |    |    |    |-- street_2: string (nullable = true)
     |    |    |    |    |-- city: long (nullable = true)
     |    |    |    |    |-- state: integer (nullable = true)

I tried using transform, but street_2 ended up as a field of the address_list elements rather than of address:

df
.withColumn("address_list", transform(col("address_list"), x => x.withField("street_2", lit(null).cast("string"))))

 root
     |-- ts: timestamp (nullable = true)
     |-- address_list: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- id: string (nullable = true)
     |    |    |-- active: integer (nullable = true)
     |    |    |-- address: array (nullable = true)
     |    |    |    |-- element: struct (containsNull = true)
     |    |    |    |    |-- street: string (nullable = true)
     |    |    |    |    |-- city: long (nullable = true)
     |    |    |    |    |-- state: integer (nullable = true)
     |    |    |-- street_2: string (nullable = true)

I want street_2 to be inside address instead, inserted between street and city.

    Tags: scala apache-spark struct apache-spark-sql field


    [Solution 1]:

    You can try this:

    
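    import org.apache.spark.sql.functions.{col, lit, transform}

    data.printSchema

    // withField accepts a dotted path to reach nested struct fields (Spark 3.1+).
    val result = data.withColumn(
      "person_details",
      transform(col("person_details"), x => x.withField("person.details.age", lit(40))))

    result.printSchema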
    
    root
     |-- person_details: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- person: struct (nullable = true)
     |    |    |    |-- name: string (nullable = true)
     |    |    |    |-- details: struct (nullable = true)
     |    |    |    |    |-- city: string (nullable = true)
     |    |    |    |    |-- income: long (nullable = false)
    
    root
     |-- person_details: array (nullable = true)
     |    |-- element: struct (containsNull = true)
     |    |    |-- person: struct (nullable = true)
     |    |    |    |-- name: string (nullable = true)
     |    |    |    |-- details: struct (nullable = true)
     |    |    |    |    |-- city: string (nullable = true)
     |    |    |    |    |-- income: long (nullable = false)
     |    |    |    |    |-- age: integer (nullable = false)
    
    

    I got help from this article: https://medium.com/@fqaiser94/manipulating-nested-data-just-got-easier-in-apache-spark-3-1-1-f88bc9003827
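
The example above adds a field inside nested structs, whereas in the question address is itself an array of structs, so a dotted path alone may not reach it. A minimal sketch of one possible approach, assuming Spark 3.1 or later: nest a second transform over the inner array and rebuild each inner struct with struct(...) so that street_2 lands between street and city. The one-row sample DataFrame and the local SparkSession are hypothetical, built only to mirror the schema in the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit, struct, transform}

// Local session for illustration; in spark-shell the existing session is reused.
val spark = SparkSession.builder()
  .master("local[1]")
  .appName("nested-with-field")
  .getOrCreate()

// Hypothetical one-row sample matching the schema in the question.
val df = spark.sql(
  """SELECT current_timestamp() AS ts,
    |       array(named_struct(
    |         'id', 'a1',
    |         'active', 1,
    |         'address', array(named_struct('street', 'Main St', 'city', 10L, 'state', 5))
    |       )) AS address_list""".stripMargin)

val result = df.withColumn(
  "address_list",
  // Outer transform: visit each element of address_list.
  transform(col("address_list"), outer =>
    // Replacing the existing "address" field keeps its position in the struct.
    outer.withField(
      "address",
      // Inner transform: rebuild each address struct in the desired field order.
      transform(outer.getField("address"), inner =>
        struct(
          inner.getField("street").as("street"),
          lit(null).cast("string").as("street_2"),
          inner.getField("city").as("city"),
          inner.getField("state").as("state"))))))

result.printSchema()
```

Rebuilding the struct means every inner field has to be listed explicitly, which is the price of controlling the order; withField on the inner struct would be shorter but would append street_2 after state. Note also that a null element in address would come out as a struct of nulls here, so a null check may be needed if that distinction matters.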
