【Question title】: Explode a JSON column using PySpark
【Posted】: 2023-01-03 15:22:42
【Question】:

I have the following DataFrame:

+-----------------------------------------------------------------------------------------------+-----------------------+
|value                                                                                          |timestamp              |
+-----------------------------------------------------------------------------------------------+-----------------------+
|{"after":{"id":1001,"first_name":"Sally","last_name":"Thomas","email":"sally.thomas@acme.com"}}|2023-01-03 11:02:11.975|
|{"after":{"id":1002,"first_name":"George","last_name":"Bailey","email":"gbailey@foobar.com"}}  |2023-01-03 11:02:11.976|
|{"after":{"id":1003,"first_name":"Edward","last_name":"Walker","email":"ed@walker.com"}}       |2023-01-03 11:02:11.976|
|{"after":{"id":1004,"first_name":"Anne","last_name":"Kretchmar","email":"annek@noanswer.org"}} |2023-01-03 11:02:11.976|
+-----------------------------------------------------------------------------------------------+-----------------------+
root
 |-- value: string (nullable = true)
 |-- timestamp: timestamp (nullable = true)

Expected result using PySpark:

+---------+-------------+-------------+-----------------------+
| id      | first_name  | last_name   | email                 |
+---------+-------------+-------------+-----------------------+
| 1001    | Sally       | Thomas      | sally.thomas@acme.com |
| 1002    | George      | Bailey      | gbailey@foobar.com    |
| 1003    | Edward      | Walker      | ed@walker.com         |
| 1004    | Anne        | Kretchmar   | annek@noanswer.org    |
+---------+-------------+-------------+-----------------------+

Any help is appreciated.

【Comments】:

    Tags: python apache-spark pyspark


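    【Solution 1】:

    You can use PySpark's from_json function to parse the JSON string. The function needs the schema of the JSON it is parsing. In your case that schema is a struct nested inside a struct: an outer struct with a single field, after, whose value is a struct holding id, first_name, last_name and email.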

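    from pyspark.sql import functions as func

    # data_sdf is the input DataFrame. In this answer's test data the timestamp
    # column is named `ts`; in the question's DataFrame it is `timestamp`.
    (data_sdf
        .withColumn('parsed_json',
                    func.from_json('value',
                                   'after struct<id: bigint, first_name: string, last_name: string, email: string>'))
        # Pull the nested `after` struct up to a top-level column.
        .withColumn('inner_struct', func.col('parsed_json.after'))
        # `inner_struct.*` expands every field of the struct into its own column.
        .selectExpr('ts', 'inner_struct.*')
        .show(truncate=False))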
    
    # +-----------------------+----+----------+---------+---------------------+
    # |ts                     |id  |first_name|last_name|email                |
    # +-----------------------+----+----------+---------+---------------------+
    # |2023-01-03 11:02:11.975|1001|Sally     |Thomas   |sally.thomas@acme.com|
    # |2023-01-03 11:02:11.976|1002|George    |Bailey   |gbailey@foobar.com   |
    # |2023-01-03 11:02:11.976|1003|Edward    |Walker   |ed@walker.com        |
    # |2023-01-03 11:02:11.976|1004|Anne      |Kretchmar|annek@noanswer.org   |
    # +-----------------------+----+----------+---------+---------------------+
    

    Before the final selection, the intermediate DataFrame with the parsed columns looks like this:

    # Same pipeline without the final selectExpr, to show the intermediate columns.
    (data_sdf
        .withColumn('parsed_json',
                    func.from_json('value',
                                   'after struct<id: bigint, first_name: string, last_name: string, email: string>'))
        .withColumn('inner_struct', func.col('parsed_json.after'))
        .show(truncate=False))
    
    # +-----------------------------------------------------------------------------------------------+-----------------------+----------------------------------------------+--------------------------------------------+
    # |value                                                                                          |ts                     |parsed_json                                   |inner_struct                                |
    # +-----------------------------------------------------------------------------------------------+-----------------------+----------------------------------------------+--------------------------------------------+
    # |{"after":{"id":1001,"first_name":"Sally","last_name":"Thomas","email":"sally.thomas@acme.com"}}|2023-01-03 11:02:11.975|{{1001, Sally, Thomas, sally.thomas@acme.com}}|{1001, Sally, Thomas, sally.thomas@acme.com}|
    # |{"after":{"id":1002,"first_name":"George","last_name":"Bailey","email":"gbailey@foobar.com"}}  |2023-01-03 11:02:11.976|{{1002, George, Bailey, gbailey@foobar.com}}  |{1002, George, Bailey, gbailey@foobar.com}  |
    # |{"after":{"id":1003,"first_name":"Edward","last_name":"Walker","email":"ed@walker.com"}}       |2023-01-03 11:02:11.976|{{1003, Edward, Walker, ed@walker.com}}       |{1003, Edward, Walker, ed@walker.com}       |
    # |{"after":{"id":1004,"first_name":"Anne","last_name":"Kretchmar","email":"annek@noanswer.org"}} |2023-01-03 11:02:11.976|{{1004, Anne, Kretchmar, annek@noanswer.org}} |{1004, Anne, Kretchmar, annek@noanswer.org} |
    # +-----------------------------------------------------------------------------------------------+-----------------------+----------------------------------------------+--------------------------------------------+
    
    # root
    #  |-- value: string (nullable = true)
    #  |-- ts: string (nullable = true)
    #  |-- parsed_json: struct (nullable = true)
    #  |    |-- after: struct (nullable = true)
    #  |    |    |-- id: long (nullable = true)
    #  |    |    |-- first_name: string (nullable = true)
    #  |    |    |-- last_name: string (nullable = true)
    #  |    |    |-- email: string (nullable = true)
    #  |-- inner_struct: struct (nullable = true)
    #  |    |-- id: long (nullable = true)
    #  |    |-- first_name: string (nullable = true)
    #  |    |-- last_name: string (nullable = true)
    #  |    |-- email: string (nullable = true)
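
The two levels of nesting in the schema can also be checked outside Spark. As a small illustration, this parses one `value` string from the question with Python's standard `json` module; the variable names are only for this example.

```python
import json

# One `value` string copied from the question's DataFrame.
raw = '{"after":{"id":1001,"first_name":"Sally","last_name":"Thomas","email":"sally.thomas@acme.com"}}'

record = json.loads(raw)   # outer object: a single key, "after"
inner = record["after"]    # inner object: the fields that become columns

print(list(record))  # ['after']
print(list(inner))   # ['id', 'first_name', 'last_name', 'email']
```

The outer key corresponds to the `after` field of `parsed_json`, and the inner keys correspond to the fields of `inner_struct`.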
    

    【Comments】:
