【Question title】: Converting a dataframe to a nested JSON output
【Posted】: 2021-07-14 09:52:06
【Question】:

I have a dataframe derived from my data that looks something like this:

id  identifier  actual_cost  cost_incurred  timestamp
1   abc123      24           21             2021-04-16T19:07:00
2   xyz987      12           34             2021-04-16T19:25:27
2   xyz987      92           87             2021-04-16T19:32:43
1   abc123      37           39             2021-04-16T19:26:30
3   abc567      87           85             2021-04-16T19:13:00

My requirement is that the final dump file should contain the entire dataframe as nested JSON, in this format:

{
"hits": [
    {
        "id": 1,
        "identifier": "abc123",
        "cost": [
            {
                "actual_cost": 24,
                "cost_incurred": 21,
                "timestamp": "2021-04-16T19:07:00"
            },
            {
                "actual_cost": 37,
                "cost_incurred": 39,
                "timestamp": "2021-04-16T19:26:30"
            }
        ]
    },
    {
        "id": 2,
        "identifier": "xyz987",
        "cost": [
            {
                "actual_cost": 12,
                "cost_incurred": 34,
                "timestamp": "2021-04-16T19:25:27"
            },
            {
                "actual_cost": 92,
                "cost_incurred": 87,
                "timestamp": "2021-04-16T19:32:43"
            }
        ]
    },
    {
        "id": 3,
        "identifier": "abc567",
        "cost": [
            {
                "actual_cost": 87,
                "cost_incurred": 85,
                "timestamp": "2021-04-16T19:13:00"
            }
        ]
    }
]
}

I have been looking at the map function, but I don't know how to group the results. Any leads or solutions would be much appreciated.

【Comments】:

    Tags: json dataframe apache-spark apache-spark-sql


    【Solution 1】:

    to_json will be your friend :) along with some grouping and aggregation:

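    # Register the dataframe as a temporary view so Spark SQL can query it
    df.createOrReplaceTempView("df")
    
    result = spark.sql("""
        -- outer query: gather every per-group item into one "hits" array and serialize it
        select 
            to_json(struct(collect_list(item) hits)) result 
        from (
            -- inner query: one struct per (id, identifier), with its cost rows nested as an array
            select 
                struct(
                    id, identifier, collect_list(struct(actual_cost, cost_incurred, timestamp)) cost
                ) item 
            from df 
            group by id, identifier
        )
    """)
    
    # truncate=False prints the full JSON string instead of cutting it off at 20 characters
    result.show(truncate=False)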
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |result                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
    |{"hits":[{"id":"2","identifier":"xyz987","cost":[{"actual_cost":"12","cost_incurred":"34","timestamp":"2021-04-16T19:25:27"},{"actual_cost":"92","cost_incurred":"87","timestamp":"2021-04-16T19:32:43"}]},{"id":"1","identifier":"abc123","cost":[{"actual_cost":"24","cost_incurred":"21","timestamp":"2021-04-16T19:07:00"},{"actual_cost":"37","cost_incurred":"39","timestamp":"2021-04-16T19:26:30"}]},{"id":"3","identifier":"abc567","cost":[{"actual_cost":"87","cost_incurred":"85","timestamp":"2021-04-16T19:13:00"}]}]}|
    +------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
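To make the two levels of grouping in the query above concrete, here is a plain-Python sketch of the same nesting applied to the sample rows from the question. It is not Spark code and only suits data small enough to hold in memory; the names `rows` and `nest_hits` are illustrative and do not come from the original answer.

```python
import json

# Sample rows from the question: (id, identifier, actual_cost, cost_incurred, timestamp)
rows = [
    (1, "abc123", 24, 21, "2021-04-16T19:07:00"),
    (2, "xyz987", 12, 34, "2021-04-16T19:25:27"),
    (2, "xyz987", 92, 87, "2021-04-16T19:32:43"),
    (1, "abc123", 37, 39, "2021-04-16T19:26:30"),
    (3, "abc567", 87, 85, "2021-04-16T19:13:00"),
]


def nest_hits(rows):
    """Group rows by (id, identifier) and nest the cost fields under "cost".

    This mirrors GROUP BY id, identifier plus collect_list(struct(...)),
    followed by wrapping all groups in a single "hits" array.
    """
    groups = {}  # (id, identifier) -> list of cost dicts, in first-seen order
    for id_, identifier, actual_cost, cost_incurred, timestamp in rows:
        groups.setdefault((id_, identifier), []).append({
            "actual_cost": actual_cost,
            "cost_incurred": cost_incurred,
            "timestamp": timestamp,
        })
    hits = [
        {"id": id_, "identifier": identifier, "cost": cost}
        for (id_, identifier), cost in groups.items()
    ]
    return {"hits": hits}


document = nest_hits(rows)
print(json.dumps(document, indent=4))
```

Unlike this sketch, Spark does not guarantee the order of groups or of the elements inside `collect_list`, so sort explicitly if the order of the output matters.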
    

    【Comments】:

      【Solution 2】:

      Here is how to do it with groupBy, some aggregation, and toJSON:

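      import org.apache.spark.sql.functions.{col, collect_list, struct, to_json}

      // One row per (id, identifier), with its cost rows collected into an array of structs
      val result = df.groupBy("id", "identifier")
        .agg(collect_list(struct("actual_cost", "cost_incurred", "timestamp")) as "cost")
      // toJSON serializes each row into one JSON string
      val resultDf = result.toJSON
      resultDf.show(false)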
      

      Result:

      +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      |value                                                                                                                                                                                  |
      +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      |{"id":2,"identifier":"xyz987","cost":[{"actual_cost":12,"cost_incurred":34,"timestamp":"2021-04-16T19:25:27"},{"actual_cost":92,"cost_incurred":87,"timestamp":"2021-04-16T19:32:43"}]}|
      |{"id":1,"identifier":"abc123","cost":[{"actual_cost":24,"cost_incurred":21,"timestamp":"2021-04-16T19:07:00"},{"actual_cost":37,"cost_incurred":39,"timestamp":"2021-04-16T19:26:30"}]}|
      |{"id":3,"identifier":"abc567","cost":[{"actual_cost":87,"cost_incurred":85,"timestamp":"2021-04-16T19:13:00"}]}                                                                        |
      +---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
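The output above is one JSON string per group, not yet the single `{"hits": [...]}` document the question asks for. If the grouped result is small enough to collect to the driver, one way to finish the job is to wrap the strings there. The sketch below uses Python for consistency with the first answer; `json_rows` is a hypothetical stand-in for the collected strings and `wrap_hits` is an illustrative helper, neither from the original answer.

```python
import json

# Hypothetical stand-in for the per-row JSON strings collected from the result above
json_rows = [
    '{"id":2,"identifier":"xyz987","cost":[{"actual_cost":12,"cost_incurred":34,"timestamp":"2021-04-16T19:25:27"},{"actual_cost":92,"cost_incurred":87,"timestamp":"2021-04-16T19:32:43"}]}',
    '{"id":1,"identifier":"abc123","cost":[{"actual_cost":24,"cost_incurred":21,"timestamp":"2021-04-16T19:07:00"},{"actual_cost":37,"cost_incurred":39,"timestamp":"2021-04-16T19:26:30"}]}',
    '{"id":3,"identifier":"abc567","cost":[{"actual_cost":87,"cost_incurred":85,"timestamp":"2021-04-16T19:13:00"}]}',
]


def wrap_hits(json_rows):
    """Parse each per-group JSON string and wrap them all under a "hits" key."""
    return json.dumps({"hits": [json.loads(row) for row in json_rows]}, indent=4)


wrapped = wrap_hits(json_rows)
print(wrapped)
```

For results too large to collect, it is safer to do the wrapping inside Spark, as the single-row aggregation in this answer does.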
      

      If you want everything in a single row, aggregate once more:

      result.agg(to_json(collect_list(struct(result.columns.map(col): _*))).as("hits"))
      .show(false)
      

      Result:

      +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      |hits                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
      +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      |[{"id":2,"identifier":"xyz987","cost":[{"actual_cost":12,"cost_incurred":34,"timestamp":"2021-04-16T19:25:27"},{"actual_cost":92,"cost_incurred":87,"timestamp":"2021-04-16T19:32:43"}]},{"id":1,"identifier":"abc123","cost":[{"actual_cost":24,"cost_incurred":21,"timestamp":"2021-04-16T19:07:00"},{"actual_cost":37,"cost_incurred":39,"timestamp":"2021-04-16T19:26:30"}]},{"id":3,"identifier":"abc567","cost":[{"actual_cost":87,"cost_incurred":85,"timestamp":"2021-04-16T19:13:00"}]}]|
      +-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
      

      【Comments】:
