[Title]: How to convert rdd to nested json in pyspark
[Posted]: 2019-08-26 10:16:53
[Question]:

I am new to Spark, and I have data in the following format:

Category, Subcategory, Name

Food,Thai,Restaurant A
Food,Thai,Restaurant B
Food, Chinese, Restaurant C
Lodging, Hotel, Hotel A

I would like the data in the following format:

{Category : Food , Subcategories : [ {subcategory : Thai , names : [Restaurant A , Restaurant B] }, {subcategory : Chinese , names : [Restaurant C]}]}

{Category : Lodging , Subcategories : [ {subcategory : Hotel , names : [Hotel A] }]}

Can someone help me solve this using a pyspark RDD?

Thanks!

[Comments]:

    Tags: json apache-spark pyspark apache-spark-sql rdd


    [Solution 1]:

    Here is a solution that may help:

    Create a window function that collects names partitioned by Category and Subcategory:

      from pyspark.sql import functions as F
      from pyspark.sql import Window

      # Window over each (Category, Subcategory) pair
      groupByCateWind = Window.partitionBy("Category", "Subcategory")

      finalDf = df.withColumn("names", F.collect_list("Name").over(groupByCateWind)) \
          .withColumn("Subcategories", F.struct("Subcategory", "names")) \
          .groupBy("Category") \
          .agg(F.collect_set("Subcategories").alias("Subcategories")) \
          .toJSON()
    
    1. Collect the names over the window function defined above.

    2. Create a Subcategories column of struct type from the Subcategory and names columns.

    3. Group by Category again and collect the Subcategories values (collect_set removes duplicate structs).
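
    For intuition, the three steps amount to a two-level group-by. The same nesting can be sketched in plain Python, with the sample rows from the question hard-coded for illustration (no Spark needed):

```python
from collections import defaultdict
import json

# The sample rows from the question, already split into fields
rows = [
    ("Food", "Thai", "Restaurant A"),
    ("Food", "Thai", "Restaurant B"),
    ("Food", "Chinese", "Restaurant C"),
    ("Lodging", "Hotel", "Hotel A"),
]

# Steps 1-2: collect names per (Category, Subcategory) pair
grouped = defaultdict(lambda: defaultdict(list))
for category, subcategory, name in rows:
    grouped[category][subcategory].append(name)

# Step 3: one document per Category, nesting its subcategories
docs = [
    {"Category": category,
     "Subcategories": [{"Subcategory": sub, "names": names}
                       for sub, names in subs.items()]}
    for category, subs in grouped.items()
]

for doc in docs:
    print(json.dumps(doc))
```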

    The output of finalDf looks like:

    +---------------------------------------------------------------------------------------------------------------------------------------------------------+
    |{"Category":"Food","Subcategories":[{"Subcategory":"Thai","names":["Restaurant A","Restaurant B"]},{"Subcategory":" Chinese","names":[" Restaurant C"]}]}|
    |{"Category":"Lodging","Subcategories":[{"Subcategory":" Hotel","names":[" Hotel A"]}]}                                                                   |
    +---------------------------------------------------------------------------------------------------------------------------------------------------------+
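
    Note the leading spaces in " Chinese" and " Hotel" above: they come from the spaces after the commas in the input lines and survive a plain split. Stripping each field when parsing (or applying F.trim to each column in Spark before the aggregation) removes them. A minimal illustration of the parsing issue:

```python
line = "Food, Chinese, Restaurant C"

naive = line.split(",")                            # keeps the leading spaces
clean = [field.strip() for field in line.split(",")]

print(naive[1])   # ' Chinese'
print(clean[1])   # 'Chinese'
```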
    

    [Discussion]:
