【Question Title】:Spark SQL - aggregate columns into dictionary
【Posted】:2021-03-28 17:03:45
【Question Description】:

I have the following dataframe:

spark.sql("""
SELECT id, color, cnt
FROM ( 
  VALUES ('A','green', 5),
         ('A','yellow', 4),
         ('A','yellow',2),
         ('B','blue', 3),
         ('B','green',4),
         ('B','blue',1) 
) as T (id, color, cnt)
""")

and would like to aggregate it so that, for each id key, I get one dictionary with the per-color counts and another with the per-color sums of the cnt column. So the output would be:

+---+-------------------+---------------------+
| id|          color_cnt|            color_sum|
+---+-------------------+---------------------+
|  B|{blue:2, green:1}  |{blue:4, green:4}    |
|  A|{green:1, yellow:2}|{green:5, yellow:6}  |
+---+-------------------+---------------------+

Is there any Spark SQL function that can help me achieve this? Thanks!

【Comments】:

    Tags: sql apache-spark apache-spark-sql


    【Solution 1】:

    The to_json and map_from_arrays functions will help you here. If you want a map type in the dataframe instead of a JSON string, just drop to_json:

    spark.sql("""
    SELECT
        id,
        to_json(map_from_arrays(collect_list(color), collect_list(count))) count,
        to_json(map_from_arrays(collect_list(color), collect_list(sum))) sum 
    FROM (
    
        SELECT id, color, count(1) count, sum(cnt) sum
        FROM ( 
          VALUES ('A','green', 5),
                 ('A','yellow', 4),
                 ('A','yellow',2),
                 ('B','blue', 3),
                 ('B','green',4),
                 ('B','blue',1) 
        ) as T (id, color, cnt)
        GROUP BY id, color)
    
    GROUP BY id
    """).show(truncate=False)
    
    +---+----------------------+----------------------+
    |id |count                 |sum                   |
    +---+----------------------+----------------------+
    |B  |{"blue":2,"green":1}  |{"blue":4,"green":4}  |
    |A  |{"yellow":2,"green":1}|{"yellow":6,"green":5}|
    +---+----------------------+----------------------+
    
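    The query above aggregates in two levels: an inner GROUP BY id, color produces the per-color count and sum, and an outer GROUP BY id zips the collected colors and values into maps via map_from_arrays. As a sanity check of the expected values, the same two-level logic can be mirrored in plain Python (no Spark needed; a sketch for illustration only):

    ```python
    from collections import defaultdict

    rows = [('A', 'green', 5), ('A', 'yellow', 4), ('A', 'yellow', 2),
            ('B', 'blue', 3), ('B', 'green', 4), ('B', 'blue', 1)]

    # Level 1: count and sum per (id, color), like the inner GROUP BY id, color.
    counts = defaultdict(int)
    sums = defaultdict(int)
    for id_, color, cnt in rows:
        counts[(id_, color)] += 1
        sums[(id_, color)] += cnt

    # Level 2: collect per-id dicts, like map_from_arrays over collect_list.
    color_cnt = defaultdict(dict)
    color_sum = defaultdict(dict)
    for (id_, color), c in counts.items():
        color_cnt[id_][color] = c
        color_sum[id_][color] = sums[(id_, color)]

    print(dict(color_cnt))  # {'A': {'green': 1, 'yellow': 2}, 'B': {'blue': 2, 'green': 1}}
    print(dict(color_sum))  # {'A': {'green': 5, 'yellow': 6}, 'B': {'blue': 4, 'green': 4}}
    ```

    This matches the Spark output above (row order aside); note that, as in the Spark result, the key order inside each map is not guaranteed.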

    【Discussion】:
