【Posted】: 2019-09-02 12:48:46
【Problem Description】:
I am using Spark 2.1. I have a DataFrame with this schema:
scala> df.printSchema
root
 |-- id: integer (nullable = true)
 |-- sum: integer (nullable = true)
 |-- distribution: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- lower: integer (nullable = true)
 |    |    |-- upper: integer (nullable = true)
 |    |    |-- count: integer (nullable = true)
I want to aggregate:
- group by the "id" column
- sum the "sum" column, and sum the "count" values inside "distribution", grouped by "lower" and "upper"
Here I cannot simply explode the DataFrame, because I would get duplicate rows and could no longer sum the "sum" column correctly. One possibility is to aggregate the distributions separately and then join back by "id" (see the sketch after the expected output below), but a user-defined function would be simpler.
As input I have:
scala> df.show(false)
+---+---+------------------------------------------------------------+
|id |sum|distribution |
+---+---+------------------------------------------------------------+
|1 |1 |[[0,1,2]] |
|1 |1 |[[1,2,5]] |
|1 |7 |[[0,1,1], [1,2,6]] |
|1 |7 |[[0,1,5], [1,2,1], [2,3,1]] |
|2 |1 |[[0,1,1]] |
|2 |2 |[[0,1,1], [1,2,1]] |
|2 |1 |[[0,1,1]] |
|2 |1 |[[2,3,1]] |
|2 |1 |[[0,1,1]] |
|2 |4 |[[0,1,1], [1,2,1], [2,3,1], [3,4,1]] |
+---+---+------------------------------------------------------------+
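For reproduction, here is a minimal sketch that builds this input in the spark-shell (the Bucket case class name is only illustrative, and field nullability may differ slightly from the schema above):

import spark.implicits._

// Illustrative element type for the "distribution" array of structs.
case class Bucket(lower: Int, upper: Int, count: Int)

val df = Seq(
  (1, 1, Seq(Bucket(0, 1, 2))),
  (1, 1, Seq(Bucket(1, 2, 5))),
  (1, 7, Seq(Bucket(0, 1, 1), Bucket(1, 2, 6))),
  (1, 7, Seq(Bucket(0, 1, 5), Bucket(1, 2, 1), Bucket(2, 3, 1))),
  (2, 1, Seq(Bucket(0, 1, 1))),
  (2, 2, Seq(Bucket(0, 1, 1), Bucket(1, 2, 1))),
  (2, 1, Seq(Bucket(0, 1, 1))),
  (2, 1, Seq(Bucket(2, 3, 1))),
  (2, 1, Seq(Bucket(0, 1, 1))),
  (2, 4, Seq(Bucket(0, 1, 1), Bucket(1, 2, 1), Bucket(2, 3, 1), Bucket(3, 4, 1)))
).toDF("id", "sum", "distribution")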
Expected output:
+---+---+------------------------------------------------------------+
|id |sum|distribution |
+---+---+------------------------------------------------------------+
|1 |16 |[[0,1,8], [1,2,12], [2,3,1]] |
|2 |10 |[[0,1,5], [1,2,2], [2,3,3], [3,4,1]] |
+---+---+------------------------------------------------------------+
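This is roughly the "aggregate separately and join" possibility I mean, as a sketch rather than the user-defined function I am asking about (sort_array is only there to keep the buckets in a readable order):

import org.apache.spark.sql.functions._

// 1. Sum "sum" per id on the un-exploded rows, so nothing is double-counted.
val sums = df.groupBy("id").agg(sum("sum").as("sum"))

// 2. In a separate pass, explode "distribution", sum "count" per
//    (id, lower, upper) bucket, and re-assemble the array of structs.
val dists = df
  .select(col("id"), explode(col("distribution")).as("d"))
  .groupBy(col("id"), col("d.lower"), col("d.upper"))
  .agg(sum("d.count").as("count"))
  .groupBy("id")
  .agg(sort_array(collect_list(struct("lower", "upper", "count"))).as("distribution"))

// 3. Join the two aggregates back on "id".
val result = sums.join(dists, Seq("id"))
result.show(false)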
【Discussion】:
Tags: scala apache-spark user-defined-functions distribution