【Posted】: 2023-04-02 06:52:01
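【Question】:
I have a table (shown below) and need to do the following:
1. Group by DepartmenId and EmolyeeID.
2. Within each group, order by (ArrivalDate, ArrivalTime) and pick the first row, i.e. the most recent one. If two rows have different dates, pick the later date. If the dates are the same, pick the later time.
This is the approach I am trying: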
input.select("DepartmenId","EmolyeeID", "ArrivalDate", "ArrivalTime", "Word")
.agg(here will be the function that handles logic from 2)
.show()
What is the syntax for the aggregation here?
Thank you in advance.
// +-----------+---------+-----------+-----------+--------+
// |DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime| Word |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E1 | 20170101 | 0730 | "YES" |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E1 | 20170102 | 1530 | "NO" |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E2 | 20170101 | 0730 | "ZOO" |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E2 | 20170102 | 0330 | "BOO" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E1 | 20170101 | 0730 | "LOL" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E1 | 20170101 | 1830 | "ATT" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E2 | 20170105 | 1430 | "UNI" |
// +-----------+---------+-----------+-----------+--------+
// output should be
// +-----------+---------+-----------+-----------+--------+
// |DepartmenId|EmolyeeID|ArrivalDate|ArrivalTime| Word |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E1 | 20170102 | 1530 | "NO" |
// +-----------+---------+-----------+-----------+--------+
// | D1 | E2 | 20170102 | 0330 | "BOO" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E1 | 20170101 | 1830 | "ATT" |
// +-----------+---------+-----------+-----------+--------+
// | D2 | E2 | 20170105 | 1430 | "UNI" |
// +-----------+---------+-----------+-----------+--------+
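One possible way to express "sort within each group and keep the first row" is a window function rather than `agg`. The sketch below is only an illustration under a few assumptions: `ArrivalDate` and `ArrivalTime` are zero-padded strings (`yyyyMMdd` and `HHmm`), so string ordering matches chronological ordering; the column names are spelled exactly as in the table above; and the object name `LatestArrival`, the method `latestArrivals`, and the helper column `rn` are hypothetical names introduced for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object LatestArrival {

  // Keeps, for every (DepartmenId, EmolyeeID) pair, the row with the
  // latest ArrivalDate and, among equal dates, the latest ArrivalTime.
  def latestArrivals(input: DataFrame): DataFrame = {
    // Assumption: both columns are zero-padded strings, so descending
    // string order is the same as newest-first order.
    val newestFirst = Window
      .partitionBy("DepartmenId", "EmolyeeID")
      .orderBy(col("ArrivalDate").desc, col("ArrivalTime").desc)

    input
      .withColumn("rn", row_number().over(newestFirst)) // 1 = newest row in the group
      .where(col("rn") === 1)
      .drop("rn")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("LatestArrival")
      .getOrCreate()
    import spark.implicits._

    // Sample rows copied from the question.
    val input = Seq(
      ("D1", "E1", "20170101", "0730", "YES"),
      ("D1", "E1", "20170102", "1530", "NO"),
      ("D1", "E2", "20170101", "0730", "ZOO"),
      ("D1", "E2", "20170102", "0330", "BOO"),
      ("D2", "E1", "20170101", "0730", "LOL"),
      ("D2", "E1", "20170101", "1830", "ATT"),
      ("D2", "E2", "20170105", "1430", "UNI")
    ).toDF("DepartmenId", "EmolyeeID", "ArrivalDate", "ArrivalTime", "Word")

    latestArrivals(input).orderBy("DepartmenId", "EmolyeeID").show()

    spark.stop()
  }
}
```

On the sample data this should keep the rows whose `Word` is NO, BOO, ATT and UNI, which matches the expected output above. If two rows in a group share the same date and time, `row_number` picks one of them arbitrarily. An alternative that stays closer to the `agg` idea is `groupBy(...)` followed by `max` over a `struct` of the ordering columns and the payload, but the window version keeps all original columns without unpacking.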
【Discussion】:
Tags: scala apache-spark apache-spark-sql