[Posted]: 2022-01-20 06:08:26
[Question]:
I have been looking into how, in pure Spark SQL only, and ideally using the existing Spark SQL ARRAY functions with lambda logic, I can reduce an array of sales lines carrying time-based changes (keyed by LineId) down to just the 'latest' entry (MAX(Occurred)) per key (LineId), i.e. the current state of each line.
I would really like to use the dedicated ARRAY functions (SLICE, FILTER, REDUCE, etc., perhaps with a SQL UDF?) rather than the classic SQL approach of un-nesting, ROW_NUMBER() OVER ..., filtering WHERE rowNum = 1, and re-aggregating what remains back into the logical array type.
The result should be a second array that reduces the initial 5-row array, with its duplicates, down to the latest 3 rows.
Many thanks.
[
{
"OrdNum": "SalesOrderHeader_H1",
"LineId": "SalesOrderLine_L2",
"Occurred": "2021-12-01T00:00:00.000+0000",
"Prod": "P2",
"Amt": 34,
"Qty": 2
},
{
"OrdNum": "SalesOrderHeader_H1",
"LineId": "SalesOrderLine_L1",
"Occurred": "2021-12-01T00:00:00.000+0000",
"Prod": "P1",
"Amt": 13,
"Qty": 2
},
{
"OrdNum": "SalesOrderHeader_H1",
"LineId": "SalesOrderLine_L2",
"Occurred": "2021-12-02T00:00:00.000+0000",
"Prod": "P2",
"Amt": 17,
"Qty": 1
},
{
"OrdNum": "SalesOrderHeader_H1",
"LineId": "SalesOrderLine_L3",
"Occurred": "2021-12-02T00:00:00.000+0000",
"Prod": "P3",
"Amt": 100,
"Qty": 5
},
{
"OrdNum": "SalesOrderHeader_H1",
"LineId": "SalesOrderLine_L3",
"Occurred": "2021-12-03T00:00:00.000+0000",
"Prod": "P3",
"Amt": 60,
"Qty": 3
}
]
The result should contain only 3 unique rows (one each for SalesOrderLine_L1, SalesOrderLine_L2, and SalesOrderLine_L3), where each row is the 'latest' one per key, i.e. MAX(Occurred). L2 and L3 each have one extra earlier row, which is filtered out.
[
{
"OrdNum": "SalesOrderHeader_H1",
"LineId": "SalesOrderLine_L1",
"Occurred": "2021-12-01T00:00:00.000+0000",
"Prod": "P1",
"Amt": 13,
"Qty": 2
},
{
"OrdNum": "SalesOrderHeader_H1",
"LineId": "SalesOrderLine_L2",
"Occurred": "2021-12-02T00:00:00.000+0000",
"Prod": "P2",
"Amt": 17,
"Qty": 1
},
{
"OrdNum": "SalesOrderHeader_H1",
"LineId": "SalesOrderLine_L3",
"Occurred": "2021-12-03T00:00:00.000+0000",
"Prod": "P3",
"Amt": 60,
"Qty": 3
}
]
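The intended per-key reduction can be sketched in plain Python on the sample data above. This is a minimal sketch of the *logic* only, assuming ISO-8601 timestamps in a uniform format (so string comparison orders them correctly); the Spark SQL higher-order expression noted in the docstring is an untested candidate, not a verified query.

```python
def latest_per_line(rows):
    """Keep, for each LineId, only the row with the maximum Occurred timestamp.

    Mirrors the kind of Spark SQL higher-order expression the question asks
    about (an untested assumption, not a confirmed answer):
      FILTER(lines, x -> x.Occurred = array_max(
          TRANSFORM(FILTER(lines, y -> y.LineId = x.LineId), y -> y.Occurred)))
    """
    best = {}
    for r in rows:
        key = r["LineId"]
        # ISO-8601 strings in the same format compare correctly as strings.
        if key not in best or r["Occurred"] > best[key]["Occurred"]:
            best[key] = r
    return sorted(best.values(), key=lambda r: r["LineId"])


# The 5 sample rows from the question (OrdNum/Prod omitted for brevity).
lines = [
    {"LineId": "SalesOrderLine_L2", "Occurred": "2021-12-01T00:00:00.000+0000", "Amt": 34, "Qty": 2},
    {"LineId": "SalesOrderLine_L1", "Occurred": "2021-12-01T00:00:00.000+0000", "Amt": 13, "Qty": 2},
    {"LineId": "SalesOrderLine_L2", "Occurred": "2021-12-02T00:00:00.000+0000", "Amt": 17, "Qty": 1},
    {"LineId": "SalesOrderLine_L3", "Occurred": "2021-12-02T00:00:00.000+0000", "Amt": 100, "Qty": 5},
    {"LineId": "SalesOrderLine_L3", "Occurred": "2021-12-03T00:00:00.000+0000", "Amt": 60, "Qty": 3},
]

result = latest_per_line(lines)
```

On the sample data this yields the 3 expected rows (L1 at 2021-12-01, L2 at 2021-12-02, L3 at 2021-12-03). Note the quadratic FILTER/TRANSFORM form re-scans the array per element; for small per-order line arrays that is usually acceptable.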
[Discussion]:
Tags: apache-spark apache-spark-sql