Calculating Rolling Weekly Spend in Hive Using Window Functions

【Question title】: Calculating Rolling Weekly Spend in Hive using Window Functions
【Posted】: 2019-10-29 10:41:10
【Question】:

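I need to build a distribution of customers' one-week spend. Each time a customer makes a purchase, I want to know how much they have spent with us over the past week. I'd like to do this in my Hive code.

My dataset looks somewhat like this: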

Spend_Table

Cust_ID | Purch_Date | Purch_Amount  
1 | 1/1/19 | $10  
1 | 1/2/19 | $21  
1 | 1/3/19 | $30  
1 | 1/4/19 | $11  
1 | 1/5/19 | $21  
1 | 1/6/19 | $31  
1 | 1/7/19 | $41  
2 | 1/1/19 | $12  
2 | 1/2/19 | $22  
2 | 1/3/19 | $32  
2 | 1/5/19 | $42  
2 | 1/7/19 | $52  
2 | 1/9/19 | $62  
2 | 1/11/19 | $72

So far I have tried code similar to the following:

Select Cust_ID, 
Purch_Date, 
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date) range between 604800 preceding and current row) as Rolling_Spend
from Spend_Table

The output I am looking for is:



Cust_ID | Purch_Date | Purch_Amount | Rolling_Spend  
1 | 1/1/19 | $10 | $10  
1 | 1/2/19 | $21 | $31  
1 | 1/3/19 | $30 | $61  
1 | 1/4/19 | $11 | $72  
1 | 1/5/19 | $21 | $93  
1 | 1/6/19 | $31 | $124  
1 | 1/7/19 | $41 | $165  
2 | 1/1/19 | $12 | $12  
2 | 1/2/19 | $22 | $34  
2 | 1/3/19 | $32 | $66  
2 | 1/5/19 | $42 | $108  
2 | 1/7/19 | $52 | $160  
2 | 1/9/19 | $62 | $188  
2 | 1/11/19 | $72 | $228

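I think the problem is in my RANGE BETWEEN clause, because it seems to be picking up a number of preceding rows. I want it to pick up data within a preceding number of seconds instead (604800 seconds is 7 days; a 6-day lookback would be 518400 seconds).

Is what I'm trying to do possible? I can't use the preceding 6 rows, because not every customer makes a purchase every day, as with customer 2. Any help is greatly appreciated!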

【Question discussion】:

Tags: hadoop hive window-functions partition

【Solution 1】:

SELECT *, sum(Purch_Amount) OVER (
        PARTITION BY Cust_ID 
        ORDER BY CAST(Purch_Date AS timestamp) 
        RANGE BETWEEN INTERVAL 7 DAYS PRECEDING AND CURRENT ROW
     ) AS cumulativeSum FROM Spend_Table

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

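【Discussion】:

I tried running the code but got the following error: ParseException line 13:129 cannot identify input near 'INTERVAL' '7' 'DAYS' in windowframeboundary. I'm running the Hive code through Hue.
You need to cast the date to a timestamp.
I do have the cast to timestamp in there (I double-checked to be sure), but I still get the same error message.
Could it depend on which Hive version you are running? issues.apache.org/jira/browse/HIVE-10911
@Achyuth and mazaneicha, thanks for your help. I'm not sure which Hive version I'm running, but I kept experimenting with my original query and got it to run by adding unix_timestamp(Purch_date,'MM-dd-yyyy'). I've updated my original post with the solution.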

【Solution 2】:

Moving the answer out of the question:

I was able to get this working by changing my original code to:
Select Cust_ID, 
Purch_Date, 
Purch_Amount,
sum(Purch_Amount) over (partition by Cust_ID order by unix_timestamp(Purch_Date, 'MM-dd-yyyy') range between 604800 preceding and current row) as Rolling_Spend
from Spend_Table

The key was specifying the date format in the unix_timestamp call.
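
To illustrate how the format pattern and the window size interact, here is a minimal sketch built on customer 2's sample rows from the question. Several things in it are assumptions rather than part of the original posts: the dates are strings shaped like 1/9/19 (hence the pattern 'M/d/yy' instead of the 'MM-dd-yyyy' used above), Purch_Amount is a plain number without the dollar sign, a 6-day lookback (518400 seconds) is what is wanted, no daylight-saving shift falls inside the date range, and the CTE name sample_spend is made up for the example.

```sql
-- Hypothetical inline sample: customer 2's rows from the question,
-- with amounts stored as plain numbers.
WITH sample_spend AS (
  SELECT 2 AS Cust_ID, '1/1/19' AS Purch_Date, 12 AS Purch_Amount
  UNION ALL SELECT 2, '1/2/19', 22
  UNION ALL SELECT 2, '1/3/19', 32
  UNION ALL SELECT 2, '1/5/19', 42
  UNION ALL SELECT 2, '1/7/19', 52
  UNION ALL SELECT 2, '1/9/19', 62
  UNION ALL SELECT 2, '1/11/19', 72
)
SELECT Cust_ID,
       Purch_Date,
       Purch_Amount,
       sum(Purch_Amount) OVER (
         PARTITION BY Cust_ID
         -- Parse the string with an explicit pattern so the sort key
         -- is a number of epoch seconds (pattern is an assumption).
         ORDER BY unix_timestamp(Purch_Date, 'M/d/yy')
         -- 6 days * 24 h * 60 min * 60 s = 518400 seconds of lookback.
         RANGE BETWEEN 518400 PRECEDING AND CURRENT ROW
       ) AS Rolling_Spend
FROM sample_spend;
```

Under those assumptions, the frame for the 1/9/19 row spans 1/3/19 through 1/9/19, so the sum should be 32 + 42 + 52 + 62 = 188, the value shown for that row in the question's table. With 604800 PRECEDING the 1/2/19 purchase would also fall inside the frame, because the RANGE bound is inclusive and 1/2/19 is exactly 7 days earlier.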

【Discussion】: