【Posted】: 2019-07-31 02:17:37
【Problem description】:
I have a dataframe showing the send time and open time for each user:
val df = Seq(("user1", "2018-04-05 15:00:00", "2018-04-05 15:50:00"),
("user1", "2018-04-05 16:00:00", "2018-04-05 16:50:00"),
("user1", "2018-04-05 17:00:00", "2018-04-05 17:50:00"),
("user1", "2018-04-05 18:00:00", "2018-04-05 18:50:00"),
("user2", "2018-04-05 15:00:00", "2018-04-05 15:50:00"),
("user2", "2018-04-05 16:00:00", "2018-04-05 16:50:00"),
("user2", "2018-04-05 17:00:00", "2018-04-05 17:50:00"),
("user2", "2018-04-05 17:30:00", "2018-04-05 17:40:00"),
("user2", "2018-04-05 18:00:00", null),
("user2", "2018-04-05 19:00:00", null)
).toDF("id", "sendTime", "openTime")
+-----+-------------------+-------------------+
| id| sendTime| openTime|
+-----+-------------------+-------------------+
|user1|2018-04-05 15:00:00|2018-04-05 15:50:00|
|user1|2018-04-05 16:00:00|2018-04-05 16:50:00|
|user1|2018-04-05 17:00:00|2018-04-05 17:50:00|
|user1|2018-04-05 18:00:00|2018-04-05 18:50:00|
|user2|2018-04-05 15:00:00|2018-04-05 15:50:00|
|user2|2018-04-05 16:00:00|2018-04-05 16:50:00|
|user2|2018-04-05 17:00:00|2018-04-05 17:50:00|
|user2|2018-04-05 17:30:00|2018-04-05 17:40:00|
|user2|2018-04-05 18:00:00| null|
|user2|2018-04-05 19:00:00| null|
+-----+-------------------+-------------------+
Now, for each user, I want to count how many opens occurred within the two hours before each send time. I used a window function partitioned by user, but I don't know how to compare values in the sendTime column against values in the openTime column. The resulting dataframe should look like this:
+-----+-------------------+-------------------+-----+
| id| sendTime| openTime|count|
+-----+-------------------+-------------------+-----+
|user1|2018-04-05 15:00:00|2018-04-05 15:50:00| 0|
|user1|2018-04-05 16:00:00|2018-04-05 16:50:00| 1|
|user1|2018-04-05 17:00:00|2018-04-05 17:50:00| 2|
|user1|2018-04-05 18:00:00|2018-04-05 18:50:00| 2|
|user2|2018-04-05 15:00:00|2018-04-05 15:50:00| 0|
|user2|2018-04-05 16:00:00|2018-04-05 16:50:00| 1|
|user2|2018-04-05 17:00:00|2018-04-05 17:50:00| 2|
|user2|2018-04-05 17:30:00|2018-04-05 17:40:00| 2|
|user2|2018-04-05 18:00:00| null| 3|
|user2|2018-04-05 19:00:00| null| 2|
+-----+-------------------+-------------------+-----+
This is as far as I got, but it doesn't give me what I need:
var df2 = df.withColumn("sendUnix", F.unix_timestamp($"sendTime")).withColumn("openUnix", F.unix_timestamp($"openTime"))
val w = Window.partitionBy($"id").orderBy($"sendUnix").rangeBetween(-2*60*60, 0)
df2 = df2.withColumn("count", F.count($"openUnix").over(w))
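A note on why this appears to fall short: rangeBetween measures its range on the ordering column, so the window above counts rows whose sendUnix lies within the last two hours, not opens whose openUnix does. Below is a minimal sketch of one possible workaround, not necessarily the intended answer: split the open events into their own dataframe, left-join them back per user, and count opens falling in the interval (sendTime - 2h, sendTime]. It assumes the same df and spark.implicits._ as above; the names opens and result and the interval boundaries are my own assumptions.

import org.apache.spark.sql.functions._

// One row per actual open event, keyed by user (null openTimes dropped).
val opens = df
  .filter($"openTime".isNotNull)
  .select($"id".as("oid"), unix_timestamp($"openTime").as("openUnix"))

// Left-join each send to the same user's opens inside the two-hour window,
// then count matches per send. count() ignores the nulls the left join
// produces for sends with no matching open, so those sends get 0.
// Assumes (id, sendTime, openTime) uniquely identifies a send, as in the sample.
val result = df
  .withColumn("sendUnix", unix_timestamp($"sendTime"))
  .join(opens,
    $"id" === $"oid" &&
      $"openUnix" > $"sendUnix" - 2 * 60 * 60 &&
      $"openUnix" <= $"sendUnix",
    "left")
  .groupBy($"id", $"sendTime", $"openTime")
  .agg(count($"openUnix").as("count"))
  .orderBy($"id", $"sendTime")

result.show()

On Spark 2.4+ one could instead stay within window functions: collect_list all of a user's openUnix values over an unbounded window, then count the qualifying ones with the filter higher-order function via expr.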
Tags: scala apache-spark apache-spark-sql