【Question title】: Query to find records that were created one after another in BigQuery
【Posted】: 2019-07-22 21:02:33
【Question】:

I'm playing around with BigQuery. Given the following input:

+---------------+---------+---------+--------+----------------------+
|   customer    |  agent  |  value  |  city  |   timestamp          |
+---------------+---------+---------+--------+----------------------+
| 1             | 1       |  106    | LA     |  2019-02-12 03:05pm  |
| 1             | 1       |  251    | LA     |  2019-02-12 03:06pm  |
| 3             | 2       |  309    | NY     |  2019-02-12 06:41pm  |
| 1             | 1       |  654    | LA     |  2019-02-12 05:12pm  |
+---------------+---------+---------+--------+----------------------+

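I want to find transactions that were issued by the same agent one after another (say, within 5 minutes). So the output for the table above should look like this: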

+---------------+---------+---------+--------+----------------------+
|   customer    |  agent  |  value  |  city  |   timestamp          |
+---------------+---------+---------+--------+----------------------+
| 1             | 1       |  106    | LA     |  2019-02-12 03:05pm  |
| 1             | 1       |  251    | LA     |  2019-02-12 03:06pm  |
+---------------+---------+---------+--------+----------------------+

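The query should somehow group by agent and find such transactions. However, as you can see from the output, the result isn't really grouped. My first idea was to use the LEAD function, but I'm not sure. Do you have any ideas?

Idea for the query:

  • Sort by agent and timestamp DESC
  • Start with the first row and look at the next row (using LEAD?)
  • Check whether the timestamp difference is less than 5 minutes
  • If so, both rows should be in the output
  • Continue with the next (2nd) row

When rows 2 and 3 also meet the condition, row 2 would go into the output a second time, which leads to duplicate rows. I'm not sure yet how to avoid this.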
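
One possible way to express the steps above without producing duplicates is to test each row once against both of its neighbours, instead of emitting pairs. The following is only a sketch under assumptions: the CTE names `sample` and `neighbours` and the column names `ts`, `prev_ts` and `next_ts` are made up for illustration, and the sample rows are typed in with 24-hour timestamps rather than read from a real table.

```sql
#standardSQL
-- Sketch: keep a row when its previous OR next transaction by the same agent
-- is less than 5 minutes away. Every row is tested exactly once, so a row in
-- the middle of a chain cannot appear twice.
WITH sample AS (  -- hypothetical inline copy of the question's sample data
  SELECT 1 AS customer, 1 AS agent, 106 AS value, 'LA' AS city,
         TIMESTAMP '2019-02-12 15:05:00' AS ts UNION ALL
  SELECT 1, 1, 251, 'LA', TIMESTAMP '2019-02-12 15:06:00' UNION ALL
  SELECT 3, 2, 309, 'NY', TIMESTAMP '2019-02-12 18:41:00' UNION ALL
  SELECT 1, 1, 654, 'LA', TIMESTAMP '2019-02-12 17:12:00'
), neighbours AS (
  SELECT *,
    LAG(ts)  OVER (PARTITION BY agent ORDER BY ts) AS prev_ts,  -- NULL for an agent's first row
    LEAD(ts) OVER (PARTITION BY agent ORDER BY ts) AS next_ts   -- NULL for an agent's last row
  FROM sample
)
SELECT customer, agent, value, city, ts
FROM neighbours
WHERE TIMESTAMP_DIFF(ts, prev_ts, MINUTE) < 5   -- a NULL comparison does not match
   OR TIMESTAMP_DIFF(next_ts, ts, MINUTE) < 5   -- a NULL comparison does not match
ORDER BY agent, ts
```

For the sample data this should keep only the 03:05pm and 03:06pm rows of agent 1. If the customer also has to match, as mentioned in the discussion, `customer` could be added to both `PARTITION BY` clauses.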

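【Discussion】:

  • I may be misunderstanding the data, but couldn't you simply sort by agent and timestamp?
  • Yes, that could be the first step. After sorting, you'd have to look at the first row, then the second row, and check whether the timestamp difference is less than 5 minutes and whether the customer is the same. This should be repeated for all rows.

Tags: google-bigquery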


【Solution 1】:

There must be a simpler way, but does this achieve what you're after?

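WITH CTE AS (
  -- Assumption: the first CTE is missing from this copy of the answer;
  -- `project.dataset.yourtable` is a placeholder for the source table.
  SELECT customer, agent, value, city, timestamp
  FROM `project.dataset.yourtable`
),
CTE2 AS (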
SELECT customer, agent, value, city, timestamp,
  lead(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lead,
  lead(customer,1) OVER (PARTITION BY agent ORDER BY timestamp) customer_lead,
  lead(value,1) OVER (PARTITION BY agent ORDER BY timestamp) value_lead,
  lead(city,1) OVER (PARTITION BY agent ORDER BY timestamp) city_lead,
  lag(timestamp,1) OVER (PARTITION BY agent ORDER BY timestamp) timestamp_lag
FROM CTE
)

SELECT agent, 
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(customer as string),', ',cast(customer_lead as string)),cast(customer as string)) customer, 
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(value as string),', ',cast(value_lead as string)),cast(value as string)) value,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(city as string),', ',cast(city_lead as string)),cast(city as string)) cities,
  if(timestamp_diff(timestamp_lead,timestamp,MINUTE)<5, concat(cast(timestamp as string),', ',cast(timestamp_lead as string)),cast(timestamp as string)) timestamps
FROM CTE2
WHERE (timestamp_diff(timestamp_lead,timestamp,MINUTE)<5 OR NOT timestamp_diff(timestamp,timestamp_lag,MINUTE)<5)

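【Discussion】:

  • This looks really good! Thanks! The concatenation really makes sense. However, it would be nicer to have a column for customer and customer_lead.
  • Sure. When timestamp_diff(timestamp_lead,timestamp,MINUTE)

【Solution 2】: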

The following is for BigQuery Standard SQL:

#standardSQL
SELECT * FROM (
  SELECT *, 
    IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5, 
      LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts), 
    NULL).* 
  FROM `project.dataset.yourtable`
)
WHERE NOT next_customer IS NULL

You can test this with the sample data from your question, as in the following example:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 customer, 1 agent, 106 value,'LA' city, '2019-02-12 03:05pm' ts UNION ALL
  SELECT 1, 1, 251,'LA', '2019-02-12 03:06pm' UNION ALL
  SELECT 3, 2, 309,'NY', '2019-02-12 06:41pm' UNION ALL
  SELECT 1, 1, 654,'LA', '2019-02-12 05:12pm' 
), temp AS (
  SELECT customer, agent, value, city, PARSE_TIMESTAMP('%Y-%m-%d %I:%M%p', ts) ts 
  FROM `project.dataset.table`
)
SELECT * FROM (
  SELECT *, 
    IF(TIMESTAMP_DIFF(LEAD(ts) OVER(PARTITION BY agent ORDER BY ts), ts, MINUTE) < 5, 
      LEAD(STRUCT(customer AS next_customer, value AS next_value)) OVER(PARTITION BY agent ORDER BY ts), 
    NULL).* 
  FROM temp
)
WHERE NOT next_customer IS NULL
-- ORDER BY ts

Result:

Row customer    agent   value   city    ts                      next_customer   next_value   
1   1           1       106     LA      2019-02-12 15:05:00 UTC 1               251  
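
The result above has one row per pair, so a chain of three or more quick transactions would show up as overlapping pairs. If the original rows are wanted instead, grouped per agent as the question asks, one option is a "gaps and islands" style query that numbers each burst of transactions. This is a sketch under assumptions rather than part of the answer: the names `sample`, `marked`, `numbered`, `sized`, `starts_burst`, `burst_id` and `burst_size` are all hypothetical.

```sql
#standardSQL
-- Sketch: give consecutive transactions of one agent the same burst_id as long
-- as each is less than 5 minutes after the previous one, then keep bursts that
-- contain more than one row.
WITH sample AS (  -- hypothetical inline copy of the question's sample data
  SELECT 1 AS customer, 1 AS agent, 106 AS value, 'LA' AS city,
         TIMESTAMP '2019-02-12 15:05:00' AS ts UNION ALL
  SELECT 1, 1, 251, 'LA', TIMESTAMP '2019-02-12 15:06:00' UNION ALL
  SELECT 3, 2, 309, 'NY', TIMESTAMP '2019-02-12 18:41:00' UNION ALL
  SELECT 1, 1, 654, 'LA', TIMESTAMP '2019-02-12 17:12:00'
), marked AS (
  SELECT *,
    -- 1 when the row starts a new burst: no previous row for this agent
    -- (the comparison is NULL, so IF falls through to 1) or a gap of 5+ minutes
    IF(TIMESTAMP_DIFF(ts, LAG(ts) OVER (PARTITION BY agent ORDER BY ts), MINUTE) < 5,
       0, 1) AS starts_burst
  FROM sample
), numbered AS (
  SELECT * EXCEPT (starts_burst),
    -- running total of burst starts = burst number within the agent
    SUM(starts_burst) OVER (PARTITION BY agent ORDER BY ts) AS burst_id
  FROM marked
), sized AS (
  SELECT *,
    COUNT(*) OVER (PARTITION BY agent, burst_id) AS burst_size
  FROM numbered
)
SELECT customer, agent, value, city, ts, burst_id
FROM sized
WHERE burst_size > 1   -- drop transactions that have no close neighbour
ORDER BY agent, ts
```

Note that bursts are chained: each row only has to be within 5 minutes of the previous one, so a long burst can span more than 5 minutes in total.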

