[Question Title]: How to count (non) consecutive records by day in BigQuery?
[Posted]: 2019-03-26 14:26:44
[Question]:

Every malfunction of a device is logged. Each entry contains a customer_id, a device_id, and a timestamp:

+-------------+-----------+-----------------------+
| customer_id | device_id |  timestamp            |
+-------------+-----------+-----------------------+
| 1           | 1         |  2019-02-12T01:00:00  |
| 2           | 2         |  2019-02-12T01:00:00  |
| 1           | 1         |  2019-02-12T02:00:00  |
| 1           | 1         |  2019-02-12T03:00:00  |
+-------------+-----------+-----------------------+

Malfunction logs are collected once per hour. I am interested in the following information:

  • the total number of malfunctions per customer per day
  • the number of consecutive malfunctions per customer per day
  • the number of non-consecutive malfunctions per customer per day

A device may malfunction for hours on end, which can indicate a hardware fault. On the other hand, if a device malfunctions for no more than a few hours, it is probably just being used incorrectly.

The result should look like this:

+-------------+-----------+-------+-------------+-----------------+------------+----------------------+
| customer_id | device_id | total | consecutive | non consecutive | day        | last_recording       |
+-------------+-----------+-------+-------------+-----------------+------------+----------------------+
| 1           | 1         | 3     | 1           | 2               | 2019-02-12 | 2019-02-12T03:00:00  |
| 2           | 2         | 1     | 0           | 1               | 2019-02-12 | 2019-02-12T01:00:00  |
+-------------+-----------+-------+-------------+-----------------+------------+----------------------+

In the example above, device 1 reported a malfunction at 2019-02-12T02:00:00, which is considered "non-consecutive", and then another malfunction at 2019-02-12T03:00:00, which is considered "consecutive".

I want to create a query that produces such a result. What I have tried:

SELECT customer_id, device_id, COUNT(customer_id) AS count, FORMAT_TIMESTAMP("%Y-%m-%d", TIMESTAMP(timestamp)) as day
FROM `malfunctions`
GROUP BY day, customer_id, device_id

This gives me the total number of malfunctions per customer per day. I think I have to use the LEAD operator to get the (non) consecutive counts, but I am not sure how. Any ideas? The result should "roll over" by day.

[Comments]:

Tags: google-bigquery


[Solution 1]:

Below is for BigQuery Standard SQL

    #standardSQL
    SELECT customer_id, device_id, day, SUM(batch_count) total, 
      SUM(batch_count) - COUNTIF(batch_count = 1) consecutive,
      COUNTIF(batch_count = 1) non_consecutive, 
      ARRAY_AGG(STRUCT(batch AS batch, batch_count AS batch_count, first_recording AS first_recording, last_recording AS last_recording)) details
    FROM (
      SELECT customer_id, device_id, day, batch, 
        COUNT(1) batch_count,
        MIN(ts) first_recording,
        MAX(ts) last_recording
      FROM (
        SELECT customer_id, device_id, ts, day,
          COUNTIF(gap) OVER(PARTITION BY customer_id, device_id, day ORDER BY  ts) batch
        FROM (
          SELECT customer_id, device_id, ts, DATE(ts) day,
            IFNULL(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY customer_id, device_id, DATE(ts) ORDER BY  ts), HOUR), 777) > 1 gap
          FROM `project.dataset.malfunctions`
        )
      )
      GROUP BY customer_id, device_id, day, batch
    )
    GROUP BY customer_id, device_id, day
    
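Reading the query inside-out may help: the innermost SELECT uses LAG and TIMESTAMP_DIFF to flag a `gap` whenever the previous recording in the same (customer, device, day) partition is more than an hour away, and the running `COUNTIF(gap) OVER (... ORDER BY ts)` turns those flags into a batch number. A minimal Python sketch of just those two steps (the `hours` list is illustrative data, not taken from the question):

```python
# Hour-of-day of each recording for one (customer, device, day)
# partition, already sorted -- mirrors ORDER BY ts.
hours = [1, 2, 3, 4, 9, 10]

# gap flag: first row always starts a batch (the IFNULL(..., 777) trick);
# later rows start one only when more than 1 hour passed since the previous.
gaps = [True] + [(b - a) > 1 for a, b in zip(hours, hours[1:])]

# running sum of gap flags = COUNTIF(gap) OVER (... ORDER BY ts)
batches = []
n = 0
for g in gaps:
    n += g
    batches.append(n)

print(gaps)     # [True, False, False, False, True, False]
print(batches)  # [1, 1, 1, 1, 2, 2] -> two batches, sizes 4 and 2
```

Rows sharing a batch number are then grouped, and any batch of size 1 counts as non-consecutive.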

You can test and play with the above using dummy data, as in the example below

    #standardSQL
    WITH `project.dataset.malfunctions` AS (
      SELECT 1 customer_id, 1 device_id, TIMESTAMP '2019-02-12T01:00:00' ts UNION ALL
      SELECT 1, 1, '2019-02-12T02:00:00' UNION ALL
      SELECT 1, 1, '2019-02-12T03:00:00' UNION ALL
      SELECT 1, 1, '2019-02-12T04:00:00' UNION ALL
      SELECT 1, 1, '2019-02-12T09:00:00' UNION ALL
      SELECT 1, 1, '2019-02-12T10:00:00' UNION ALL
      SELECT 1, 1, '2019-02-13T03:00:00' UNION ALL
      SELECT 2, 2, '2019-02-12T01:00:00' 
    )
    SELECT customer_id, device_id, day, SUM(batch_count) total, 
      SUM(batch_count) - COUNTIF(batch_count = 1) consecutive,
      COUNTIF(batch_count = 1) non_consecutive, 
      ARRAY_AGG(STRUCT(batch AS batch, batch_count AS batch_count, first_recording AS first_recording, last_recording AS last_recording)) details
    FROM (
      SELECT customer_id, device_id, day, batch, 
        COUNT(1) batch_count,
        MIN(ts) first_recording,
        MAX(ts) last_recording
      FROM (
        SELECT customer_id, device_id, ts, day,
          COUNTIF(gap) OVER(PARTITION BY customer_id, device_id, day ORDER BY  ts) batch
        FROM (
          SELECT customer_id, device_id, ts, DATE(ts) day,
            IFNULL(TIMESTAMP_DIFF(ts, LAG(ts) OVER(PARTITION BY customer_id, device_id, DATE(ts) ORDER BY  ts), HOUR), 777) > 1 gap
          FROM `project.dataset.malfunctions`
        )
      )
      GROUP BY customer_id, device_id, day, batch
    )
    GROUP BY customer_id, device_id, day
    -- ORDER BY customer_id, device_id, day
    

Result
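The per-day aggregates that the query produces for the dummy data can be reproduced with a small Python stand-in (the function name `summarize` and the tuple layout are illustrative, not part of the answer), which makes the expected numbers easy to check:

```python
from datetime import datetime
from collections import defaultdict

# Same dummy rows as in the WITH clause above.
rows = [
    (1, 1, "2019-02-12T01:00:00"),
    (1, 1, "2019-02-12T02:00:00"),
    (1, 1, "2019-02-12T03:00:00"),
    (1, 1, "2019-02-12T04:00:00"),
    (1, 1, "2019-02-12T09:00:00"),
    (1, 1, "2019-02-12T10:00:00"),
    (1, 1, "2019-02-13T03:00:00"),
    (2, 2, "2019-02-12T01:00:00"),
]

def summarize(rows):
    """Return {(customer, device, day): (total, consecutive, non_consecutive)}."""
    # group by (customer, device, day), mirroring the PARTITION BY keys
    groups = defaultdict(list)
    for cust, dev, ts in rows:
        t = datetime.fromisoformat(ts)
        groups[(cust, dev, t.date().isoformat())].append(t)
    result = {}
    for key, stamps in groups.items():
        stamps.sort()
        # batch sizes: a new batch starts when the gap to the previous
        # timestamp exceeds one hour (the LAG + TIMESTAMP_DIFF test)
        batch_counts = []
        prev = None
        for t in stamps:
            if prev is None or (t - prev).total_seconds() > 3600:
                batch_counts.append(1)   # new batch
            else:
                batch_counts[-1] += 1    # extends the current batch
            prev = t
        total = sum(batch_counts)
        non_consecutive = sum(1 for c in batch_counts if c == 1)
        result[key] = (total, total - non_consecutive, non_consecutive)
    return result

for key, counts in sorted(summarize(rows).items()):
    print(key, counts)
# (1, 1, '2019-02-12') (6, 6, 0)
# (1, 1, '2019-02-13') (1, 0, 1)
# (2, 2, '2019-02-12') (1, 0, 1)
```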

[Discussion]:
