[Question title]: Aggregate time series data by duration in BigQuery
[Posted]: 2018-06-20 13:58:46
[Question]:

I am trying to migrate InfluxDB queries to Google Cloud BigQuery.

InfluxDB is a time series database, so aggregating by time interval is very easy. Given this dataset:

name: h2o_feet
--------------
time                   water_level   location
2015-08-18T00:00:00Z   8.12          coyote_creek
2015-08-18T00:00:00Z   2.064         santa_monica
2015-08-18T00:06:00Z   8.005         coyote_creek
2015-08-18T00:06:00Z   2.116         santa_monica
2015-08-18T00:12:00Z   7.887         coyote_creek
2015-08-18T00:12:00Z   2.028         santa_monica
2015-08-18T00:18:00Z   7.762         coyote_creek
2015-08-18T00:18:00Z   2.126         santa_monica
2015-08-18T00:24:00Z   7.635         coyote_creek
2015-08-18T00:24:00Z   2.041         santa_monica
2015-08-18T00:30:00Z   7.5           coyote_creek
2015-08-18T00:30:00Z   2.051         santa_monica

The following query groups the results into 12-minute intervals:

SELECT COUNT("water_level") FROM "h2o_feet" WHERE "location"='coyote_creek' AND time >= '2015-08-18T00:00:00Z' AND time <= '2015-08-18T00:30:00Z' GROUP BY time(12m)

name: h2o_feet
--------------
time                   count
2015-08-18T00:00:00Z   2
2015-08-18T00:12:00Z   2
2015-08-18T00:24:00Z   2

Does anyone know whether BigQuery has a direct equivalent of the GROUP BY time(12m) clause?

Laurent

[Question comments]:

    Tags: google-bigquery time-series aggregate


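    [Answer 1]:

    There is no direct equivalent in BigQuery, but you can submit a feature request in the Issue Tracker.

    In the meantime, below are the workarounds I can think of.

    Option 1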

    #standardSQL
    SELECT MIN(time) time, COUNT(1) cnt
    FROM `project.dataset.h2o_feet`
    WHERE location = 'coyote_creek' 
    AND time BETWEEN '2015-08-18T00:00:00' AND '2015-08-18T00:30:00'
    GROUP BY DIV(DATETIME_DIFF(time, '2015-08-18T00:00:00', MINUTE), 12)
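
    Note that MIN(time) in Option 1 returns the earliest row actually present in each bucket. That equals the bucket boundary only when a sample falls exactly on the boundary, as it does in the sample data. If you need the boundary itself, one possible variation is to turn the bucket index back into a DATETIME. This is a minimal sketch, assuming the same table and the same hard-coded range start as above; the names offset_minutes and bucket_start are made up for illustration.

    ```sql
    #standardSQL
    SELECT
      -- Convert the floored minute offset back into the bucket's start time
      DATETIME_ADD(DATETIME '2015-08-18T00:00:00', INTERVAL offset_minutes MINUTE) AS bucket_start,
      COUNT(1) AS cnt
    FROM (
      -- Minutes since the range start, floored to a multiple of 12
      SELECT 12 * DIV(DATETIME_DIFF(time, DATETIME '2015-08-18T00:00:00', MINUTE), 12) AS offset_minutes
      FROM `project.dataset.h2o_feet`
      WHERE location = 'coyote_creek'
        AND time BETWEEN '2015-08-18T00:00:00' AND '2015-08-18T00:30:00'
    )
    GROUP BY bucket_start
    ORDER BY bucket_start
    ```

    The explicit ORDER BY is there because BigQuery does not guarantee row order without one.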
    

    Option 2

    A more verbose version (I am not sure why I would use it over the first option, but it may be useful for experimenting with the code)

    #standardSQL
    WITH start_finish AS (
      SELECT DATETIME '2015-08-18T00:00:00' start, DATETIME '2015-08-18T00:30:00' finish, DATETIME '2000-01-01T00:00:00' base
    ), intervals AS (
      SELECT pos1, pos2,
        DATETIME_ADD(base, INTERVAL start_interval MINUTE) start,
        DATETIME_ADD(base, INTERVAL finish_interval MINUTE) finish
      FROM (
        SELECT DATETIME_DIFF(start, base, MINUTE) start,
          DATETIME_DIFF(finish, base, MINUTE) finish,
          base
        FROM start_finish
      ), UNNEST(GENERATE_ARRAY(start, finish, 12)) start_interval WITH OFFSET pos1,
      UNNEST(GENERATE_ARRAY(start, finish + 12, 12)) finish_interval WITH OFFSET pos2
      WHERE pos1 = pos2 - 1 
    )
    SELECT start, COUNT(1) cnt
    FROM `project.dataset.h2o_feet`
    JOIN intervals
    ON time >= start AND time < finish
    WHERE location = 'coyote_creek' 
    GROUP BY start
    

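    In the start_finish CTE, you only need to set the start and finish times; the rest is done by the query.

    You can test / play with the above using the dummy data from the question, as below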

    #standardSQL
    WITH `project.dataset.h2o_feet` AS (
      SELECT DATETIME '2015-08-18T00:00:00' time, 8.12 water_level, 'coyote_creek' location UNION ALL
      SELECT DATETIME '2015-08-18T00:00:00', 2.064, 'santa_monica' UNION ALL
      SELECT DATETIME '2015-08-18T00:06:00', 8.005, 'coyote_creek' UNION ALL
      SELECT DATETIME '2015-08-18T00:06:00', 2.116, 'santa_monica' UNION ALL
      SELECT DATETIME '2015-08-18T00:12:00', 7.887, 'coyote_creek' UNION ALL
      SELECT DATETIME '2015-08-18T00:12:00', 2.028, 'santa_monica' UNION ALL
      SELECT DATETIME '2015-08-18T00:18:00', 7.762, 'coyote_creek' UNION ALL
      SELECT DATETIME '2015-08-18T00:18:00', 2.126, 'santa_monica' UNION ALL
      SELECT DATETIME '2015-08-18T00:24:00', 7.635, 'coyote_creek' UNION ALL
      SELECT DATETIME '2015-08-18T00:24:00', 2.041, 'santa_monica' UNION ALL
      SELECT DATETIME '2015-08-18T00:30:00', 7.5, 'coyote_creek' UNION ALL
      SELECT DATETIME '2015-08-18T00:30:00', 2.051, 'santa_monica' 
    ), start_finish AS (
      SELECT DATETIME '2015-08-18T00:00:00' start, DATETIME '2015-08-18T00:30:00' finish, DATETIME '2000-01-01T00:00:00' base
    ), intervals AS (
      SELECT pos1, pos2,
        DATETIME_ADD(base, INTERVAL start_interval MINUTE) start,
        DATETIME_ADD(base, INTERVAL finish_interval MINUTE) finish
      FROM (
        SELECT DATETIME_DIFF(start, base, MINUTE) start,
          DATETIME_DIFF(finish, base, MINUTE) finish,
          base
        FROM start_finish
      ), UNNEST(GENERATE_ARRAY(start, finish, 12)) start_interval WITH OFFSET pos1,
      UNNEST(GENERATE_ARRAY(start, finish + 12, 12)) finish_interval WITH OFFSET pos2
      WHERE pos1 = pos2 - 1 
    )
    SELECT start, COUNT(1) cnt
    FROM `project.dataset.h2o_feet`
    JOIN intervals
    ON time >= start AND time < finish
    WHERE location = 'coyote_creek' 
    GROUP BY start
    -- ORDER BY start  
    

    Both versions produce the following result (Option 1 names the first column time rather than start)

    Row     start                   cnt  
    1       2015-08-18T00:00:00     2    
    2       2015-08-18T00:12:00     2    
    3       2015-08-18T00:24:00     2    
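
    One possible reason to prefer an interval-table approach like Option 2: Option 1 only returns buckets that contain at least one row. The sketch below assumes you want empty 12-minute buckets reported with a count of 0. It generates the bucket offsets with GENERATE_ARRAY and LEFT JOINs the data onto them using an equality key. The CTE and column names (params, intervals, bucketed, offset_minutes, bucket_start) are illustrative, not part of the original answer.

    ```sql
    #standardSQL
    WITH params AS (
      SELECT DATETIME '2015-08-18T00:00:00' AS start, DATETIME '2015-08-18T00:30:00' AS finish
    ), intervals AS (
      -- One row per bucket: offsets 0, 12, 24, ... minutes from the range start
      SELECT offset_minutes,
        DATETIME_ADD(start, INTERVAL offset_minutes MINUTE) AS bucket_start
      FROM params,
        UNNEST(GENERATE_ARRAY(0, DATETIME_DIFF(finish, start, MINUTE), 12)) AS offset_minutes
    ), bucketed AS (
      -- Each data row tagged with the offset of the bucket it falls into
      SELECT 12 * DIV(DATETIME_DIFF(time, start, MINUTE), 12) AS offset_minutes
      FROM `project.dataset.h2o_feet`, params
      WHERE location = 'coyote_creek'
        AND time BETWEEN start AND finish
    )
    SELECT i.bucket_start, COUNT(b.offset_minutes) AS cnt  -- counts only matched rows, so empty buckets give 0
    FROM intervals AS i
    LEFT JOIN bucketed AS b
      ON b.offset_minutes = i.offset_minutes
    GROUP BY i.bucket_start
    ORDER BY i.bucket_start
    ```

    With the sample data every bucket has rows, so the counts match the table above; the difference only shows up when a bucket has no samples.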
    

    Option 3 (a silly one, but it makes the query look similar to GROUP BY time(12m) and the original query from the question)

    #standardSQL
    CREATE TEMP FUNCTION duration(time DATETIME) AS ((
      DIV(DATETIME_DIFF(time, '2015-08-18T00:00:00', MINUTE), 12)
    ));
    SELECT MIN(time) time, COUNT(1) cnt
    FROM `project.dataset.h2o_feet`
    WHERE location = 'coyote_creek' 
    AND time BETWEEN '2015-08-18T00:00:00' AND '2015-08-18T00:30:00'
    GROUP BY duration(time)
    ORDER BY time
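
    If hard-coding the start timestamp inside the function is inconvenient, a possible variation is to align buckets to the Unix epoch and pass the bucket size as an argument. This is only a sketch: time_bucket and its parameter names are hypothetical, and it assumes the DATETIME values represent UTC and are not earlier than 1970-01-01 (DIV truncates toward zero, so earlier values would be bucketed differently).

    ```sql
    #standardSQL
    -- Hypothetical helper: floors a DATETIME (treated as UTC) to the start of its N-minute bucket
    CREATE TEMP FUNCTION time_bucket(t DATETIME, bucket_minutes INT64) AS (
      DATETIME(TIMESTAMP_SECONDS(
        bucket_minutes * 60 * DIV(UNIX_SECONDS(TIMESTAMP(t)), bucket_minutes * 60)
      ))
    );
    SELECT time_bucket(time, 12) AS bucket_start, COUNT(1) AS cnt
    FROM `project.dataset.h2o_feet`
    WHERE location = 'coyote_creek'
      AND time BETWEEN '2015-08-18T00:00:00' AND '2015-08-18T00:30:00'
    GROUP BY bucket_start
    ORDER BY bucket_start
    ```

    For the sample data the boundaries come out the same as in the other options, because midnight UTC is a whole multiple of 12 minutes from the epoch. For a range that does not start on such a multiple, epoch-aligned buckets will differ from buckets anchored at the range start.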
    

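    [Comments]:

    • Thank you very much for taking the time, Mikhail. Option 1 works like a charm, and I learned a lot from the other options. Thanks also for the BigQuery Mate extension!
    • Also, glad to hear you are using BQ Mate - thank you! :o)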