【Question title】: How to fill missing values in a certain time interval
【Posted】: 2020-12-03 07:34:00
【Question description】:

I have a table in the following format:

user  timestamp              count  total_count

xyz   01-01-2020 00:12:00    45        45
xyz   01-01-2020 00:27:00    12        57
xyz   01-01-2020 00:29:00    11        68
xyz   01-01-2020 00:53:00    32        100

I want the data in 5-minute intervals, as shown below (expected output):

user  timestamp              count  total_count

xyz   01-01-2020 00:05:00    0         0
xyz   01-01-2020 00:10:00    0         0
xyz   01-01-2020 00:15:00    45        45
xyz   01-01-2020 00:20:00    0         45
xyz   01-01-2020 00:25:00    0         45
xyz   01-01-2020 00:30:00    23        68
xyz   01-01-2020 00:35:00    0         68
xyz   01-01-2020 00:40:00    0         68
xyz   01-01-2020 00:45:00    0         68
xyz   01-01-2020 00:50:00    0         68
xyz   01-01-2020 00:55:00    32        100

I tried:

   SELECT
        TIMESTAMP_SECONDS(5*60 * DIV(UNIX_SECONDS(timestamp), 5*60)) timekey,
        SUM(count) AS count,
        MAX(total_count) as total_count
   FROM db.table
   WHERE
        timestamp BETWEEN {{ start_date }}
        AND {{ end_date }}
        AND user = {{ user_id }}
   GROUP BY
        timekey
   ORDER BY
        timekey

The query above returns:

user  timestamp              count  total_count

xyz   01-01-2020 00:15:00    45        45
xyz   01-01-2020 00:30:00    23        68
xyz   01-01-2020 00:55:00    32        100

How can I fill in the missing timestamps in the query above, with count set to zero and total_count set to the previous non-null value?

【Discussion】:

    Tags: sql google-bigquery


    【Solution 1】:

    Use GENERATE_TIMESTAMP_ARRAY() to generate the missing time buckets, then LEFT JOIN the table to them:

    SELECT ts,
           SUM(t.count) AS count,
           MAX(t.total_count) as total_count
    FROM UNNEST(GENERATE_TIMESTAMP_ARRAY( {{start_date}}, {{end_date}}, INTERVAL 5 minute)) ts LEFT JOIN
         db.table t
         ON t.timestamp >= ts AND
            t.timestamp < TIMESTAMP_ADD(ts, INTERVAL 5 minute) AND
            t.user = {{ user_id }}
    GROUP BY ts
    ORDER BY ts;
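
One thing to watch with the query above: because of the LEFT JOIN, a bucket with no matching rows yields NULL for both SUM and MAX, not 0 and the previous total. The following is a minimal sketch, not part of the original answer, that wraps the same query in a CTE (the name `buckets` is made up for illustration) and uses COALESCE together with LAST_VALUE(... IGNORE NULLS) to get the zero and carry-forward behaviour the question asks for. It assumes total_count never decreases over time for a given user, so the most recent non-null bucket maximum is the running total.

```sql
-- Sketch only: same template parameters and table as the answer above.
WITH buckets AS (
  SELECT ts,
         SUM(t.count) AS count,              -- NULL when the bucket is empty
         MAX(t.total_count) AS total_count   -- NULL when the bucket is empty
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY({{ start_date }}, {{ end_date }}, INTERVAL 5 MINUTE)) AS ts
  LEFT JOIN db.table t
    ON t.timestamp >= ts
   AND t.timestamp < TIMESTAMP_ADD(ts, INTERVAL 5 MINUTE)
   AND t.user = {{ user_id }}
  GROUP BY ts
)
SELECT ts,
       -- Empty buckets get a count of 0.
       COALESCE(count, 0) AS count,
       -- Carry the last non-null total forward; 0 before the first data row.
       COALESCE(
         LAST_VALUE(total_count IGNORE NULLS) OVER (
           ORDER BY ts
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ),
         0
       ) AS total_count
FROM buckets
ORDER BY ts;
```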
    

    If the table is partitioned and requires a filter on the partitioning column, you can modify the query slightly:

    SELECT ts,
           SUM(t.count) AS count,
           MAX(t.total_count) as total_count
    FROM UNNEST(GENERATE_TIMESTAMP_ARRAY( {{start_date}}, {{end_date}}, INTERVAL 5 minute)) ts LEFT JOIN
         (SELECT t.*
          FROM db.table t
          WHERE timestamp BETWEEN {{ start_date }} AND {{ end_date }}
         ) t
         ON t.timestamp >= ts AND
            t.timestamp < TIMESTAMP_ADD(ts, INTERVAL 5 minute) AND
            t.user = {{ user_id }}
    GROUP BY ts
    ORDER BY ts;
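
To check the approach against the sample rows in the question, here is a self-contained sketch with the data inlined as a CTE. The names `sample`, `buckets`, and `bucket_end` are illustrative, and the 00:00 to 00:50 range of bucket starts is an assumption chosen to match the expected output. Each bucket is labelled by its end time, since the expected output places the 00:12 row under 00:15; labelling by `ts` itself would shift every row five minutes earlier.

```sql
WITH sample AS (
  -- The four rows from the question, inlined so the query needs no table.
  SELECT 'xyz' AS user, TIMESTAMP '2020-01-01 00:12:00' AS timestamp, 45 AS count, 45 AS total_count UNION ALL
  SELECT 'xyz', TIMESTAMP '2020-01-01 00:27:00', 12, 57 UNION ALL
  SELECT 'xyz', TIMESTAMP '2020-01-01 00:29:00', 11, 68 UNION ALL
  SELECT 'xyz', TIMESTAMP '2020-01-01 00:53:00', 32, 100
),
buckets AS (
  SELECT TIMESTAMP_ADD(ts, INTERVAL 5 MINUTE) AS bucket_end,  -- label by end of interval
         SUM(s.count) AS count,
         MAX(s.total_count) AS total_count
  FROM UNNEST(GENERATE_TIMESTAMP_ARRAY(
         TIMESTAMP '2020-01-01 00:00:00',
         TIMESTAMP '2020-01-01 00:50:00',
         INTERVAL 5 MINUTE)) AS ts
  LEFT JOIN sample s
    ON s.timestamp >= ts
   AND s.timestamp < TIMESTAMP_ADD(ts, INTERVAL 5 MINUTE)
   AND s.user = 'xyz'
  GROUP BY ts
)
SELECT bucket_end,
       COALESCE(count, 0) AS count,
       COALESCE(
         LAST_VALUE(total_count IGNORE NULLS) OVER (
           ORDER BY bucket_end
           ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
         ),
         0
       ) AS total_count
FROM buckets
ORDER BY bucket_end;
```

Run in BigQuery standard SQL, this should return the eleven rows from 00:05 to 00:55 listed in the question's expected output. A row that falls exactly on a boundary (for example 00:15:00) lands in the bucket that starts there, so it would be labelled 00:20; whether that is the desired convention is not stated in the question.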
    

    【Comments】:

    • I have timestamp as the partition key, so I get the error: Cannot query over table 'db.table' without a filter over column(s) 'timestamp' that can be used for partition elimination
    • @Sociopath . . . Then add a WHERE clause, such as where timestamp >= {{start_date}} and timestamp <= {{end_date}}