[Question title]: Detect Outliers using BigQuery with Standard Deviation
[Posted]: 2018-10-25 07:33:02
[Question]:

I currently have a table in BigQuery that contains some outliers.

Sample table:

port - qty - datetime
--------------------------------
TCP1 - 13 - 2018/06/11 11:20:23
UDP2 - 15 - 2018/06/11 11:24:24
TCP3 - 14 - 2018/06/11 11:24:27
TCP1 - 2  - 2018/06/11 11:24:26 
UDP2 - 15 - 2018/06/11 11:35:32
TCP3 - 13 - 2018/06/11 11:45:23
TCP3 - 14 - 2018/06/11 11:54:22
TCP3 - 30 - 2018/06/11 11:55:33

I would like to use SQL and standard deviation to filter out the outliers on each port for 2018/06/11.

Expected result:

TCP1 - 2  - 2018/06/11 11:24:26
TCP3 - 30 - 2018/06/11 11:55:33

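I did some research and found that standard deviation can help identify outliers. However, I don't know how to write the SQL query that does this. Any help would be appreciated.

(This is the closest thread I could find on the topic: Using BigQuery to find outliers with standard deviation results combined with WHERE clause)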

[Comments]:

    Tags: statistics google-bigquery standard-deviation


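    [Solution 1]:

    The example below is for BigQuery Standard SQL. It computes the mean and standard deviation of qty per day and returns the rows that fall more than 1.5 standard deviations from that day's mean.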

    #standardSQL
    WITH stats AS (
      -- Per-day bounds: mean +/- 1.5 standard deviations of qty
      SELECT DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime)) dt,
        AVG(qty) - 1.5 * STDDEV(qty) down,
        AVG(qty) + 1.5 * STDDEV(qty) up
      FROM `project.dataset.table`
      GROUP BY dt
    )
    -- Keep only the rows whose qty falls outside that day's bounds
    SELECT port, qty, datetime 
    FROM `project.dataset.table`
    JOIN stats 
    ON dt = DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime))
    WHERE NOT qty BETWEEN down AND up  
    

    You can test the above with the dummy data from your question:

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT 'TCP1' port, 13 qty, '2018/06/11 11:20:23' datetime UNION ALL
      SELECT 'UDP2', 15, '2018/06/11 11:24:24' UNION ALL
      SELECT 'TCP3', 14, '2018/06/11 11:24:27' UNION ALL
      SELECT 'TCP1', 2 , '2018/06/11 11:24:26' UNION ALL 
      SELECT 'UDP2', 15, '2018/06/11 11:35:32' UNION ALL
      SELECT 'TCP3', 13, '2018/06/11 11:45:23' UNION ALL
      SELECT 'TCP3', 14, '2018/06/11 11:54:22' UNION ALL
      SELECT 'TCP3', 30, '2018/06/11 11:55:33' 
    ), stats AS (
      SELECT DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime)) dt,
        AVG(qty) - 1.5 * STDDEV(qty) down,
        AVG(qty) + 1.5 * STDDEV(qty) up
      FROM `project.dataset.table`
      GROUP BY dt
    )
    SELECT port, qty, datetime 
    FROM `project.dataset.table`
    JOIN stats 
    ON dt = DATE(PARSE_TIMESTAMP('%Y/%m/%d %T', datetime))
    WHERE NOT qty BETWEEN down AND up  
    

    The result is

    Row port    qty datetime     
    1   TCP1    2   2018/06/11 11:24:26  
    2   TCP3    30  2018/06/11 11:55:33  
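
To see why exactly these two rows come back, the arithmetic can be reproduced outside BigQuery. The following is only a minimal Python sketch of the same idea, not part of the original answer: the helper names `day_bounds` and `find_outliers` are made up for illustration, and it assumes the sample standard deviation (which is what BigQuery's `STDDEV` computes, as an alias of `STDDEV_SAMP`).

```python
from statistics import mean, stdev

# Sample rows from the question: (port, qty, datetime)
rows = [
    ("TCP1", 13, "2018/06/11 11:20:23"),
    ("UDP2", 15, "2018/06/11 11:24:24"),
    ("TCP3", 14, "2018/06/11 11:24:27"),
    ("TCP1", 2, "2018/06/11 11:24:26"),
    ("UDP2", 15, "2018/06/11 11:35:32"),
    ("TCP3", 13, "2018/06/11 11:45:23"),
    ("TCP3", 14, "2018/06/11 11:54:22"),
    ("TCP3", 30, "2018/06/11 11:55:33"),
]


def day_bounds(rows, k=1.5):
    """Hypothetical helper: map each day to (mean - k*sd, mean + k*sd) of qty."""
    qty_by_day = {}
    for _port, qty, ts in rows:
        day = ts.split(" ")[0]  # date part of 'YYYY/MM/DD HH:MM:SS'
        qty_by_day.setdefault(day, []).append(qty)
    bounds = {}
    for day, qtys in qty_by_day.items():
        # stdev() is the sample standard deviation; it needs at least two
        # values per day and raises StatisticsError otherwise.
        m, sd = mean(qtys), stdev(qtys)
        bounds[day] = (m - k * sd, m + k * sd)
    return bounds


def find_outliers(rows, k=1.5):
    """Hypothetical helper: rows whose qty lies outside that day's bounds."""
    bounds = day_bounds(rows, k)
    result = []
    for port, qty, ts in rows:
        low, high = bounds[ts.split(" ")[0]]
        if not (low <= qty <= high):  # same test as NOT qty BETWEEN down AND up
            result.append((port, qty, ts))
    return result


if __name__ == "__main__":
    print(day_bounds(rows))
    print(find_outliers(rows))
```

For the eight sample rows the mean is 14.5 and the sample standard deviation is about 7.58, so the bounds are roughly 3.13 and 25.87, and only qty 2 and qty 30 fall outside them. Note that, as in the query's `GROUP BY dt`, the bounds are computed per day across all ports rather than separately for each port.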
    

    [Discussion]:
