【Question title】: How to get the most frequent value in Google's BigQuery
【Posted】: 2019-05-08 19:43:38
【Question】:

Postgres has a simple function for this: with the mode() function we can find the most frequent value. Is there something similar in Google's BigQuery?

How would I write a query like this in BigQuery?

select count(*),
       avg(vehicles)                                         as mean,
       percentile_cont(0.5) within group (order by vehicles) as median,
       mode() within group (order by vehicles)               as most_frequent_value
FROM "driver"
WHERE vehicles is not null;

【Comments】:

    Tags: google-bigquery


    【Answer 1】:

    The following is for BigQuery Standard SQL.

    Option 1

    #standardSQL
    SELECT * FROM (
      SELECT COUNT(*) AS cnt,
        AVG(vehicles) AS mean,
        APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)].value AS most_frequent_value
      FROM `project.dataset.table`
      WHERE vehicles IS NOT NULL
    ) CROSS JOIN (
      SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
      FROM `project.dataset.table`
      WHERE vehicles IS NOT NULL
      LIMIT 1
    )
    

    Option 2

    #standardSQL
    SELECT * FROM (
      SELECT COUNT(*) cnt,
        AVG(vehicles) AS mean
      FROM `project.dataset.table`
      WHERE vehicles IS NOT NULL
    ) CROSS JOIN (
      SELECT PERCENTILE_CONT(vehicles, 0.5) OVER() AS median
      FROM `project.dataset.table`
      WHERE vehicles IS NOT NULL
      LIMIT 1
    ) CROSS JOIN (
      SELECT vehicles AS most_frequent_value
      FROM `project.dataset.table`
      WHERE vehicles IS NOT NULL
      GROUP BY vehicles
      ORDER BY COUNT(1) DESC
      LIMIT 1
    )  
    

    Option 3

    #standardSQL
    CREATE TEMP FUNCTION median(arr ANY TYPE) AS ((
      SELECT PERCENTILE_CONT(x, 0.5) OVER() 
      FROM UNNEST(arr) x LIMIT 1 
    ));
    CREATE TEMP FUNCTION most_frequent_value(arr ANY TYPE) AS ((
      SELECT x 
      FROM UNNEST(arr) x
      GROUP BY x
      ORDER BY COUNT(1) DESC
      LIMIT 1  
    ));
    SELECT COUNT(*) cnt,
      AVG(vehicles) AS mean,
      median(ARRAY_AGG(vehicles)) AS median,
      most_frequent_value(ARRAY_AGG(vehicles)) AS most_frequent_value
    FROM `project.dataset.table`
    WHERE vehicles IS NOT NULL   
    

    And so on …

    【Comments】:

      【Answer 2】:

      You can use APPROX_TOP_COUNT to get the top values, for example:

      SELECT APPROX_TOP_COUNT(vehicles, 5) AS top_five_vehicles
      FROM dataset.driver
      

      If you only want the top value, you can select it from the array:

      SELECT APPROX_TOP_COUNT(vehicles, 1)[OFFSET(0)] AS most_frequent_value
      FROM dataset.driver
      

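      【Comments】:

      • If you only want the value, append .value; the function returns structs with a value field and a count field.
      【Answer 3】:

      My preferred approach is to query from an array, because it makes the criteria for the mode easy to adjust. Below are two examples, one using the LIMIT method and one using the OFFSET method. With OFFSET you can fetch the Nth most (or least) frequent value.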

      WITH t AS (SELECT 18 AS length, 
      'HIGH' as amps, 
      99.95 price UNION ALL
      SELECT 18,  "HIGH", 99.95 UNION ALL
      SELECT 18,  "HIGH", 5.95 UNION ALL
      SELECT 18,  "LOW", 33.95 UNION ALL
      SELECT 18,  "LOW", 33.95 UNION ALL
      SELECT 18,  "LOW", 4.5 UNION ALL
      SELECT 3,  "HIGH", 77.95 UNION ALL
      SELECT 3,  "HIGH", 77.95 UNION ALL
      SELECT 3,  "HIGH", 9.99 UNION ALL
      SELECT 3,  "LOW", 44.95 UNION ALL
      SELECT 3,  "LOW", 44.95 UNION ALL
      SELECT 3,  "LOW", 5.65 
      )
      
      SELECT
      length,
      amps,
      
      -- By Limit
      (SELECT x FROM UNNEST(price_array) x 
          GROUP BY x ORDER BY COUNT(*) DESC LIMIT 1 ) most_freq_price,
      (SELECT x FROM UNNEST(price_array) x 
          GROUP BY x ORDER BY COUNT(*) ASC  LIMIT 1 ) least_freq_price,
      
      -- By Offset
      ARRAY((SELECT x FROM UNNEST(price_array) x 
          GROUP BY x ORDER BY COUNT(*) DESC))[OFFSET(0)] most_freq_price_offset,
      ARRAY((SELECT x FROM UNNEST(price_array) x 
          GROUP BY x ORDER BY COUNT(*) ASC))[OFFSET(0)] least_freq_price_offset
      
      FROM (
      SELECT 
          length,
          amps,
          ARRAY_AGG(price) price_array
      FROM t
      GROUP BY 1,2
      )
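
The by-offset idea (rank the distinct values by frequency, then pick position N) can also be sketched outside SQL, for instance inside a JavaScript function. In the sketch below, the name `nthMostFrequent` and the zero-based `n` are assumptions made for illustration; ties keep first-appearance order.

```javascript
// Hypothetical helper: rank distinct values by descending frequency and
// return the one at zero-based position n (undefined if out of range).
function nthMostFrequent(arr, n) {
  const counts = new Map();
  for (const v of arr) {
    counts.set(v, (counts.get(v) || 0) + 1);
  }
  // Array.prototype.sort is stable, so ties keep first-appearance order.
  const ranked = [...counts.entries()].sort((a, b) => b[1] - a[1]);
  const hit = ranked[n];
  return hit === undefined ? undefined : hit[0];
}

// The (18, 'HIGH') group from the sample data above.
const prices = [99.95, 99.95, 5.95];
console.log(nthMostFrequent(prices, 0)); // 99.95
console.log(nthMostFrequent(prices, 1)); // 5.95
```

Reversing the comparator (`a[1] - b[1]`) would give the least frequent values first, mirroring the `ORDER BY COUNT(*) ASC` variants in the query.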
      

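      【Comments】:

        【Answer 4】:

        No, BigQuery has no equivalent of the mode() function, but you can define one yourself using the logic from any of the other answers in this thread. You can then call it like this: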

        SELECT mode(`an_array`) AS top_count FROM `somewhere_with_arrays`
        

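        However, this approach results in a subquery per row, which is bad for performance, so be careful with these functions on anything large. I would only use them as a quick fix that favors readability on very small datasets.

        See the two UDFs below. A third approach is to implement a JS function, in which case this one-liner should be useful: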

        return arr.sort((a,b) => arr.filter(v => v===a).length - arr.filter(v => v===b).length).pop();
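
The one-liner above re-counts the array inside every comparison and sorts `arr` in place. A minimal sketch of the same idea with a single counting pass follows; the function name `modeOf` and the tie-breaking rule (the first value to reach the highest count wins) are assumptions for illustration, not part of the original answer. The sample arrays are the ones used in this answer's test data.

```javascript
// Hypothetical helper: returns the most frequent element of an array,
// or null for an empty array. Ties go to the value that reaches the
// highest count first while scanning left to right.
function modeOf(arr) {
  const counts = new Map();
  let best = null;
  let bestCount = 0;
  for (const v of arr) {
    const n = (counts.get(v) || 0) + 1;
    counts.set(v, n);
    if (n > bestCount) {
      best = v;
      bestCount = n;
    }
  }
  return best;
}

console.log(modeOf(['Erdős', 'Turán', 'Erdős', 'Turán', 'Euler', 'Erdős'])); // Erdős
console.log(modeOf(['Euler', 'Euler', 'Gauss', 'Euler']));                   // Euler
```

The loop body could presumably be adapted as the body of a BigQuery JavaScript UDF; that wiring is left out here because it depends on the argument and return types you declare.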
        

        This code defines two mode()-like functions that take an array and return the most common string:

        CREATE TEMPORARY FUNCTION mode1(mystring ANY TYPE)
        RETURNS STRING
        AS
        (
            (
                SELECT var FROM
                (   /* Count occurrences of each value of input */
                    SELECT var, COUNT(*) AS n FROM
                        (   /* Unnest and name */
                            SELECT var FROM UNNEST(mystring) var
                        )
                        GROUP BY var    /* Output is one of existing values */
                )
            ORDER BY n DESC             /* Output is value with HIGHEST n   */
            LIMIT 1                     /* Only ONE string is the output    */
            )
        );
        
        CREATE TEMPORARY FUNCTION mode2(inp ANY TYPE)
        RETURNS STRING
        AS
        (
            (
                SELECT result.value FROM UNNEST( (SELECT APPROX_TOP_COUNT(v,1) AS result FROM UNNEST(inp) v)) result
            )
        );
        
        SELECT
            inp,
            mode1(inp) AS first_logic_output,
            mode2(inp) AS second_logic_output
        FROM
        (
            /* Test data */
            SELECT ['Erdős','Turán', 'Erdős','Turán','Euler','Erdős'] AS inp
            UNION ALL 
            SELECT ['Euler','Euler', 'Gauss', 'Euler'] AS inp
        )
        

        【Comments】:
