【Question title】: BigQuery Standard SQL: using PARTITION BY with the ARRAY_AGG() function
【Posted】: 2019-02-14 05:17:50
【Question】:

I am trying to collapse a column into an array using a PARTITION BY clause with the ARRAY_AGG() function.

My standard SQL in BigQuery is as follows:

    WITH initial_30days AS (
      SELECT
        date,
        fullvisitorId AS user_id,
        visitNumber,
        CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
      FROM `my-data.XXXXXXX.ga_sessions_*`
      WHERE _TABLE_SUFFIX BETWEEN '20181004' AND '20181103'
      GROUP BY 1, 2, 3, 4
    )
    SELECT
      date,
      ARRAY_AGG(sessions) OVER (PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array
    FROM (
      SELECT
        date,
        user_id,
        COUNT(DISTINCT session_id) AS sessions
      FROM initial_30days
      GROUP BY date, user_id
    )
    GROUP BY date, sessions

My expected output is:

+----------+--------------------------+
|   date   |        agg_array         |
+----------+--------------------------+
| 20181004 | [34,21,34,21,6,7,4,43]   |
| 20181005 | [1,5,56,76,23,1,3,54,45] |
| 20181006 | [22,67,43,1,2,67,3,24]   |
| 20181007 | [34,21,34,21,6,7,4,43]   |
+----------+--------------------------+

My current output looks like this, taking a single date value as an example:

+----------+------------------------+
|   date   |       agg_array        |
+----------+------------------------+
| 20181004 | [34]                   |
| 20181004 | [34,21]                |
| 20181004 | [34,21,34]             |
| 20181004 | [34,21,34,21]          |
| 20181004 | [34,21,34,21,6]        |
| 20181004 | [34,21,34,21,6,7]      |
| 20181004 | [34,21,34,21,6,7,4]    |
| 20181004 | [34,21,34,21,6,7,4,43] |
+----------+------------------------+

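As you can see, partitioning the array by date produces one incremental row for each value in the array.

The dataset that the ARRAY_AGG() function is applied to looks like this: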

+----------+------------------+----------+
|   date   |     user_id      | sessions |
+----------+------------------+----------+
| 20181004 | 2526262363754747 |       34 |
| 20181004 | 2525626325173256 |       21 |
| 20181004 | 7436783255747736 |       34 |
| 20181004 | 6526241526363536 |       21 |
| 20181004 | 4252636353637423 |        6 |
| 20181004 | 3636325636673563 |        7 |
+----------+------------------+----------+

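I suspect this happens because I am grouping by sessions in the query above, but I only do that because otherwise I get a validation error like this: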

    SELECT list expression references column sessions which is neither grouped nor aggregated at
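
One possible reading of the behaviour above, offered as a sketch rather than a confirmed diagnosis: an analytic call such as `ARRAY_AGG(...) OVER (...)` never collapses rows, so every input row keeps its own array built from its frame. The outer `GROUP BY date, sessions` runs before the window is evaluated and only removes duplicate (date, sessions) pairs. The query below uses hypothetical inline data (`per_user`, with made-up `user_id` values) and an `ORDER BY user_id` that is not in the original query, added only to make the frame deterministic. It has no outer `GROUP BY` at all, which should also avoid the validation error.

```sql
#standardSQL
-- Hypothetical inline data standing in for the per-user session counts.
WITH per_user AS (
  SELECT '20181004' AS date, 1 AS user_id, 34 AS sessions UNION ALL
  SELECT '20181004', 2, 21 UNION ALL
  SELECT '20181004', 3, 34
)
SELECT
  date,
  -- Analytic form: one output row per input row, each with its own frame.
  ARRAY_AGG(sessions) OVER (
    PARTITION BY date
    ORDER BY user_id  -- assumption: added here so the frame order is defined
    ROWS BETWEEN 5 PRECEDING AND CURRENT ROW
  ) AS agg_array
FROM per_user;
```

Under these assumptions the query should return three rows for 20181004, whose arrays are [34], [34, 21], and [34, 21, 34], which matches the incremental pattern shown in the current output.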

【Comments】:

    Tags: sql google-bigquery


    【Solution 1】:

    The following is for BigQuery Standard SQL.

    Just wrap your original query in the following:

    SELECT date, 
      ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
    FROM (
      ...   
      ...   
    )
    GROUP BY date   
    

    The whole thing will then look like this (and will produce the desired result, while keeping your idea of using a window function):

    #standardSQL
    WITH initial_30days AS (
      SELECT 
        date,
        fullvisitorId AS user_id,
        visitNumber, 
        CONCAT(fullvisitorid, CAST(VisitId AS STRING)) AS session_id
      FROM `my-data.XXXXXXX.ga_sessions_*`
      WHERE _TABLE_SUFFIX BETWEEN '20181004' AND  '20181103'
      GROUP BY 1,2,3,4
    )
    SELECT date, 
      ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
    FROM (
      SELECT
        date, 
        ARRAY_AGG(sessions) OVER(PARTITION BY date ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS agg_array
      FROM(
        SELECT
          date,
          user_id,
          COUNT(DISTINCT( session_id))  AS sessions
        FROM initial_30days
        GROUP BY date,user_id
      )
      GROUP BY date,sessions
    )
    GROUP BY date   
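
The wrapper in this answer can be tried in isolation. The following is a minimal sketch, assuming hypothetical inline rows (`windowed`) that mimic the incremental arrays produced by the window function. BigQuery does not allow an array of arrays, which is presumably why each array is wrapped in a `STRUCT` before aggregation; `ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1` then keeps only the longest array per date, and `[OFFSET(0)].*` unpacks the struct back into an `agg_array` column.

```sql
#standardSQL
-- Hypothetical inline rows imitating the output of the windowed query.
WITH windowed AS (
  SELECT '20181004' AS date, [34] AS agg_array UNION ALL
  SELECT '20181004', [34, 21] UNION ALL
  SELECT '20181004', [34, 21, 34] UNION ALL
  SELECT '20181005', [1] UNION ALL
  SELECT '20181005', [1, 5]
)
SELECT
  date,
  -- Keep the single longest array per date, then unpack the struct.
  ARRAY_AGG(STRUCT(agg_array) ORDER BY ARRAY_LENGTH(agg_array) DESC LIMIT 1)[OFFSET(0)].*
FROM windowed
GROUP BY date;
```

With this sample data the result should be one row per date: [34, 21, 34] for 20181004 and [1, 5] for 20181005. One caveat worth keeping in mind: a frame of `ROWS BETWEEN 5 PRECEDING AND CURRENT ROW` holds at most six values, so the longest array per date cannot exceed six elements regardless of how many users that date has.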
    

    【Comments】:

      【Solution 2】:

      If you want one row per date, you need GROUP BY date:

      SELECT date,
             ARRAY_AGG(sessions) AS agg_array
      FROM (SELECT date, user_id,
                   COUNT(DISTINCT( session_id))  AS sessions
            FROM initial_30days
            GROUP BY date, user_id
           )  du
      GROUP BY date;
      

      If you only want a certain number of values, add LIMIT to ARRAY_AGG(). For example, if you want the session counts for the 5 users with the smallest ids, you could do:

        ARRAY_AGG(sessions ORDER BY user_id LIMIT 5) AS agg_array
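
To see both variants of this answer side by side, here is a small self-contained sketch. The inline table `per_user` and its `user_id` values are hypothetical stand-ins for the per-user session counts, and the `LIMIT` is reduced to 2 only to fit the tiny sample.

```sql
#standardSQL
-- Hypothetical inline data standing in for the per-user session counts.
WITH per_user AS (
  SELECT '20181004' AS date, 1 AS user_id, 34 AS sessions UNION ALL
  SELECT '20181004', 2, 21 UNION ALL
  SELECT '20181004', 3, 34 UNION ALL
  SELECT '20181005', 4, 1 UNION ALL
  SELECT '20181005', 5, 5
)
SELECT
  date,
  -- Aggregate form: collapses each date into a single row.
  ARRAY_AGG(sessions ORDER BY user_id) AS agg_array,
  -- Same aggregation, truncated to the two smallest user ids.
  ARRAY_AGG(sessions ORDER BY user_id LIMIT 2) AS first_two
FROM per_user
GROUP BY date;
```

With this sample data, 20181004 should yield agg_array [34, 21, 34] and first_two [34, 21], while 20181005 should yield [1, 5] in both columns. Unlike the analytic form with OVER(), the aggregate form needs only `GROUP BY date`, so `sessions` never has to appear in the grouping list.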
      

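      【Comments】:

      • Thanks, but the last line of the script already has this.
      • @Gyle . . . I don't know what you are referring to. There is no GROUP BY date in the question.
      • Thanks @Gordon Linoff for replying again. On the last line of the SQL script you can see GROUP BY date,sessions. That is what I was referring to. I would like to GROUP BY date only, as you suggest, but I get a validation error, as I detailed at the bottom of the question.
      • @Gkyle . . . The query in your question is not the query in this answer.
      • Thanks @Gordon Linoff. While your solution does produce the expected output of sorts, I do need the OVER() clause with ARRAY_AGG(), because I am trying to capture a moving window. In that case, using OVER() forces me to group by sessions.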