【Question title】: BigQuery missing rows with SUM OVER PARTITION BY
【Posted】: 2021-02-19 10:06:57
【Question description】:

TL;DR:

Given this table:

WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
  UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
  UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
)

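How can I get a table that also contains the missing date/product combination (2020-11-02, premium), with a fallback value of 0 for diff?

Ideally this should work for any number of products. The list of all products can be obtained like this: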

SELECT ARRAY_AGG(DISTINCT product) FROM subscriptions

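I want to be able to get the daily subscription count for all products or for only some of them.

I think the easiest way to achieve this is to prepare a table like the following: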

|---------------------|------------------|------------------|
|         date        |      product     |       total      |
|---------------------|------------------|------------------|
|      2020-11-01     |      premium     |        100       |
|---------------------|------------------|------------------|
|      2020-11-01     |       basic      |        50        |
|---------------------|------------------|------------------|

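With this table I can easily group by date and product, or by date alone, and sum the totals.

Before getting to the result table, I have already generated a table in which I calculate the subscription difference per day and product: how many new subscribers each product gained and how many unsubscribed.

That table looks like this: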

|---------------------|------------------|------------------|
|         date        |      product     |       diff       |
|---------------------|------------------|------------------|
|      2020-11-01     |      premium     |        50        |
|---------------------|------------------|------------------|
|      2020-11-01     |       basic      |       -20        |
|---------------------|------------------|------------------|

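This means that on November 1 the total number of premium users increased by 50 and the total number of basic users decreased by 20.

The problem is that this intermediate table has no row for a date on which a product did not change; see the example below.


When I started there was no product column; I only had the date and diff columns.

To get from the second table to the first, I used this query, which works perfectly: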

WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, 150 as diff
  UNION ALL SELECT TIMESTAMP("2020-11-02"), -10
  UNION ALL SELECT TIMESTAMP("2020-11-03"), 60
)
SELECT 
  *,
  SUM(diff) OVER (ORDER BY date) as total_subscriptions
FROM subscriptions
ORDER BY date

But when I add the product column and try to compute the running total per day and product, some data points are missing.

WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
  UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
  UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
  UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
)
SELECT 
  *,
  SUM(diff) OVER (PARTITION BY product ORDER BY date) as total_subscriptions
FROM subscriptions
ORDER BY date

--

|---------------------|------------------|------------------|
|         date        |      product     |      total       |
|---------------------|------------------|------------------|
|      2020-11-01     |       basic      |       100        |
|---------------------|------------------|------------------|
|      2020-11-01     |      premium     |        50        |
|---------------------|------------------|------------------|
|      2020-11-02     |       basic      |        90        |
|---------------------|------------------|------------------|
|      2020-11-03     |       basic      |       130        |
|---------------------|------------------|------------------|
|      2020-11-03     |      premium     |        70        |
|---------------------|------------------|------------------|

If I now display the total number of subscriptions per day, I get:

150 -> 90 -> 200

But I expect:

150 -> 140 -> 200

The same goes for the total number of premium subscriptions per day:

50 -> 0 -> 70

But I expect:

50 -> 50 -> 70
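
The numbers above can be reproduced outside BigQuery. The following is a minimal Python sketch, not part of the original query; the function name `sparse_daily_totals` is made up for illustration. It mimics the windowed query by keeping a running total per product, but it only contributes a value on days where that product has a row, which is why premium's 50 drops out on 2020-11-02.

```python
from collections import defaultdict

# Sample rows from the question: (date, product, diff).
rows = [
    ("2020-11-01", "premium", 50),
    ("2020-11-01", "basic", 100),
    ("2020-11-02", "basic", -10),
    ("2020-11-03", "premium", 20),
    ("2020-11-03", "basic", 40),
]

def sparse_daily_totals(rows):
    """Sum per-product running totals per day, using only rows that exist."""
    running = defaultdict(int)  # product -> cumulative diff so far
    per_day = defaultdict(int)  # date -> sum of running totals seen that day
    for day, product, diff in sorted(rows):  # ISO date strings sort by date
        running[product] += diff
        per_day[day] += running[product]
    return dict(per_day)

print(sparse_daily_totals(rows))
# {'2020-11-01': 150, '2020-11-02': 90, '2020-11-03': 200}
```

The running totals themselves are correct; the daily sum is wrong only because no premium row exists for 2020-11-02.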


I think the best option to fix this is to add the missing date/product combinations.

How do I do that?

【Comments】:

  • Please edit your question and show the result you want.
  • Expected output - please clarify!

Tags: sql datetime google-bigquery sum recursive-query


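【Solution 1】:

If I understand you correctly, one approach is to generate a fixed list of dates for the period you want, then cross join it with the list of products. This gives you all possible combinations. You can then bring in the subscriptions table with a left join and finally compute the window sum: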

select dt, p.product, sum(s.diff) over(partition by p.product order by dt) total
from unnest(generate_timestamp_array(
    timestamp('2020-11-01'),
    timestamp('2020-11-03'),
    interval 1 day)
) dt
cross join (
    select 'basic' product
    union all select 'premium'
) p
left join subscriptions s on s.product = p.product and s.date = dt

We can make the query more generic by generating the date range and the product list dynamically:

select dt, p.product, sum(s.diff) over(partition by p.product order by dt) total
from (select min(date) min_dt, max(date) max_dt from subscriptions) d0
cross join unnest(generate_timestamp_array(d0.min_dt, d0.max_dt, interval 1 day)) dt
cross join (select distinct product from subscriptions) p
left join subscriptions s on s.product = p.product and s.date = dt
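
To check what the densified query should return, here is a rough Python sketch of the same three steps: cross join of days and products, left join with a fallback of 0, and a running sum per product. The name `dense_running_totals` is hypothetical, and the sketch assumes at most one row per (day, product) pair, as in the sample data.

```python
from datetime import date, timedelta
from itertools import product as cartesian

# Sample rows from the question: (day, product, diff).
rows = [
    (date(2020, 11, 1), "premium", 50),
    (date(2020, 11, 1), "basic", 100),
    (date(2020, 11, 2), "basic", -10),
    (date(2020, 11, 3), "premium", 20),
    (date(2020, 11, 3), "basic", 40),
]

def dense_running_totals(rows):
    """Return {(day, product): running total} for every day/product pair."""
    diffs = {(d, p): diff for d, p, diff in rows}  # assumes unique (day, product)
    first = min(d for d, _, _ in rows)
    last = max(d for d, _, _ in rows)
    days = [first + timedelta(days=i) for i in range((last - first).days + 1)]
    products = sorted({p for _, p, _ in rows})

    running = dict.fromkeys(products, 0)
    totals = {}
    for d, p in cartesian(days, products):   # plays the role of the cross join
        running[p] += diffs.get((d, p), 0)   # left join, missing diff counts as 0
        totals[(d, p)] = running[p]
    return totals

totals = dense_running_totals(rows)
print(totals[(date(2020, 11, 2), "premium")])  # 50
```

Summing `totals` per day on this sample gives 150, 140, 200, which matches the result the question asks for.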

【Comments】:

    【Solution 2】:

    Use GENERATE_TIMESTAMP_ARRAY:

    WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
      UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
      UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
      UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
      UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
    ),
    dates AS (
      SELECT * 
      FROM UNNEST(GENERATE_TIMESTAMP_ARRAY('2020-11-01 00:00:00', '2020-11-03 00:00:00', INTERVAL 1 DAY)) as date
    ),
    products AS (
      SELECT DISTINCT product FROM subscriptions
    )
    SELECT dates.date, products.product, subscriptions.diff
    FROM dates 
    CROSS JOIN products
    LEFT JOIN subscriptions 
    ON subscriptions.date = dates.date AND subscriptions.product = products.product
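
Note that a left join like the one above leaves diff as NULL for combinations that have no subscription row, whereas the question asks for a fallback of 0; in BigQuery that would typically be handled with COALESCE. The Python sketch below illustrates the difference on the sample data. The helper names `timestamp_range` and `joined_rows` are invented for this illustration, and `timestamp_range` assumes an inclusive end bound, like GENERATE_TIMESTAMP_ARRAY.

```python
from datetime import datetime, timedelta

# Sample data keyed by (timestamp, product) -> diff.
subscriptions = {
    (datetime(2020, 11, 1), "premium"): 50,
    (datetime(2020, 11, 1), "basic"): 100,
    (datetime(2020, 11, 2), "basic"): -10,
    (datetime(2020, 11, 3), "premium"): 20,
    (datetime(2020, 11, 3), "basic"): 40,
}

def timestamp_range(start, end, step=timedelta(days=1)):
    """Timestamps from start to end inclusive, one per (positive) step."""
    out = []
    current = start
    while current <= end:
        out.append(current)
        current += step
    return out

def joined_rows(subscriptions, start, end, fill=None):
    """All (timestamp, product, diff) rows; `fill` replaces a missing diff."""
    products = sorted({p for _, p in subscriptions})
    return [
        (ts, p, subscriptions.get((ts, p), fill))
        for ts in timestamp_range(start, end)
        for p in products
    ]

start, end = datetime(2020, 11, 1), datetime(2020, 11, 3)
raw = joined_rows(subscriptions, start, end)             # missing diff stays None
filled = joined_rows(subscriptions, start, end, fill=0)  # missing diff becomes 0
```

Here `raw` contains `(2020-11-02, "premium", None)` while `filled` contains the same row with 0, which is the shape requested in the question.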
    

    【Comments】:

      【Solution 3】:
            -- Try this: I build a list of products, add a "Total" pseudo-product to that list, and join it with your table to get the data you need.
            WITH subscriptions AS (SELECT TIMESTAMP("2020-11-01") as date, "premium" as product, 50 as diff
              UNION ALL SELECT TIMESTAMP("2020-11-01"), "basic", 100
              UNION ALL SELECT TIMESTAMP("2020-11-02"), "basic", -10
              UNION ALL SELECT TIMESTAMP("2020-11-03"), "premium", 20
              UNION ALL SELECT TIMESTAMP("2020-11-03"), "basic", 40
            ),
      
            product_name as (
            Select product from subscriptions group by 1
            union all
            Select "Total" as product
            )
      
            Select date
                  ,product
                  ,total_subscriptions
            from (      
            Select a.date
                  ,a.product
                  ,diff
                  ,SUM(diff) OVER (PARTITION BY a.product ORDER BY a.date) as total_subscriptions
            from 
            (
            Select date,a.product
            from product_name A
             join subscriptions B
             on 1=1
             where a.product !='Total'
            group by 1,2
            ) A
            left join subscriptions B 
            on A.product = B.product
            and A.date = B.date
            group by 1,2,3
            ) group by 1,2,3
            union all
            Select date
                  ,product
                  ,total_subscriptions
            from 
            (
            Select date,a.product
                  ,diff
                  ,SUM(diff) OVER (PARTITION BY a.product ORDER BY date) as total_subscriptions
            from product_name A
             join subscriptions B
             on 1=1
             where a.product ='Total'
            group by 1,2,3
            ) group by 1,2,3
            order by 1,2
      

      【Comments】:
