【Question title】: LIMIT per group - Google BigQuery/Standard SQL
【Posted】: 2019-06-14 17:01:03
【Question】:

I have a table like the following (example here):

CREATE TABLE topics (
  name varchar(64),
  url varchar(253),
  statistic integer,
  pubdate timestamp
);

INSERT INTO topics VALUES
('a',  'b',  100,  TIMESTAMP '2011-05-16 15:36:38'),  
('a',  'c',  110,  TIMESTAMP '2014-04-01 00:00:00'),  
('a',  'd',  120,  TIMESTAMP '2014-04-01 00:00:00'),  
('a',  'e',  90,   TIMESTAMP '2011-05-16 15:36:38'), 
('a',  'f',  80,   TIMESTAMP '2014-04-01 00:00:00'), 
('a',  'g',  70,   TIMESTAMP '2011-05-16 15:36:38'), 
('a',  'h',  150,  TIMESTAMP '2014-04-01 00:00:00'),  
('a',  'i',  50,   TIMESTAMP '2011-05-16 15:36:38'), 
('b',  'j',  10,   TIMESTAMP '2014-04-01 00:00:00'), 
('b',  'k',  11,   TIMESTAMP '2011-05-16 15:36:38'), 
('b',  'l',  12,   TIMESTAMP '2014-04-01 00:00:00'), 
('b',  'm',  9,    TIMESTAMP '2011-05-16 15:36:38'),
('b',  'n',  8,    TIMESTAMP '2014-04-01 00:00:00'),
('b',  'o',  7,    TIMESTAMP '2011-05-16 15:36:38'),
('b',  'p',  15,   TIMESTAMP '2014-04-01 00:00:00'), 
('b',  'q',  5,    TIMESTAMP '2011-05-16 15:36:38'),
('b',  'r',  2,    TIMESTAMP '2014-04-01 00:00:00')

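I want to get the top two rows by statistic for each (name, date(pubdate)) combination.

In other words, I want to GROUP BY name, date(pubdate), but instead of applying an aggregate function, simply take the top two rows of each group by statistic. (So I know it is not really a GROUP BY, but rather greatest-n-per-group.)

I am using Google BigQuery with Standard SQL. I have looked at some other solutions, but I am not sure how to achieve the result in this case.

Desired result: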

name    url     statistic   date

a       b       100         2011-05-16
a       e       90          2011-05-16

a       h       150         2014-04-01
a       d       120         2014-04-01

b       m       9           2011-05-16
b       k       11          2011-05-16

b       l       12          2014-04-01
b       p       15          2014-04-01

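【Comments】:

  • Do you have a column with a primary key? Even with ORDER BY statistic, the result can still be non-deterministic (random), because the values in the statistic column are not unique.
  • You mean that in the case of "ties" on statistic, the result is not guaranteed to be deterministic? @RaymondNijland
  • "You mean that in the case of 'ties' on statistic, the result is not guaranteed to be deterministic?" Yes, that is what I mean, @BradSolomon. Ideally you should use ORDER BY some_column, <some_columns_with_primary_key>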
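
The tie-breaker suggested in the comments above can be sketched as follows. This is only an illustration, and it assumes that url is unique within each (name, date(pubdate)) group, which the question does not state; any column or set of columns that is actually unique could take its place as the second sort key.

```sql
#standardSQL
-- Sketch only. Assumption: url is unique within each (name, day) group,
-- so adding it as a second sort key makes the ranking deterministic.
-- `topics` is the table from the question; in BigQuery it may need to be
-- qualified with a dataset name.
SELECT name, url, statistic, DATE(pubdate) AS day
FROM (
  SELECT *,
    ROW_NUMBER() OVER (
      PARTITION BY name, DATE(pubdate)
      ORDER BY statistic DESC, url  -- url breaks ties on statistic
    ) AS rn
  FROM topics
)
WHERE rn <= 2
```

Without the second key, two rows with the same statistic may swap places between runs, so which of them lands in the top two is not fixed.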

Tags: sql google-bigquery greatest-n-per-group


【Solution 1】:

The following is for BigQuery Standard SQL:

#standardSQL
SELECT * EXCEPT(arr) FROM (
  SELECT name, DATE(pubdate) day, 
    ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) arr
  FROM `project.dataset.table`   
  GROUP BY name, day
), UNNEST(arr)
-- ORDER BY name, day  

You can test it with the sample data from the question, as in the example below:

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'a' name, 'b' url,  100 statistic,  TIMESTAMP '2011-05-16 15:36:38' pubdate UNION ALL  
  SELECT 'a', 'c',  110,  '2014-04-01 00:00:00' UNION ALL  
  SELECT 'a', 'd',  120,  '2014-04-01 00:00:00' UNION ALL  
  SELECT 'a', 'e',  90,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'a', 'f',  80,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'a', 'g',  70,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'a', 'h',  150,  '2014-04-01 00:00:00' UNION ALL  
  SELECT 'a', 'i',  50,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'b', 'j',  10,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'b', 'k',  11,   '2011-05-16 15:36:38' UNION ALL 
  SELECT 'b', 'l',  12,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'b', 'm',  9,    '2011-05-16 15:36:38' UNION ALL
  SELECT 'b', 'n',  8,    '2014-04-01 00:00:00' UNION ALL
  SELECT 'b', 'o',  7,    '2011-05-16 15:36:38' UNION ALL
  SELECT 'b', 'p',  15,   '2014-04-01 00:00:00' UNION ALL 
  SELECT 'b', 'q',  5,    '2011-05-16 15:36:38' UNION ALL
  SELECT 'b', 'r',  2,    '2014-04-01 00:00:00' 
)
SELECT * EXCEPT(arr) FROM (
  SELECT name, DATE(pubdate) day, 
    ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) arr
  FROM `project.dataset.table`  
  GROUP BY name, day
), UNNEST(arr)
ORDER BY name, day   

Result:

Row name    day         url statistic    
1   a       2011-05-16  b   100  
2   a       2011-05-16  e   90   
3   a       2014-04-01  h   150  
4   a       2014-04-01  d   120  
5   b       2011-05-16  k   11   
6   b       2011-05-16  m   9    
7   b       2014-04-01  p   15   
8   b       2014-04-01  l   12   

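【Comments】:

  • Nice, this looks like exactly what I want.
  • @BradSolomon - glad it helped! Consider voting up then :o) if you haven't already
  • @BradSolomon Note my comment about deterministic results: this is not guaranteed to always give a purely deterministic (fixed) result, because the statistic column is not unique... not sure how important that is for your use case.
  • Thanks @RaymondNijland, but in this case I'm willing to accept non-determinism for the very rare ties.
  • Never mind, got it - just include pubdate in the STRUCT call and leave it out of the top select.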
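
One reading of the last comment above, sketched against the Solution 1 query: the full pubdate timestamp cannot sit in the grouped SELECT list on its own, but it can travel inside each STRUCT and reappear after UNNEST. The table name `project.dataset.table` is the same placeholder used in the answer, and this is a sketch rather than the commenter's exact query.

```sql
#standardSQL
-- Sketch only: `project.dataset.table` is a placeholder table name.
SELECT * EXCEPT(arr) FROM (
  SELECT name, DATE(pubdate) AS day,
    -- pubdate is carried inside the STRUCT instead of the grouped SELECT list
    ARRAY_AGG(STRUCT(url, statistic, pubdate) ORDER BY statistic DESC LIMIT 2) AS arr
  FROM `project.dataset.table`
  GROUP BY name, day
), UNNEST(arr)
ORDER BY name, day
```

The output then has the columns name, day, url, statistic and pubdate; day can be added to the EXCEPT list if only the full timestamp is wanted.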
【Solution 2】:

Use the ARRAY_AGG function:

SELECT
  name,
  -- the alias differs from the source column so that GROUP BY is unambiguous
  DATE(pubdate) AS day,
  ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) AS top_urls
FROM dataset.table
GROUP BY name, day

You can use a subquery with UNNEST to get plain rows, without the array, as output:

SELECT name, day, url, statistic
FROM (
  SELECT
    name,
    -- the alias differs from the source column so that GROUP BY is unambiguous
    DATE(pubdate) AS day,
    ARRAY_AGG(STRUCT(url, statistic) ORDER BY statistic DESC LIMIT 2) AS top_urls
  FROM dataset.table
  GROUP BY name, day
), UNNEST(top_urls)

【Comments】:

【Solution 3】:

with xx as (
  select name, url, statistic, pubdate,
    -- number the rows within each (name, day) group, highest statistic first
    row_number() over(partition by name, date(pubdate) order by statistic desc) rn
  from topics
)
select * except(rn)
from xx
where rn <= 2;

【Comments】:
