【Question title】: BigQuery DISTINCT ON and GROUP BY
【Posted】: 2018-11-25 03:38:15
【Question】:

Starting from "Select first row in each GROUP BY group?", I am trying to do something similar in Google BigQuery.

Dataset: fh-bigquery:reddit_comments.2018_01

Goal: for each link_id (Reddit submission), select the first comment as ordered by created_utc.

SELECT body,link_id 
FROM [fh-bigquery:reddit_comments.2018_01] 
where subreddit_id == "t5_2zkvo"  
group by  link_id ,body, created_utc  
order by link_id ,body,  created_utc desc 

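Currently it does not work, because it still does not give me unique/distinct parent_id(s).

Please, and thank you!


Edit: I was incorrect when I said that parent_id identifies the submission; it is actually link_id.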

【Comments】:

    Tags: sql google-bigquery reddit


    【Solution 1】:

    The following is for BigQuery Standard SQL:

    #standardSQL
    SELECT 
      ARRAY_AGG(body ORDER BY created_utc LIMIT 1)[OFFSET(0)] body, 
      link_id
    FROM `fh-bigquery.reddit_comments.2018_01`
    WHERE subreddit_id = 't5_2zkvo'
    GROUP BY link_id
    -- ORDER BY link_id
    

    【Comments】:

      【Solution 2】:

      We can use ROW_NUMBER() here:

      SELECT body, parent_id, created_utc
      FROM
      (
          SELECT *, ROW_NUMBER() OVER (PARTITION BY parent_id ORDER BY created_utc) rn
          FROM [fh-bigquery:reddit_comments.2018_01]
          WHERE subreddit_id = 't5_2zkvo'
      ) t
      WHERE rn = 1
      ORDER BY parent_id ,body, created_utc DESC;
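
The ROW_NUMBER() pattern above can be tried on a small scale without access to BigQuery. The following is a minimal sketch, not the answerer's code: it runs the same query shape against an in-memory SQLite database through Python's `sqlite3` module (assuming a bundled SQLite of version 3.25 or later, which is when window functions were added). The table name `comments`, the helper `first_comment_per_link`, and the sample rows are all hypothetical, and the partition key is `link_id`, following the edit in the question.

```python
import sqlite3

# Made-up rows standing in for the Reddit comments table: (link_id, body, created_utc).
sample_rows = [
    ("t3_a", "second comment on a", 200),
    ("t3_a", "first comment on a", 100),
    ("t3_b", "only comment on b", 150),
    ("t3_c", "late comment on c", 900),
    ("t3_c", "early comment on c", 300),
]


def first_comment_per_link(rows):
    """Return the earliest comment for each link_id (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments (link_id TEXT, body TEXT, created_utc INTEGER)"
    )
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
    # Number the comments inside each link_id by creation time, then keep rank 1.
    query = """
        SELECT link_id, body, created_utc
        FROM (
            SELECT c.*,
                   ROW_NUMBER() OVER (
                       PARTITION BY link_id ORDER BY created_utc
                   ) AS rn
            FROM comments AS c
        ) AS t
        WHERE rn = 1
        ORDER BY link_id
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in first_comment_per_link(sample_rows):
    print(row)
```

Each link_id appears exactly once in the result, which is the property the original GROUP BY link_id, body, created_utc query was missing.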
      

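      Note that you could have continued with your current approach, but you would have to phrase the query as a join between the table and a subquery that finds the earliest created_utc for each parent_id: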

      SELECT t1.*
      FROM [fh-bigquery:reddit_comments.2018_01] t1
      INNER JOIN
      (
          SELECT parent_id, MIN(created_utc) AS first_created_utc
          FROM [fh-bigquery:reddit_comments.2018_01]
          GROUP BY parent_id
      ) t2
          ON t1.parent_id = t2.parent_id AND t1.created_utc = t2.first_created_utc;
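
One behavior of the join form is worth seeing on toy data: when two comments in the same group share the minimum created_utc, the join returns both of them, whereas a ROW_NUMBER() ... rn = 1 filter returns exactly one row per group. The sketch below illustrates this under the same assumptions as any local stand-in: it uses an in-memory SQLite database through Python's `sqlite3` module rather than BigQuery, the table name `comments`, the helper `earliest_by_join`, and the rows are hypothetical, and the key is `link_id` as per the edit in the question.

```python
import sqlite3

# Made-up rows: two comments on t3_a share the earliest timestamp (a tie).
tied_rows = [
    ("t3_a", "x", 100),
    ("t3_a", "y", 100),
    ("t3_a", "z", 200),
    ("t3_b", "only", 150),
]


def earliest_by_join(rows):
    """Return rows whose created_utc equals their group's minimum (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments (link_id TEXT, body TEXT, created_utc INTEGER)"
    )
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", rows)
    # Join each comment to the earliest timestamp of its link_id.
    query = """
        SELECT t1.link_id, t1.body, t1.created_utc
        FROM comments AS t1
        INNER JOIN (
            SELECT link_id, MIN(created_utc) AS first_created_utc
            FROM comments
            GROUP BY link_id
        ) AS t2
          ON t1.link_id = t2.link_id
         AND t1.created_utc = t2.first_created_utc
        ORDER BY t1.link_id, t1.body
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


for row in earliest_by_join(tied_rows):
    print(row)
```

Here t3_a comes back twice because of the tie. If exactly one row per group is required, the ROW_NUMBER() form is the safer choice, ideally with an extra tie-breaking column in its ORDER BY.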
      

      【Comments】:
