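【Question Title】: How to do subqueries in BigQuery?
【Posted】: 2016-06-03 23:03:55
【Question Description】:

I'm trying to work with the reddit data on BigQuery, and I want to see comments and their replies in a single row. I can see that BigQuery supports subqueries, but I can't work out how to construct the query. Because of how the data is structured, I have to use a subquery to self-join the table: specifically, I want to join id to parent_id, but I need to modify id before I can join on it. Here is how I tried to write the query: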

SELECT 
  p.subreddit, 
  p.body AS first_body,
  p.score AS first_score,
  CONCAT('t1_',p.id) AS first_id ,
  c.last_body,
  c.last_score,
  c.last_id 
FROM 
[fh-bigquery:reddit_comments.2016_01] p,
(
  SELECT 
    body AS last_body,
    score AS last_score,
    CONCAT('t1_',id) AS last_id,
    parent_id,
    author,
    body 
  FROM  [fh-bigquery:reddit_comments.2016_01] 
  WHERE body != '[deleted]' 
  AND author != '[deleted]' 
  AND score > 1
)  c
WHERE  p.first_id = c.parent_id  
AND p.score > 1 
AND  p.author != '[deleted]' 
AND p.body != '[deleted]';

The error I get is:

Field 'c.parent_id' not found in table 'fh-bigquery:reddit_comments.2016_01'; did you mean 'parent_id'?

You can run the query here: https://bigquery.cloud.google.com/table/fh-bigquery:reddit_comments.2016_01

I'm not sure how to fix this. What is the correct way to write this join and get the query to run?

【Question Discussion】:

    Tags: sql subquery google-bigquery reddit bigdata


    【Solution 1】:

    You probably want something like the following (just a guess):

    SELECT 
      p.subreddit, 
      p.body AS first_body,
      p.score AS first_score,
      CONCAT('t1_',p.id) AS first_id ,
      c.last_body,
      c.last_score,
      c.last_id 
    FROM 
    [fh-bigquery:reddit_comments.2016_01] p
    JOIN (
      SELECT 
        body AS last_body,
        score AS last_score,
        CONCAT('t1_',id) AS last_id,
        parent_id,
        author,
        body 
      FROM  [fh-bigquery:reddit_comments.2016_01] 
      WHERE body != '[deleted]' 
      AND author != '[deleted]' 
      AND score > 1
    )  c
    ON  p.link_id = c.parent_id  
    WHERE p.score > 1 
    AND  p.author != '[deleted]' 
    AND p.body != '[deleted]'
    LIMIT 100
    

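    See more about JOINs.

    Note that I only converted your query to use a proper JOIN; the query logic itself is still yours to polish as needed.

    Added to address the additional information in your comment: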

    SELECT 
      subreddit, 
      first_body,
      first_score,
      first_id ,
      last_body,
      last_score,
      last_id 
    FROM (
      SELECT 
        subreddit, 
        body AS first_body,
        score AS first_score,
        CONCAT('t1_',id) AS first_id 
      FROM [fh-bigquery:reddit_comments.2016_01]
      WHERE score > 1 
      AND author != '[deleted]' 
      AND body != '[deleted]'
    ) p
    JOIN (
      SELECT 
        body AS last_body,
        score AS last_score,
        CONCAT('t1_',id) AS last_id,
        parent_id,
        author,
        body 
      FROM  [fh-bigquery:reddit_comments.2016_01] 
      WHERE body != '[deleted]' 
      AND author != '[deleted]' 
      AND score > 1
    )  c
    ON  p.first_id = c.parent_id  
    LIMIT 100  
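
The derived-key self-join above can be sanity-checked locally on a few rows before running it against the full dataset. The following is a minimal sketch using Python's built-in sqlite3 module, not BigQuery: the `comments` table and every row in it are made up for illustration, and SQLite's `||` operator stands in for `CONCAT`. It shows only the join logic (both sides are subqueries, so the prefixed id is a real column by the time the ON clause runs), not legacy SQL syntax.

```python
import sqlite3

# Toy stand-in for the reddit comments table; all rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id TEXT, parent_id TEXT, subreddit TEXT, "
    "body TEXT, author TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?, ?, ?)",
    [
        ("a1", "t3_post", "sql", "parent comment", "alice", 5),
        ("b2", "t1_a1", "sql", "reply to a1", "bob", 3),
        ("c3", "t1_a1", "sql", "[deleted]", "[deleted]", 9),
        ("d4", "t1_b2", "sql", "low-score reply", "carol", 1),
    ],
)

# Both sides are subqueries, so the prefixed id exists as a plain column
# (first_id) when the ON clause compares it with parent_id.
PAIRS_SQL = """
SELECT p.first_id, p.first_body, c.last_id, c.last_body
FROM (
  SELECT 't1_' || id AS first_id, body AS first_body
  FROM comments
  WHERE score > 1 AND author != '[deleted]' AND body != '[deleted]'
) p
JOIN (
  SELECT 't1_' || id AS last_id, body AS last_body, parent_id
  FROM comments
  WHERE score > 1 AND author != '[deleted]' AND body != '[deleted]'
) c
ON p.first_id = c.parent_id
"""

pairs = conn.execute(PAIRS_SQL).fetchall()
for row in pairs:
    print(row)
```

With these toy rows, only the reply from `bob` survives the filters and matches a parent, so a single parent/reply pair comes back; the deleted reply and the low-score reply are dropped before the join.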
    

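    【Discussion】:

    • Mikhail, I can't use this style as-is because of the join clause. I need to join on concat('t1_',p.id) = c.parent_id, because id is missing the leading "t1_" string. BigQuery doesn't allow joining on a field that doesn't exist in the table, so I believe I need to modify the query to use a subselect. This is the error I get when I try to use concat: Query Failed Error: ON clause must be AND of = comparisons of one field name from each table, with all field names prefixed with table name.
    • Added to my original answer to reflect your specification for the ON clause.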
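    【Solution 2】:

    In BigQuery's legacy SQL dialect (the one this query is written in), a comma means UNION ALL rather than JOIN. You need to write the JOIN explicitly with the JOIN keyword.

    I'd also recommend pushing both sides of the join into subqueries, to make sure all the filters are applied before the join is executed. (The join is by far the most expensive part of the query, so applying the filters first will make sure your query runs as fast as possible.)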
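
As a rough illustration of the "filter before joining" advice, the sketch below runs the same self-join two ways and checks that both return the same pairs. It uses Python's built-in sqlite3 with a made-up `comments` table (ids already carry the `t1_` prefix to keep it short), so it is not BigQuery and says nothing about legacy SQL's comma behavior. It demonstrates only that, for an inner join, moving the filters into subqueries does not change the result; the speed benefit the answer describes depends on BigQuery's execution and is not measured here.

```python
import sqlite3

# Made-up rows; ids are pre-prefixed so the join key needs no rewriting.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id TEXT, parent_id TEXT, body TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO comments VALUES (?, ?, ?, ?)",
    [
        ("t1_a", "t3_post", "top-level", 10),
        ("t1_b", "t1_a", "good reply", 4),
        ("t1_c", "t1_a", "ignored reply", 0),
        ("t1_d", "t1_c", "reply to ignored", 7),
    ],
)

# Variant 1: join the full table to itself, then filter in the outer WHERE.
FILTER_AFTER = """
SELECT p.id, c.id
FROM comments p
JOIN comments c ON p.id = c.parent_id
WHERE p.score > 1 AND c.score > 1
ORDER BY p.id, c.id
"""

# Variant 2: filter each side in a subquery, then join the smaller inputs.
FILTER_BEFORE = """
SELECT p.id, c.id
FROM (SELECT id FROM comments WHERE score > 1) p
JOIN (SELECT id, parent_id FROM comments WHERE score > 1) c
ON p.id = c.parent_id
ORDER BY p.id, c.id
"""

after_rows = conn.execute(FILTER_AFTER).fetchall()
before_rows = conn.execute(FILTER_BEFORE).fetchall()
print(after_rows)
print(before_rows)
```

Both variants return the same single pair here; the second one simply hands the join fewer rows to compare, which is the point of the recommendation.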

    【Discussion】:
