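【Question title】: BigQuery - Remove duplicates from array
【Posted】: 2020-03-26 10:58:30
【Question】:

Using BigQuery, I'd like to group pages by their title in a single query and compute different metrics for each group. Since the rules on titles are not mutually exclusive, I do it this way: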

SELECT SUM(views) views, title_group
FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
CROSS JOIN
UNNEST([
    CASE WHEN (title LIKE '%game%') 
    THEN 'games_group' END, 
    CASE WHEN (title LIKE '%sport%') 
    THEN 'sports_group' END
]) AS title_group
WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
GROUP BY title_group

The result is:

views       ...   title_group
3414469869  ... 
4355264     ...   games_group
1361074     ...   sports_group

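However, the view count 3414469869 for pages that belong to no group is wrong. When the title does not contain 'game' (or 'sport'), we get UNNEST([NULL, 'sports_group']) (or UNNEST(['games_group', NULL])), so the views are still counted toward the NULL group even though the page belongs to a group. When the title contains neither 'game' nor 'sport', the array is [NULL, NULL] and the views are counted twice.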
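
To make the double counting concrete, here is a minimal Python sketch (not BigQuery) that mimics what CROSS JOIN UNNEST does with the two-element array. The sample rows and the helper name `expand` are made up for illustration; they are not taken from the real table.

```python
from collections import defaultdict

# Hypothetical sample rows (title, views), invented for this sketch.
rows = [
    ("video game", 10),
    ("sport news", 20),
    ("sport game", 5),
    ("weather", 100),
]

def expand(title):
    """Mimic the array built by the two CASE expressions (None stands for NULL)."""
    return [
        "games_group" if "game" in title else None,
        "sports_group" if "sport" in title else None,
    ]

totals = defaultdict(int)
for title, views in rows:
    # CROSS JOIN UNNEST yields one output row per array element, NULLs included.
    for group in expand(title):
        totals[group] += views

print(dict(totals))
```

In this sketch the None bucket ends up with 230 views, although only the 100 views of "weather" belong to pages that match neither pattern: "weather" is counted twice, and "video game" and "sport news" each leak into the bucket once.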

Is there a way to remove duplicates from the array?

【Comments】:

    Tags: sql google-bigquery


    【Answer 1】:

    How about adding another group?

    SELECT SUM(views) views, title_group
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019` CROSS JOIN
         UNNEST([CASE WHEN title LIKE '%game%' THEN 'games_group' END, 
                 CASE WHEN title LIKE '%sport%' THEN 'sports_group' END,
                 CASE WHEN title NOT LIKE '%game%' AND title NOT LIKE '%sport%' THEN 'Neither' END
                ]) AS title_group
    WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
          wiki = 'en' AND
          title_group IS NOT NULL
    GROUP BY title_group;
    

    Note: this does not account for NULL titles. I don't know whether that matters.

    However, I would express this using two columns:

    SELECT (title LIKE '%game%') AS is_game,
           (title LIKE '%sport%') AS is_sport,
           SUM(views) AS views
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019`
    WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
          wiki = 'en'
    GROUP BY is_game, is_sport;
    

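    This does not return the same rows as your query: instead of one row per group, you get one row per combination of is_game and is_sport. But you can see the combinations.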
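
As a rough illustration of the two-column grouping, the following Python sketch groups by the pair of flags, so every row lands in exactly one bucket. The sample rows are hypothetical and the variable names are chosen for this sketch only.

```python
from collections import defaultdict

# Hypothetical sample rows (title, views), invented for this sketch.
rows = [
    ("video game", 10),
    ("sport news", 20),
    ("sport game", 5),
    ("weather", 100),
]

by_flags = defaultdict(int)
for title, views in rows:
    # Equivalent of GROUP BY is_game, is_sport: one key per flag combination.
    key = ("game" in title, "sport" in title)
    by_flags[key] += views

for (is_game, is_sport), views in sorted(by_flags.items()):
    print(is_game, is_sport, views)
```

Each page is counted once, but the totals per group are no longer in a single column: a page matching both patterns sits in its own (True, True) row.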

    EDIT:

    Now that I think about it, you just want a LEFT JOIN:

    SELECT g.title_group, SUM(pv.views) AS views
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019` pv LEFT JOIN
         (SELECT '%game%' AS pattern, 'games_group' AS title_group UNION ALL
          SELECT '%sport%', 'sports_group'
         ) g
         ON pv.title LIKE g.pattern
    WHERE DATE(pv.datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND
          pv.wiki = 'en'
    GROUP BY g.title_group;
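
The LEFT JOIN idea can be tried on a toy table. The sketch below uses Python's built-in sqlite3 module, so it runs SQLite rather than BigQuery; the table name `pageviews` and its rows are hypothetical. Whether BigQuery accepts a LIKE condition in a LEFT JOIN is not verified here, and SQLite's LIKE is case-insensitive for ASCII by default, which may differ from BigQuery.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (title TEXT, views INTEGER)")
conn.executemany(
    "INSERT INTO pageviews VALUES (?, ?)",
    # Hypothetical sample rows, invented for this sketch.
    [("video game", 10), ("sport news", 20), ("sport game", 5), ("weather", 100)],
)

query = """
SELECT g.title_group, SUM(pv.views) AS views
FROM pageviews pv
LEFT JOIN (
    SELECT '%game%' AS pattern, 'games_group' AS title_group
    UNION ALL
    SELECT '%sport%', 'sports_group'
) g ON pv.title LIKE g.pattern
GROUP BY g.title_group
"""

# A page matching no pattern keeps a single row with a NULL title_group.
result = dict(conn.execute(query).fetchall())
print(result)
```

Here the NULL group (None in Python) holds only the 100 views of the page that matches neither pattern, while a page matching both patterns still contributes to both groups.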
    

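    【Comments】:

    • Yes, adding another group could be a solution! As for the second query, I can't use it because I really need a single column.
    • @Rebecca . . . I think the edited solution is what you really want.
    【Answer 2】:

    Below is for BigQuery Standard SQL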

    #standardSQL
    SELECT SUM(views) views, title_group
    FROM `fh-bigquery.wikipedia_v3.pageviews_2019`,
    UNNEST(
        CASE WHEN REGEXP_CONTAINS(title, r'game|sport') THEN 
          [
            CASE WHEN (title LIKE '%game%') THEN 'games_group' END,
            CASE WHEN (title LIKE '%sport%') THEN 'sports_group' END
          ]
          ELSE ['other']
        END
    ) AS title_group
    WHERE DATE(datehour) BETWEEN '2019-01-01' AND '2019-01-10' AND wiki = 'en'
    AND   title_group IS NOT NULL
    GROUP BY title_group
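
The logic of this query can be sketched in Python as follows: a page gets its group labels only when it matches at least one pattern, falls back to a single 'other' label otherwise, and NULL labels are dropped before aggregation. The sample rows and the helper name `groups_for` are assumptions made for this sketch.

```python
import re
from collections import defaultdict

# Hypothetical sample rows (title, views), invented for this sketch.
rows = [
    ("video game", 10),
    ("sport news", 20),
    ("sport game", 5),
    ("weather", 100),
]

def groups_for(title):
    """Mimic the outer CASE: group labels if any pattern matches, else ['other']."""
    if re.search(r"game|sport", title):
        labels = [
            "games_group" if "game" in title else None,
            "sports_group" if "sport" in title else None,
        ]
    else:
        labels = ["other"]
    # Equivalent of WHERE title_group IS NOT NULL.
    return [label for label in labels if label is not None]

totals = defaultdict(int)
for title, views in rows:
    for group in groups_for(title):
        totals[group] += views

print(dict(totals))
```

With these rows, 'other' receives exactly the 100 views of the unmatched page, and no NULL group remains.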
    

    【Comments】:
