【Question title】: Standard SQL - How to count frequency of values in array
【Posted】: 2020-05-09 23:21:39
【Question】:

I have a table produced by the query below:

SELECT 
  fullVisitorId,
  COUNT(fullVisitorId) as id_count,
  ARRAY_AGG(trafficSource.medium) AS trafic_medium
FROM 
  `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`
GROUP BY
  fullVisitorId
ORDER BY
  id_count DESC

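For each value in the trafic_medium column (for example cpc, referral, organic), I am trying to find out how often it occurs in the array. Ideally I would add a new column "count" that shows how many times that value occurs, like this: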

+-----------+---------+------+
| array_agg | medium  | count|
+-----------+---------+------+
| 123       | cpc     |   2  |
+-----------+---------+------+
|           | organic |   1  |
+-----------+---------+------+
|           | cpc     |   2  |
+-----------+---------+------+
| 456       | organic |   2  |
+-----------+---------+------+
|           | organic |   2  |
+-----------+---------+------+
|           | cpc     |   1  |
+-----------+---------+------+

I am new to SQL, so I am quite confused.

This is what I have tried so far:

WITH medium AS
(
    SELECT 
        fullVisitorId,
        COUNT(fullVisitorId) as id_count,
        ARRAY_AGG(trafficSource.medium) AS trafic_medium
    FROM 
        `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`
    GROUP BY
        fullVisitorId
    ORDER BY
        id_count DESC
) 
SELECT
    fullVisitorId,
    trafic_medium,
    (SELECT AS STRUCT Any_Value(trafic_medium) AS name, COUNT(*) AS count
FROM 
    UNNEST(trafic_medium) AS trafic_medium) AS trafic_medium_2,
FROM 
    medium

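This is based on the thread: How to count frequency of elements in a bigquery array field

However, this only shows the count for the single value picked by Any_Value, not for every distinct value.

I would appreciate any help!

P.S. I am running this in BigQuery on the `bigquery-public-data.google_analytics_sample` dataset.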

【Comments on the question】:

    Tags: sql arrays count google-bigquery


    【Solution 1】:

    Below is a BigQuery Standard SQL query to help you get started

    #standardSQL
    SELECT id, trafic_medium,
      ARRAY(
        SELECT AS STRUCT medium, COUNT(1) `count`
        FROM t.trafic_medium medium
        GROUP BY medium
      ) stats
    FROM `project.dataset.table` t
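
    As a rough sketch of how this could be wired to the table from the question (an assumption about the intended use, not part of the original answer): the CTE repeats the question's aggregation, and the correlated subquery spells out the array scan with an explicit UNNEST. The alias m and the inner ORDER BY are illustrative additions.

```sql
#standardSQL
-- Sketch only: combines the question's aggregation with the per-array counting above.
WITH medium AS (
  SELECT
    fullVisitorId,
    COUNT(fullVisitorId) AS id_count,
    ARRAY_AGG(trafficSource.medium) AS trafic_medium
  FROM `bigquery-public-data.google_analytics_sample.ga_sessions_20170101`
  GROUP BY fullVisitorId
)
SELECT
  fullVisitorId,
  id_count,
  trafic_medium,
  ARRAY(
    -- One struct per distinct medium in this visitor's array.
    SELECT AS STRUCT m AS medium, COUNT(1) AS `count`
    FROM UNNEST(t.trafic_medium) AS m   -- hypothetical alias m
    GROUP BY m
    ORDER BY `count` DESC               -- optional: most frequent first
  ) AS stats
FROM medium AS t
ORDER BY id_count DESC
```

    Unlike the attempt in the question, the subquery has a GROUP BY, so COUNT is evaluated once per distinct value rather than once for the whole array.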
    

    You can test it with the sample/dummy data from your question, as in the example below

    #standardSQL
    WITH `project.dataset.table` AS (
      SELECT 123 id, ['cpc', 'organic', 'cpc'] trafic_medium UNION ALL
      SELECT 456, ['organic', 'organic', 'cpc']
    )
    SELECT id, trafic_medium,
      ARRAY(
        SELECT AS STRUCT medium, COUNT(1) `count`
        FROM t.trafic_medium medium
        GROUP BY medium
      ) stats
    FROM `project.dataset.table` t
    -- ORDER BY id   
    

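    The result will be (the order of elements inside stats may vary)

    +-----+-------------------------+------------------------+
    | id  | trafic_medium           | stats (medium, count)  |
    +-----+-------------------------+------------------------+
    | 123 | [cpc, organic, cpc]     | (cpc, 2), (organic, 1) |
    | 456 | [organic, organic, cpc] | (organic, 2), (cpc, 1) |
    +-----+-------------------------+------------------------+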

    As an option, you can use the version below

    #standardSQL
    SELECT id, 
      ARRAY(
        SELECT AS STRUCT medium, `count`
        FROM t.trafic_medium medium
        LEFT JOIN (
          SELECT AS STRUCT medium, COUNT(1) `count`
          FROM t.trafic_medium medium
          GROUP BY medium
        ) stats
        USING(medium) 
      ) trafic_medium  
    FROM `project.dataset.table` t
    -- ORDER BY id   
    

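    which (if applied to the same dummy data) will output the following

    +-----+--------------------------------------+
    | id  | trafic_medium (medium, count)        |
    +-----+--------------------------------------+
    | 123 | (cpc, 2), (organic, 1), (cpc, 2)     |
    | 456 | (organic, 2), (organic, 2), (cpc, 1) |
    +-----+--------------------------------------+

    This version looks closer to your expected result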
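
    If a flat table with one row per array element is wanted, as drawn in the question, one possible alternative is to unnest the array and use a window function. This is a different technique from the two queries above and only a sketch on the same dummy data; the column name pos is a hypothetical addition used to keep the original element order.

```sql
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 123 id, ['cpc', 'organic', 'cpc'] trafic_medium UNION ALL
  SELECT 456, ['organic', 'organic', 'cpc']
)
SELECT
  id,
  medium,
  -- How many times this medium occurs within the same id's array.
  COUNT(*) OVER (PARTITION BY id, medium) AS `count`
FROM `project.dataset.table` t
CROSS JOIN UNNEST(t.trafic_medium) AS medium WITH OFFSET AS pos
ORDER BY id, pos
-- Rows for id 123: (cpc, 2), (organic, 1), (cpc, 2)
-- Rows for id 456: (organic, 2), (organic, 2), (cpc, 1)
```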

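    【Comments】:

    • Thank you very much, that solved it perfectly! What does COUNT(1) do in this case?
    • Actually, your first version is exactly what I was looking for as the "final solution", but the new version does fit the question better!
    • Yes, I had that feeling, but was not sure. As for count(1): count() together with group by calculates the number of times an item occurs in a given set of items (in this case, in the array)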
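
    To illustrate the point about COUNT(1), here is a small sketch on a literal array (the column names count_one and count_star are made up for the example). COUNT(1) evaluates the constant 1 for every row in a group and counts those rows, so it gives the same number as COUNT(*).

```sql
#standardSQL
SELECT
  medium,
  COUNT(1) AS count_one,   -- the constant 1 is never NULL, so every row is counted
  COUNT(*) AS count_star   -- counts the rows in the group directly
FROM UNNEST(['cpc', 'organic', 'cpc']) AS medium
GROUP BY medium
ORDER BY medium
-- cpc     | 2 | 2
-- organic | 1 | 1
```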