【Question title】: Selecting values from a nested column based on a condition applied to another nested column in BigQuery
【Posted】: 2021-10-25 01:42:11
【Question】:

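How can I find the index of a "special" value in one nested column (for example, the index of that column's maximum value) and then use that index to select a value from another nested column?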

For example, consider a table with the following schema:

Field name            Type       Mode
id                    STRING     NULLABLE
username              STRING     NULLABLE
products              RECORD     NULLABLE
products.list         RECORD     REPEATED
products.list.item    STRING     NULLABLE
ordered               RECORD     NULLABLE
ordered.list          RECORD     REPEATED
ordered.list.item     INTEGER    NULLABLE
total_orders          STRING     NULLABLE
update_time           TIMESTAMP  NULLABLE
update_id             INTEGER    NULLABLE

The first few rows look like this:

Row  id    username    products.list.item  ordered.list.item  total_orders  update_time                     update_id
1    1234  a_turing    Apple               1                  3             2021-08-14 20:03:22.100846 UTC  121231
                       Orange              0
                       Pear                2
2    5678  g_hopper    Apple               0                  2             2021-08-15 09:36:48.220464 UTC  121232
                       Orange              2
                       Pear                0
3    1122  a_lovelace  Apple               0                  1             2021-08-15 13:59:03.441506 UTC  121233
                       Orange              1
                       Pear                0
4    3344  v_nabokov   Apple               1                  2             2021-08-17 17:34:53.415406 UTC  121234
                       Orange              0
                       Pear                1

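For the most recent order of each id, I want to select the most-ordered product, and to exclude orders that have no single most-ordered product (for example, when a customer ordered equal quantities of Apple, Orange and Pear).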

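The query I currently use is a chain of CTEs: a pair for each product type, plus one that adds an extra column holding the largest quantity of any product in each user's order (max_ordered). I then join the CTEs together on the id column: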

WITH RANKED_ORDERS AS( 
SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn
FROM mycompany.engagement.products_ordered),

LATEST_ORDERS AS(
SELECT * FROM RANKED_ORDERS WHERE rn = 1),

-- ---------------------- Apples Ordered -----------------------
APPLES_INDEXED AS(
SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
FROM LATEST_ORDERS
CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
WITH OFFSET as offset_nk
WHERE flattened_products.item in ('Apple')
ORDER BY offset_nk),

APPLES_ORDERED AS(
SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as apples_ordered 
FROM APPLES_INDEXED 
ORDER BY
update_time ASC),

-- ---------------------- Oranges Ordered ----------------------
ORANGES_INDEXED AS(
SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
FROM LATEST_ORDERS
CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
WITH OFFSET as offset_nk
WHERE flattened_products.item in ('Orange')
ORDER BY offset_nk),

ORANGES_ORDERED AS(
SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as oranges_ordered 
FROM ORANGES_INDEXED 
ORDER BY
update_time ASC),

-- ---------------------- Pears Ordered -----------------------
PEARS_INDEXED AS(
SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
FROM LATEST_ORDERS
CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
WITH OFFSET as offset_nk
WHERE flattened_products.item in ('Pear')
ORDER BY offset_nk),

PEARS_ORDERED AS(
SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item as pears_ordered 
FROM PEARS_INDEXED 
ORDER BY
update_time ASC),

-- --------------- Max Product Ordered per Order --------------
MAX_ORDERED AS(
SELECT
id, username, MAX(orders_per_username.item) as max_ordered, total_orders
FROM
LATEST_ORDERS, UNNEST(ordered.list) as orders_per_username
GROUP BY id, username, total_orders),

-- -------------------- Orders In Columns ---------------------
ORDERS_IN_COLUMNS AS(
SELECT APPLES_ORDERED.username, APPLES_ORDERED.update_time, APPLES_ORDERED.apples_ordered,
ORANGES_ORDERED.oranges_ordered, PEARS_ORDERED.pears_ordered, MAX_ORDERED.max_ordered
FROM APPLES_ORDERED
LEFT JOIN ORANGES_ORDERED ON ORANGES_ORDERED.id = APPLES_ORDERED.id
LEFT JOIN PEARS_ORDERED ON PEARS_ORDERED.id = APPLES_ORDERED.id
LEFT JOIN MAX_ORDERED ON MAX_ORDERED.id = APPLES_ORDERED.id),

-- ------- Orders with a most ordered product -----------------
NO_CONFLICTS AS(
SELECT * FROM ORDERS_IN_COLUMNS
WHERE
max_ordered > 0 AND
(
    (apples_ordered not in (oranges_ordered, pears_ordered) AND apples_ordered = max_ordered)
OR
    (oranges_ordered not in (apples_ordered, pears_ordered) AND oranges_ordered = max_ordered)
OR
    (pears_ordered not in (apples_ordered, oranges_ordered) AND pears_ordered = max_ordered)
)
)

SELECT * FROM NO_CONFLICTS

This returns the following table:

Row  username    update_time                     apples_ordered  oranges_ordered  pears_ordered  max_ordered
1    a_turing    2021-08-14 20:03:22.100846 UTC  1               0                2              2
2    g_hopper    2021-08-15 09:36:48.220464 UTC  0               2                0              2
3    a_lovelace  2021-08-15 13:59:03.441506 UTC  0               1                0              1

This is what I expect.
However, I cannot figure out how to simply return a table that looks like this instead:

Row  username    update_time                     max_product_ordered
1    a_turing    2021-08-14 20:03:22.100846 UTC  Pear
2    g_hopper    2021-08-15 09:36:48.220464 UTC  Orange
3    a_lovelace  2021-08-15 13:59:03.441506 UTC  Orange

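I am also fairly sure that, although this query basically works (I end up post-processing in Python to get to the last step), it is probably inefficient because of its extensive use of common table expressions.
Is there a more efficient way to query my BigQuery table than what I have written, or do I need to restructure the table entirely to speed things up? Running this query on a table with about 10,000 rows and 12 columns currently takes about 10 seconds, and I believe the slowness comes from the multiple CTEs.
I have been trying to improve my query for the past two weeks without much progress. Any help is sincerely appreciated!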

【Comments on the question】:

    Tags: sql google-bigquery common-table-expression


    【Solution 1】:

    Consider the following approach:

    with latest_orders as (
      select * from `mycompany.engagement.products_ordered`
      where true 
      qualify 1 = row_number() over(partition by id order by update_time desc)
    ), qualified_items as (
      select *, 
        array(
          select offset from t.ordered.list with offset 
          where true 
          qualify 1 = rank() over(order by item desc) 
        ) items
      from latest_orders t
    )
    select id, username, update_time,
      products.list[offset(items[offset(0)])] as max_product_ordered,
    from qualified_items
    where array_length(items) = 1    
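
    To try this without access to the real table, here is a self-contained sketch with the four sample rows from the question inlined as a CTE. The CTE name products_ordered and the STRUCT/ARRAY literals are assumptions that mirror the posted schema; total_orders and update_id are left out because the query does not use them. The only other change is a trailing .item on the last expression, so the result is the product name as a string rather than the whole struct.

```sql
-- Sample rows from the question, inlined (an assumption standing in for the real table).
with products_ordered as (
  select '1234' as id, 'a_turing' as username,
    struct([struct('Apple' as item), struct('Orange' as item), struct('Pear' as item)] as list) as products,
    struct([struct(1 as item), struct(0 as item), struct(2 as item)] as list) as ordered,
    timestamp '2021-08-14 20:03:22.100846+00' as update_time
  union all
  select '5678', 'g_hopper',
    struct([struct('Apple' as item), struct('Orange' as item), struct('Pear' as item)] as list),
    struct([struct(0 as item), struct(2 as item), struct(0 as item)] as list),
    timestamp '2021-08-15 09:36:48.220464+00'
  union all
  select '1122', 'a_lovelace',
    struct([struct('Apple' as item), struct('Orange' as item), struct('Pear' as item)] as list),
    struct([struct(0 as item), struct(1 as item), struct(0 as item)] as list),
    timestamp '2021-08-15 13:59:03.441506+00'
  union all
  select '3344', 'v_nabokov',
    struct([struct('Apple' as item), struct('Orange' as item), struct('Pear' as item)] as list),
    struct([struct(1 as item), struct(0 as item), struct(1 as item)] as list),
    timestamp '2021-08-17 17:34:53.415406+00'
), latest_orders as (
  -- Keep only the most recent row per id.
  select * from products_ordered
  where true
  qualify 1 = row_number() over(partition by id order by update_time desc)
), qualified_items as (
  -- For each row, collect the offsets of the largest quantity; ties yield several offsets.
  select *,
    array(
      select offset from t.ordered.list with offset
      where true
      qualify 1 = rank() over(order by item desc)
    ) as items
  from latest_orders t
)
-- Keep rows with exactly one top offset and look up the product at that offset.
select id, username, update_time,
  products.list[offset(items[offset(0)])].item as max_product_ordered
from qualified_items
where array_length(items) = 1
order by update_time
```

    The array_length(items) = 1 filter is what drops tied orders: v_nabokov has Apple and Pear both at 1, so items holds two offsets and the row is excluded.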
    

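    Applied to the sample data in your question, it returns Pear for a_turing, Orange for g_hopper and Orange for a_lovelace; v_nabokov is excluded because Apple and Pear are tied.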
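
    Your original query also required max_ordered > 0, which the query above does not check. If that matters, one possible variant compares each quantity with the row's maximum in a correlated subquery instead of using rank(). This is only a sketch on simplified data: the sample CTE uses plain arrays instead of the nested list.item records, and the zero_order row is a hypothetical case added to show the effect of the filter.

```sql
with sample as (
  select 'a_turing' as username, ['Apple', 'Orange', 'Pear'] as products, [1, 0, 2] as ordered
  union all
  select 'v_nabokov', ['Apple', 'Orange', 'Pear'], [1, 0, 1]
  union all
  -- Hypothetical row: a single product ordered zero times.
  select 'zero_order', ['Apple'], [0]
), top_products as (
  select username,
    array(
      -- Product names whose quantity is positive and equal to the row's maximum.
      select products[offset(pos)]
      from unnest(ordered) as qty with offset as pos
      where qty > 0
        and qty = (select max(q) from unnest(ordered) as q)
    ) as candidates
  from sample
)
-- Exactly one candidate means a single, non-zero most-ordered product.
select username, candidates[offset(0)] as max_product_ordered
from top_products
where array_length(candidates) = 1
```

    On this sample only a_turing (Pear) remains: v_nabokov is dropped for the tie and zero_order for having no positive quantity.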

    【Comments】:
