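【Posted】: 2021-10-25 01:42:11
【Question】:
How can I use the index of a "special" value in a nested column (for example, the index of the maximum value in that column) to select a value from another nested column?
For example, consider a table with the following schema: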
| Field name | Type | Mode |
|---|---|---|
| id | STRING | NULLABLE |
| username | STRING | NULLABLE |
| products | RECORD | NULLABLE |
| products.list | RECORD | REPEATED |
| products.list.item | STRING | NULLABLE |
| ordered | RECORD | NULLABLE |
| ordered.list | RECORD | REPEATED |
| ordered.list.item | INTEGER | NULLABLE |
| total_orders | STRING | NULLABLE |
| update_time | TIMESTAMP | NULLABLE |
| update_id | INTEGER | NULLABLE |
The first few rows look like this:
| Row | id | username | products.list.item | ordered.list.item | total_orders | update_time | update_id |
|---|---|---|---|---|---|---|---|
| 1 | 1234 | a_turing | Apple | 1 | 3 | 2021-08-14 20:03:22.100846 UTC | 121231 |
| | | | Orange | 0 | | | |
| | | | Pear | 2 | | | |
| 2 | 5678 | g_hopper | Apple | 0 | 2 | 2021-08-15 09:36:48.220464 UTC | 121232 |
| | | | Orange | 2 | | | |
| | | | Pear | 0 | | | |
| 3 | 1122 | a_lovelace | Apple | 0 | 1 | 2021-08-15 13:59:03.441506 UTC | 121233 |
| | | | Orange | 1 | | | |
| | | | Pear | 0 | | | |
| 4 | 3344 | v_nabokov | Apple | 1 | 2 | 2021-08-17 17:34:53.415406 UTC | 121234 |
| | | | Orange | 0 | | | |
| | | | Pear | 1 | | | |
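The two nested columns line up by position: the n-th entry of products.list belongs with the n-th entry of ordered.list. A minimal sketch of that pairing, using a literal stand-in for row 1 only (the CTE name one_row and the aliases p and pos are illustrative and not part of the real table):

```sql
-- Literal stand-in for row 1 (a_turing); the two arrays mirror
-- products.list and ordered.list from the schema above.
WITH one_row AS (
  SELECT
    STRUCT([STRUCT('Apple' AS item),
            STRUCT('Orange' AS item),
            STRUCT('Pear' AS item)] AS list) AS products,
    STRUCT([STRUCT(1 AS item),
            STRUCT(0 AS item),
            STRUCT(2 AS item)] AS list) AS ordered
)
SELECT
  p.item AS product,                          -- product name
  pos,                                        -- zero-based position in products.list
  ordered.list[OFFSET(pos)].item AS quantity  -- value at the same position in ordered.list
FROM one_row
CROSS JOIN UNNEST(one_row.products.list) AS p WITH OFFSET AS pos
ORDER BY pos;
```

For this row the pairs are (Apple, 0, 1), (Orange, 1, 0) and (Pear, 2, 2). The lookup assumes both arrays always have the same length; OFFSET raises an error if pos is out of range.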
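For each id's most recent order, I want to select the most-ordered product, and to exclude orders that have no single most-ordered product (for example, when a customer ordered the same quantity of Apple, Orange and Pear).
The query I currently use is a chain of CTEs, a pair for each product type, plus one that adds an extra column holding the maximum quantity of any product ordered per user (max_ordered). I then join the CTEs together on the id column: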
WITH RANKED_ORDERS AS (
  SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn
  FROM mycompany.engagement.products_ordered
),
LATEST_ORDERS AS (
  SELECT * FROM RANKED_ORDERS WHERE rn = 1
),
-- ---------------------- Apples Ordered -----------------------
APPLES_INDEXED AS (
  SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
  FROM LATEST_ORDERS
  CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
    WITH OFFSET AS offset_nk
  WHERE flattened_products.item IN ('Apple')
  ORDER BY offset_nk
),
APPLES_ORDERED AS (
  SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item AS apples_ordered
  FROM APPLES_INDEXED
  ORDER BY update_time ASC
),
-- ---------------------- Oranges Ordered ----------------------
ORANGES_INDEXED AS (
  SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
  FROM LATEST_ORDERS
  CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
    WITH OFFSET AS offset_nk
  WHERE flattened_products.item IN ('Orange')
  ORDER BY offset_nk
),
ORANGES_ORDERED AS (
  SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item AS oranges_ordered
  FROM ORANGES_INDEXED
  ORDER BY update_time ASC
),
-- ---------------------- Pears Ordered -----------------------
PEARS_INDEXED AS (
  SELECT id, username, ordered, flattened_products, offset_nk, update_time, rn
  FROM LATEST_ORDERS
  CROSS JOIN UNNEST(LATEST_ORDERS.products.list) AS flattened_products
    WITH OFFSET AS offset_nk
  WHERE flattened_products.item IN ('Pear')
  ORDER BY offset_nk
),
PEARS_ORDERED AS (
  SELECT id, username, update_time, ordered.list[OFFSET(offset_nk)].item AS pears_ordered
  FROM PEARS_INDEXED
  ORDER BY update_time ASC
),
-- --------------- Max Product Ordered per Order --------------
MAX_ORDERED AS (
  SELECT
    id, username, MAX(orders_per_username.item) AS max_ordered, total_orders
  FROM
    LATEST_ORDERS, UNNEST(ordered.list) AS orders_per_username
  GROUP BY id, username, total_orders
),
-- -------------------- Orders In Columns ---------------------
ORDERS_IN_COLUMNS AS (
  SELECT APPLES_ORDERED.username, APPLES_ORDERED.update_time, APPLES_ORDERED.apples_ordered,
    ORANGES_ORDERED.oranges_ordered, PEARS_ORDERED.pears_ordered, MAX_ORDERED.max_ordered
  FROM APPLES_ORDERED
  LEFT JOIN ORANGES_ORDERED ON ORANGES_ORDERED.id = APPLES_ORDERED.id
  LEFT JOIN PEARS_ORDERED ON PEARS_ORDERED.id = APPLES_ORDERED.id
  LEFT JOIN MAX_ORDERED ON MAX_ORDERED.id = APPLES_ORDERED.id
),
-- ------- Orders with a most ordered product -----------------
NO_CONFLICTS AS (
  SELECT * FROM ORDERS_IN_COLUMNS
  WHERE
    max_ordered > 0 AND
    (
      (apples_ordered NOT IN (oranges_ordered, pears_ordered) AND apples_ordered = max_ordered)
      OR
      (oranges_ordered NOT IN (apples_ordered, pears_ordered) AND oranges_ordered = max_ordered)
      OR
      (pears_ordered NOT IN (apples_ordered, oranges_ordered) AND pears_ordered = max_ordered)
    )
)
SELECT * FROM NO_CONFLICTS
This returns the following table:
| Row | username | update_time | apples_ordered | oranges_ordered | pears_ordered | max_ordered |
|---|---|---|---|---|---|---|
| 1 | a_turing | 2021-08-14 20:03:22.100846 UTC | 1 | 0 | 2 | 2 |
| 2 | g_hopper | 2021-08-15 09:36:48.220464 UTC | 0 | 2 | 0 | 2 |
| 3 | a_lovelace | 2021-08-15 13:59:03.441506 UTC | 0 | 1 | 0 | 1 |
This is as expected.
However, I can't figure out how to simply return a table that looks like this:
| Row | username | update_time | max_product_ordered |
|---|---|---|---|
| 1 | a_turing | 2021-08-14 20:03:22.100846 UTC | Pear |
| 2 | g_hopper | 2021-08-15 09:36:48.220464 UTC | Orange |
| 3 | a_lovelace | 2021-08-15 13:59:03.441506 UTC | Orange |
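One direction that might produce this shape without a CTE per product is to pair the two arrays by position inside a correlated subquery, keep only the top two quantities, and require the first to be strictly larger than the second. The following is only a sketch under assumptions, not a verified solution for the real table: sample_orders is an inline stand-in built from the four rows shown earlier, the names latest_orders, top2, product, quantity and pos are made up for illustration, the two arrays are assumed to have equal length, and quantities are assumed to be non-negative.

```sql
-- Inline stand-in for the real table, built from the four sample rows.
WITH sample_orders AS (
  SELECT
    id,
    username,
    -- Rebuild the nested shape products.list[].item / ordered.list[].item.
    STRUCT(ARRAY(
      SELECT AS STRUCT item
      FROM UNNEST(['Apple', 'Orange', 'Pear']) AS item WITH OFFSET AS pos
      ORDER BY pos) AS list) AS products,
    STRUCT(ARRAY(
      SELECT AS STRUCT item
      FROM UNNEST(quantities) AS item WITH OFFSET AS pos
      ORDER BY pos) AS list) AS ordered,
    update_time
  FROM UNNEST(ARRAY<STRUCT<id STRING, username STRING,
                           quantities ARRAY<INT64>, update_time TIMESTAMP>>[
    ('1234', 'a_turing',   [1, 0, 2], TIMESTAMP '2021-08-14 20:03:22.100846 UTC'),
    ('5678', 'g_hopper',   [0, 2, 0], TIMESTAMP '2021-08-15 09:36:48.220464 UTC'),
    ('1122', 'a_lovelace', [0, 1, 0], TIMESTAMP '2021-08-15 13:59:03.441506 UTC'),
    ('3344', 'v_nabokov',  [1, 0, 1], TIMESTAMP '2021-08-17 17:34:53.415406 UTC')
  ])
),
-- Most recent order per id, same idea as RANKED_ORDERS / LATEST_ORDERS.
latest_orders AS (
  SELECT * EXCEPT (rn)
  FROM (
    SELECT *, ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn
    FROM sample_orders
  )
  WHERE rn = 1
),
-- For each order, the two (product, quantity) pairs with the largest quantity.
with_top2 AS (
  SELECT
    username,
    update_time,
    ARRAY(
      SELECT AS STRUCT
        p.item AS product,
        ordered.list[OFFSET(pos)].item AS quantity
      FROM UNNEST(products.list) AS p WITH OFFSET AS pos
      ORDER BY quantity DESC
      LIMIT 2
    ) AS top2
  FROM latest_orders
)
SELECT
  username,
  update_time,
  top2[SAFE_OFFSET(0)].product AS max_product_ordered
FROM with_top2
WHERE top2[SAFE_OFFSET(0)].quantity > 0
  -- Strictly greater than the runner-up, so ties for first place are dropped.
  AND top2[SAFE_OFFSET(0)].quantity > IFNULL(top2[SAFE_OFFSET(1)].quantity, 0)
ORDER BY update_time;
```

On the four sample rows this keeps a_turing (Pear), g_hopper (Orange) and a_lovelace (Orange), and drops v_nabokov because Apple and Pear tie at 1. To try it on the real table, sample_orders would be replaced by mycompany.engagement.products_ordered.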
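I am also fairly sure that, although this query basically works (I end up post-processing in Python to get to the last step), it is probably inefficient because of its heavy use of common table expressions.
Is there a more efficient way to query my BigQuery table than what I have written, or do I need to restructure the table completely to speed things up? Running this query on a table with about 10,000 rows and 12 columns currently takes about 10 seconds, and I believe the slowness comes from the multiple CTEs.
I have been trying to improve this query for the past two weeks without much progress. Any help is sincerely appreciated!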
【Comments】:
Tags: sql google-bigquery common-table-expression