Posted: 2017-03-02 16:22:23
Question:
I'm trying to run a cohort analysis on a very large table. My test table has roughly 30M rows (more than double that in production). The query fails in BigQuery with "Resources exceeded...", and it runs at billing tier 18 (tier 1 is $5, so it's a $90 query!)
The query:
with cohort_active_user_count as (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
count(distinct `BQ_TABLE`.bot_user_id) as count,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
group by created_at, bot_id
)
select created_at, period as period,
active_users, retained_users, retention, bot_id
from (
select
DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(`BQ_TABLE`.created_at, '-05:00'), DAY) as period,
max(cohort_size.count) as active_users, -- all equal in group
count(distinct future_message.bot_user_id) as retained_users,
count(distinct future_message.bot_user_id) / max(cohort_size.count) as retention,
`BQ_TABLE`.bot_id as bot_id
from `BQ_TABLE`
left join `BQ_TABLE` as future_message on
`BQ_TABLE`.bot_user_id = future_message.bot_user_id
and `BQ_TABLE`.created_at < future_message.created_at
and TIMESTAMP_ADD(`BQ_TABLE`.created_at, interval 720 HOUR) >= future_message.created_at
and `BQ_TABLE`.bot_id = future_message.bot_id
left join cohort_active_user_count as cohort_size on
DATE(`BQ_TABLE`.created_at, '-05:00') = cohort_size.created_at
and `BQ_TABLE`.bot_id = cohort_size.bot_id
group by 1, 2, bot_id) t
where period is not null
and bot_id = 80
order by created_at, period, bot_id
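One thing that stands out in the query above is that the `bot_id = 80` filter is applied only after the self-join and aggregation. A minimal sketch of pushing that filter down before the join (the CTE name `messages` is mine, and this assumes the goal is still a single bot's cohorts):

```sql
with messages as (
  -- Filter to one bot and convert timestamps to dates once, up front,
  -- so both sides of the self-join scan far fewer rows.
  select
    bot_user_id,
    DATE(created_at, '-05:00') as created_date
  from `BQ_TABLE`
  where bot_id = 80
)
select
  m.created_date as created_at,
  DATE_DIFF(f.created_date, m.created_date, DAY) as period,
  count(distinct f.bot_user_id) as retained_users
from messages as m
left join messages as f
  on m.bot_user_id = f.bot_user_id
  and m.created_date < f.created_date
  and f.created_date <= DATE_ADD(m.created_date, interval 30 DAY)
group by 1, 2
```

Whether this preserves the exact semantics would need verification: the original compares raw timestamps within a 720-hour window, while this sketch compares dates within 30 days.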
Here is the desired output:
From my understanding of BigQuery, the joins are a major performance hit, since every BigQuery node has to process them. The table is partitioned by day, which I'm not yet using in this query, but I know the query still needs further optimization.
How can I optimize this query, or eliminate the joins, so that BigQuery can parallelize it more efficiently?
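Since the table is day-partitioned, the scan itself can be limited before any join runs. For an ingestion-time-partitioned table, BigQuery exposes the `_PARTITIONTIME` pseudo-column; a sketch (the date range here is hypothetical):

```sql
select *
from `BQ_TABLE`
where _PARTITIONTIME >= TIMESTAMP('2017-01-01')
  and _PARTITIONTIME <  TIMESTAMP('2017-03-01')
```

Adding a filter like this to both sides of the self-join lets BigQuery prune partitions instead of scanning all 30M rows.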
Comments:
- Do you have a job ID for the failed query? A BigQuery engineer might be able to suggest how to optimize it.
Tags: mysql sql postgresql google-bigquery bigdata