【Question Title】: Optimizing cohort analysis on Google BigQuery
【Posted】: 2017-03-02 16:22:23
【Question】:

I'm trying to run a cohort analysis on a very large table. I have a test table of about 30M rows (more than double that in production). The query fails in BigQuery with "Resources exceeded...", and it turns out to be a tier 18 query (tier 1 is $5, so this is a $90 query!).

The query:

with cohort_active_user_count as (
  select 
    DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
    count(distinct `BQ_TABLE`.bot_user_id) as count,
    `BQ_TABLE`.bot_id as bot_id
  from `BQ_TABLE`
  group by created_at, bot_id
)

select created_at, period as period,
  active_users, retained_users, retention, bot_id
from (
  select 
    DATE(`BQ_TABLE`.created_at, '-05:00') as created_at,
    DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(`BQ_TABLE`.created_at, '-05:00'), DAY) as period,
    max(cohort_size.count) as active_users, -- all equal in group
    count(distinct future_message.bot_user_id) as retained_users,
    count(distinct future_message.bot_user_id) / max(cohort_size.count) as retention,
    `BQ_TABLE`.bot_id as bot_id
  from `BQ_TABLE`
  left join `BQ_TABLE` as future_message on
    `BQ_TABLE`.bot_user_id = future_message.bot_user_id
    and `BQ_TABLE`.created_at < future_message.created_at
    and TIMESTAMP_ADD(`BQ_TABLE`.created_at, interval 720 HOUR) >= future_message.created_at
    and `BQ_TABLE`.bot_id = future_message.bot_id 
  left join cohort_active_user_count as cohort_size on 
    DATE(`BQ_TABLE`.created_at, '-05:00') = cohort_size.created_at
    and `BQ_TABLE`.bot_id = cohort_size.bot_id 
  group by 1, 2, bot_id) t
where period is not null
and bot_id = 80
order by created_at, period, bot_id

Here is the desired output:

From my understanding of BigQuery, the joins take a major performance hit because every BigQuery node needs to process them. The table is partitioned by day, which I'm not yet using in this query, but I know it still needs further optimization.

How can I optimize this query, or eliminate the joins, so that BigQuery can parallelize the work more efficiently?
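For context, using the partitioning would mean adding a filter roughly like the one below to the scans of the table. This is only a sketch of what I have in mind: it assumes the table is ingestion-time partitioned (so the _PARTITIONTIME pseudo-column is available), and the date range is just a placeholder.

select created_at, bot_id, bot_user_id
from `BQ_TABLE`
where _PARTITIONTIME >= TIMESTAMP('2017-01-01')  -- placeholder range covering the cohorts of interest
  and _PARTITIONTIME <  TIMESTAMP('2017-03-01')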

【Question Comments】:

  • Do you have the job ID of the failed query? A BigQuery engineer may be able to suggest how to optimize it.

Tags: mysql sql postgresql google-bigquery bigdata


【Solution 1】:

Step #1

Try the version below.
It moves the JOIN to cohort_active_user_count out of the inner SELECT, since I think that is one of the main reasons this query is so expensive. Note that it uses a plain JOIN rather than a LEFT JOIN here, because LEFT isn't needed.

Please test it and let us know the result.

WITH cohort_active_user_count AS (
  SELECT 
    DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
    COUNT(DISTINCT BQ_TABLE.bot_user_id) AS COUNT,
    BQ_TABLE.bot_id AS bot_id
  FROM BQ_TABLE
  GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
  cohort_size.count AS active_users, retained_users, 
  retained_users / cohort_size.count AS retention, t.bot_id
FROM (
  SELECT 
    DATE(BQ_TABLE.created_at, '-05:00') AS created_at,
    DATE_DIFF(DATE(future_message.created_at, '-05:00'), DATE(BQ_TABLE.created_at, '-05:00'), DAY) AS period,
    COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
    BQ_TABLE.bot_id AS bot_id
  FROM BQ_TABLE
  LEFT JOIN BQ_TABLE AS future_message 
    ON BQ_TABLE.bot_user_id = future_message.bot_user_id
    AND BQ_TABLE.created_at < future_message.created_at
    AND TIMESTAMP_ADD(BQ_TABLE.created_at, interval 720 HOUR) >= future_message.created_at
    AND BQ_TABLE.bot_id = future_message.bot_id 
  GROUP BY 1, 2, bot_id
  HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size 
  ON t.created_at = cohort_size.created_at
  AND t.bot_id = cohort_size.bot_id 
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id  

Step #2

The "further optimization" below is based on the assumption that your BQ_TABLE is raw data with multiple entries per day for the same user_id/bot_id, which greatly inflates the cost of the LEFT JOIN in the inner SELECT.
I suggest aggregating it first, as shown below. Besides drastically reducing the size of the JOIN, this also eliminates all the TIMESTAMP-to-DATE conversions on every joined row.

WITH BQ_TABLE_AGG AS (
  SELECT bot_id, bot_user_id, DATE(BQ_TABLE.created_at, '-05:00') AS created_at
  FROM BQ_TABLE
  GROUP BY 1, 2, 3
),
cohort_active_user_count AS (
  SELECT 
    created_at,
    COUNT(DISTINCT bot_user_id) AS COUNT,
    bot_id AS bot_id
  FROM BQ_TABLE_AGG
  GROUP BY created_at, bot_id
)
SELECT t.created_at, period AS period,
  cohort_size.count AS active_users, retained_users, 
  retained_users / cohort_size.count AS retention, t.bot_id
FROM (
  SELECT 
    BQ_TABLE_AGG.created_at AS created_at,
    DATE_DIFF(future_message.created_at, BQ_TABLE_AGG.created_at, DAY) AS period,
    COUNT(DISTINCT future_message.bot_user_id) AS retained_users,
    BQ_TABLE_AGG.bot_id AS bot_id
  FROM BQ_TABLE_AGG
  LEFT JOIN BQ_TABLE_AGG AS future_message 
    ON BQ_TABLE_AGG.bot_user_id = future_message.bot_user_id
    AND BQ_TABLE_AGG.created_at < future_message.created_at
    AND DATE_ADD(BQ_TABLE_AGG.created_at, INTERVAL 30 DAY) >= future_message.created_at
    AND BQ_TABLE_AGG.bot_id = future_message.bot_id 
  GROUP BY 1, 2, bot_id
  HAVING period IS NOT NULL
) t
JOIN cohort_active_user_count AS cohort_size 
  ON t.created_at = cohort_size.created_at
  AND t.bot_id = cohort_size.bot_id 
WHERE t.bot_id = 80
ORDER BY created_at, period, bot_id

【Comments】:

  • @mnort9 - any chance you got to try this?
  • This works great, thanks! I found that aggregating first was the way to go, but additionally removing the timestamp conversion cut the query from ~60 seconds to ~11 seconds. Once I take advantage of the partition time, it should be really fast.
【Solution 2】:

Why is this tagged MySQL?

In MySQL, I would change

max(cohort_size.count) as active_users, -- all equal in group

to

( SELECT max(count) FROM cohort_active_user_count WHERE ... ) as active_users,

and remove the JOIN to that table. If you don't, you risk inflating the COUNT(...) values!

Also move the division that computes retention into the outer query.

Having done that, you can also move the other JOIN into a subquery:

( SELECT count(distinct future_message.bot_user_id)
    FROM ... WHERE ... ) as retained_users,

I would have these indexes. Note that created_at needs to come last.

cohort_active_user_count:  INDEX(bot_id, created_at)
future_message:  INDEX(bot_id, bot_user_id, created_at)
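Putting those pieces together, the inner query would look roughly like the sketch below. This is only an illustration, not an exact rewrite: it assumes cohort_active_user_count exists as a real (indexable) table, fills in the "WHERE ..." placeholders with my best guess at the intended predicates, keeps retained_users as a grouped COUNT over the self-join rather than moving it into its own subquery, and drops the timezone conversion for brevity.

SELECT  DATE(m.created_at)                    AS created_at,
        DATEDIFF(f.created_at, m.created_at)  AS period,
        ( SELECT max(count)
            FROM cohort_active_user_count
            WHERE bot_id = m.bot_id
              AND created_at = DATE(m.created_at) ) AS active_users,
        COUNT(DISTINCT f.bot_user_id)         AS retained_users,
        m.bot_id                              AS bot_id
    FROM BQ_TABLE AS m
    JOIN BQ_TABLE AS f
         ON  f.bot_id = m.bot_id
        AND  f.bot_user_id = m.bot_user_id
        AND  f.created_at > m.created_at
        AND  f.created_at <= m.created_at + INTERVAL 30 DAY
    WHERE m.bot_id = 80
    GROUP BY 1, 2, m.bot_id;
-- retention = retained_users / active_users is then computed in an outer query, per the note above.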

【Comments】:

【Solution 3】:

If you'd rather not enable a higher billing tier given the cost, here are a couple of suggestions that may help reduce the CPU requirements:

  • Use INNER JOINs rather than LEFT JOINs if you can. INNER JOINs should generally be less CPU-intensive, but you won't get unmatched rows the way you would with LEFT JOINs.
  • Use APPROX_COUNT_DISTINCT(expr) instead of COUNT(DISTINCT expr). You won't get an exact count, but it is less CPU-intensive and may be "good enough" depending on your needs.

You could also consider manually breaking the query into computation stages, e.g. writing the result of the WITH clause to a table and then using it in a subsequent query (a rough sketch follows below). I don't know what the exact cost trade-offs are, though.
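For example, the cohort-size stage from the question could be computed on its own with the approximate count, saved to a table (e.g. by setting a destination table on the query job), and then joined against in the main query instead of being recomputed inside the WITH clause. A sketch only:

SELECT
  DATE(created_at, '-05:00')          AS created_at,
  bot_id,
  APPROX_COUNT_DISTINCT(bot_user_id)  AS active_users  -- approximate, but much cheaper than COUNT(DISTINCT ...)
FROM `BQ_TABLE`
GROUP BY 1, 2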

【Comments】:
