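【Posted】:2021-08-02 15:43:26
【Question】:
This is a follow-up to BigQuery - Compute 0 - 100 percentiles for multiple columns, over multiple groups, posted last year. The question concerns computing the 0 - 100 percentiles for multiple columns of a table. A reproducible example follows. The post looks long, but it is mostly the reproducible example plus output screenshots to help with the problem: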
with
raw_data as (
select 24997 as competitionId, 0.9167 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7778 as ft2Pct, 0.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8125 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.5625 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.6842 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7317 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8333 as ft2Pct, 0.5 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8000 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7500 as ft2Pct, null as ft3Pct, 1.0 as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.6944 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7500 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.9091 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.6667 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8261 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8108 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7895 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7727 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8333 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.6923 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.9268 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.7660 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8571 as ft2Pct, null as ft3Pct, 0.8333 as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8636 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8036 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.9000 as ft2Pct, null as ft3Pct, null as ftTechPct, null as ftFlagPct union all
select 24997 as competitionId, 0.8108 as ft2Pct, 1.0 as ft3Pct, null as ftTechPct, null as ftFlagPct
),
-- A) Positive Percentiles
-- A1) compute quantiles: will be saved in messy arrays
positive_pctile_arrays as (
  select
    competitionId
    ,approx_quantiles(ft2Pct, 10) as ft2Pct
    ,approx_quantiles(ft3Pct, 10) as ft3Pct
    ,approx_quantiles(ftTechPct, 10) as ftTechPct
    ,approx_quantiles(ftFlagPct, 10) as ftFlagPct
  from raw_data
  group by 1
),
-- A2) and unnest arrays
positive_pctiles as (
  select
    competitionId
    ,pctile
    ,ft2Pct
    ,ft3Pct
    ,ftTechPct
    ,ftFlagPct
  from positive_pctile_arrays as a
    ,a.ft2Pct with offset as pctile
    ,a.ft3Pct with offset as ft3PctPctile
    ,a.ftTechPct with offset as ftTechPctPctile
    ,a.ftFlagPct with offset as ftFlagPctPctile
  where
    pctile = ft3PctPctile and
    pctile = ftTechPctPctile and
    pctile = ftFlagPctPctile
)
-- select * from raw_data
select * from positive_pctile_arrays
-- select * from positive_pctiles
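A few comments:
- We group by competitionId because our full data has more than one competitionId, even though the example has only one.
- We want to compute the 0 - 100 percentiles for these values, but for brevity this example uses approx_quantiles(., 10) rather than approx_quantiles(., 100).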
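In our data, every value of ftFlagPct is null. As a result, in A1 (positive_pctile_arrays) the ftFlagPct column comes back empty (null) instead of holding an array.
So when we try to unnest these arrays in A2, it looks as if the where clause filters out every row. If you uncomment select * from positive_pctiles, the final output table is empty.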
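A minimal sketch of what appears to be happening, using a hypothetical one-row table t (the names filled and empty_arr are not from the original query). APPROX_QUANTILES is documented to return NULL when every input value is NULL, and a comma (cross) join against a NULL array contributes zero rows, so the rows are most likely lost at the join itself, before the where clause is ever evaluated:

```sql
-- Hypothetical one-row table: one populated array and one NULL array.
with t as (
  select
    [0.1, 0.2, 0.3] as filled
    ,cast(null as array<float64>) as empty_arr
)
select
  -- unnesting only the populated array yields one row per element: 3
  (select count(*) from t, t.filled) as rows_filled_only
  -- adding the NULL array to the comma join removes every row: 0
  ,(select count(*) from t, t.filled, t.empty_arr) as rows_with_null_array
```

If this is indeed the cause, relaxing the where clause alone would not bring the rows back; the join against the NULL array would have to change.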
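If we comment out ftFlagPct in both A1 and A2, we get, for the most part, the unnested table we want.
The output we want is that same table with one additional ftFlagPct column containing only nulls. It seems the query needs to detect whether the ftFlagPct array column in positive_pctile_arrays is empty/null, and then handle it with some kind of left join?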
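One way the left-join idea could look, as a sketch rather than a tested answer. The inline positive_pctile_arrays below is a shortened, hypothetical stand-in with made-up values, and the ...Value aliases are introduced here only to avoid clashing with the array column names. It assumes ft2Pct is never NULL for a group, since it serves as the spine that supplies pctile:

```sql
-- Hypothetical stand-in for positive_pctile_arrays (3 quantiles instead of 11).
with positive_pctile_arrays as (
  select
    24997 as competitionId
    ,[0.56, 0.75, 0.93] as ft2Pct
    ,[0.0, 1.0, 1.0] as ft3Pct
    ,cast(null as array<float64>) as ftFlagPct
)
select
  a.competitionId
  ,pctile
  ,ft2PctValue as ft2Pct
  ,ft3PctValue as ft3Pct
  ,ftFlagPctValue as ftFlagPct  -- stays NULL when the array is NULL
from positive_pctile_arrays as a
-- ft2Pct drives the row count; assumed to be non-NULL for every group
cross join unnest(a.ft2Pct) as ft2PctValue with offset as pctile
-- left joins keep the ft2Pct rows even when the other array is NULL or empty
left join unnest(a.ft3Pct) as ft3PctValue with offset as ft3PctPctile
  on pctile = ft3PctPctile
left join unnest(a.ftFlagPct) as ftFlagPctValue with offset as ftFlagPctPctile
  on pctile = ftFlagPctPctile
order by pctile
```

Moving the offset comparisons from the where clause into the on clauses is what lets unmatched rows survive. If ft2Pct could also be all-null for some group, a spine built from generate_array(0, 10) might be a safer driver than any single data column.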
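Edit: we are working on a solution in which we detect the empty arrays and replace them with an array of dummy values (for example, all 999999), then replace 999999 with null in the final output. If we get this working, we will post an answer.
【Discussion】:
Tags: google-bigquery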
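The dummy-value idea from the edit could be sketched roughly as follows. This is an illustration under assumptions, not the author's eventual answer: the inline table is a hypothetical stand-in, the padded CTE and the ...Value aliases are invented for the example, and the padding array borrows its length from ft2Pct, which is assumed to be non-NULL:

```sql
-- Hypothetical stand-in for positive_pctile_arrays.
with positive_pctile_arrays as (
  select
    24997 as competitionId
    ,[0.56, 0.75, 0.93] as ft2Pct
    ,cast(null as array<float64>) as ftFlagPct
),
padded as (
  select
    competitionId
    ,ft2Pct
    -- swap a NULL array for a sentinel array of the same length as ft2Pct
    ,ifnull(ftFlagPct, array(select 999999.0 from unnest(ft2Pct))) as ftFlagPct
  from positive_pctile_arrays
)
select
  p.competitionId
  ,pctile
  ,ft2PctValue as ft2Pct
  -- turn the sentinel back into NULL in the final output
  ,nullif(ftFlagPctValue, 999999.0) as ftFlagPct
from padded as p
  ,p.ft2Pct as ft2PctValue with offset as pctile
  ,p.ftFlagPct as ftFlagPctValue with offset as ftFlagPctPctile
where pctile = ftFlagPctPctile
order by pctile
```

The sentinel only works if it can never be a genuine value. That holds here because the columns are percentages between 0 and 1, but it would need rechecking for other columns.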