【Question title】: Presto equivalent for Redshift's PERCENTILE_DISC
【Posted】: 2018-07-23 11:37:24
【Question】:

Given the following query in Redshift:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY join_time) over(partition by 
trunc(joinstart_ev_timestamp))/1000 as mini,
median(join_time) over(partition by trunc(joinstart_ev_timestamp))/1000 as jt,
product_name as product,
endpoint as endpoint
from qe_datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

I need to convert the query above to the corresponding Presto syntax. The Presto query I wrote is:

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,
approx_percentile(cast(join_time as double),0.50) over (partition by 
cast(joinstart_ev_timestamp as date)) /1000 as jt,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotoTest%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

Everything here works except for the following lines, which raise an error:

PERCENTILE_DISC( WITHIN GROUP (ORDER BY cast(join_time as double)) 
    over(partition by cast(joinstart_ev_timestamp as date) )/1000 as mini,

What is the corresponding Presto syntax?

【Question discussion】:

    Tags: mysql amazon-redshift presto amazon-redshift-spectrum


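    【Answer 1】:

    If Presto supported nested window functions, you could use NTH_VALUE together with p*COUNT(*) OVER (PARTITION BY ...) to find the offset corresponding to the p-th percentile in the window. Since Presto does not support this, you need to join to a subquery that counts the records in each window instead: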

    SELECT
      my_table.window_column,
      /* Replace :p with the desired percentile (in your case, 0.02) */
      NTH_VALUE(my_table.ordered_column, :p*subquery.records_in_window)
        OVER (PARTITION BY my_table.window_column ORDER BY my_table.ordered_column ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
    FROM my_table
    JOIN (
      SELECT
        window_column,
        COUNT(*) AS records_in_window
      FROM my_table
      GROUP BY window_column
    ) subquery ON subquery.window_column = my_table.window_column
    

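    The above is conceptually close but fails, because :p*subquery.records_in_window is a floating-point number and the offset needs to be an integer. You have a few options for handling this. For example, if you are finding the median, simply rounding to the nearest integer works. If you are finding the 2nd percentile, rounding will not work, because it will often give you 0 and offsets start at 1. In that case, rounding up to the next integer (taking the ceiling) is probably the better choice.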
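    As a rough sketch of the ceiling approach described above, the offset can be rounded up and cast to an integer before it is passed to NTH_VALUE. The names my_table, window_column and ordered_column are the same placeholders as in the query above, 0.02 stands in for :p, and the GREATEST(..., 1) guard and the percentile_disc_02 alias are my own additions. It also assumes ordered_column contains no NULLs, since COUNT(*) would count them.

    ```sql
    -- Sketch only: placeholder table and column names, 0.02 stands in for :p.
    SELECT DISTINCT
      t.window_column,
      NTH_VALUE(
        t.ordered_column,
        -- Round the fractional offset up; GREATEST keeps it from ever being 0.
        GREATEST(CAST(CEIL(0.02 * c.records_in_window) AS BIGINT), 1)
      ) OVER (
        PARTITION BY t.window_column
        ORDER BY t.ordered_column
        ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      ) AS percentile_disc_02
    FROM my_table t
    JOIN (
      -- Number of rows in each window, used to turn the percentile into an offset.
      SELECT window_column, COUNT(*) AS records_in_window
      FROM my_table
      GROUP BY window_column
    ) c ON c.window_column = t.window_column
    ```

    Taking the ceiling matches the usual definition of a discrete percentile: the smallest value whose cumulative share of the rows is at least p.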

    【Discussion】:

      【Answer 2】:

      I did some research on computing medians in Presto and found a solution that works for me:

      For example, I have a join table A_join_B with columns A_id and B_id.

      I want to find the median number of A's related to a single B:

      SELECT APPROX_PERCENTILE(count, 0.5) FROM ( SELECT COUNT(*) AS count, B_id FROM A_join_B GROUP BY B_id ) AS counts;
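      Applied to the query in the question, the same function might look roughly like the sketch below. This is an assumption on my part rather than a tested translation: it uses GROUP BY instead of window functions, leaves out product_name, endpoint and most of the filters for brevity, and returns approximate values, so the results can differ slightly from Redshift's exact PERCENTILE_DISC and MEDIAN.

      ```sql
      -- Sketch only: approx_percentile is approximate, not an exact PERCENTILE_DISC.
      SELECT
        CAST(joinstart_ev_timestamp AS date) AS session_date,
        -- 2nd percentile and median of join_time per day, converted to seconds.
        approx_percentile(CAST(join_time AS double), 0.02) / 1000 AS mini,
        approx_percentile(CAST(join_time AS double), 0.5) / 1000 AS jt
      FROM datawarehouse.join_session_fact
      WHERE CAST(joinstart_ev_timestamp AS date)
            BETWEEN date '2018-01-18' AND date '2018-01-30'
        AND join_time > 0 AND join_time <= 600000
      GROUP BY CAST(joinstart_ev_timestamp AS date)
      ```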

      【Discussion】:
