需要将查询从 Redshift 转换为 Presto答案

【问题标题】：Need to convert a query from Redshift to Presto需要将查询从 Redshift 转换为 Presto
【发布时间】：2018-07-22 05:29:58
【问题描述】：

给出以下用 AWS Redhift 编写的查询：

SELECT session_date,'min' as stats,mini as value,product,endpoint
from 
(select 
distinct trunc(joinstart_ev_timestamp) as session_date, 
PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY join_time) over(partition by 
trunc(joinstart_ev_timestamp))/1000 as mini,
PERCENTILE_DISC(0.25) WITHIN GROUP (ORDER BY join_time) OVER (partition by         
trunc(joinstart_ev_timestamp))/1000 as first_quartile,
median(join_time) over(partition by trunc(joinstart_ev_timestamp))/1000 as 
jt,
PERCENTILE_DISC(0.75) WITHIN GROUP (ORDER BY join_time) OVER (partition by 
trunc(joinstart_ev_timestamp))/1000 as third_quartile,
PERCENTILE_DISC(0.98) WITHIN GROUP (ORDER BY join_time) over(partition by 
trunc(joinstart_ev_timestamp))/1000 as maxi,
product_name as product,
endpoint as endpoint
from qe_datawarehouse.join_session_fact
where  
trunc(joinstart_ev_timestamp)  between '2018-01-18' and '2018-01-30'
and lower(product_name) LIKE 'gotowebinar%' 
and join_time>0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0  
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 
'V2');

我需要将其转换为相应的 Presto 查询。

我在下面尝试过：

当我在 Presto 中的 Hive 查询上运行时，我收到错误： " 查询失败 (#20180212_044343_00014_jb834): line 5:36: missing 'BY' at '('"
我知道我必须以某种方式在 presto 中使用“approx_percentile()”，但实际上无法使用它。

注意：在 Redshift 查询中，每一列都被视为字符串，但在 Presto 中，数据类型如下所示：

create external table if not exists join_session_fact (
 join_session_fact_id string
,session_tracking_id string
,user_id string
,participant_id string
,meeting_id string
,session_mcs_id string
,browser_name string
,browser_version string
,endpoint string
,entrypoint string
,build_number string
,model_id string
,model_name string
,hardware_net string
,ip_address string
,country string
,region string
,city string
,os_type string
,os_architecture string
,os_locale string
,os_timezone string
,product_name string
,product_version string
,product_tier string
,participant_role string
,timezone string
,joinstart_ev_timestamp timestamp
,joinLaunch_ev_timestamp timestamp
,joinSession_ev_timestamp timestamp
,joinTime_ev_timestamp timestamp
,audioConnect_ev_timestamp timestamp
,connection_type string
,download_start_timestamp timestamp
,download_end_timestamp timestamp
,install_start_timestamp timestamp
,install_end_timestamp timestamp
,password_start_timestamp timestamp
,password_end_timestamp timestamp
,login_start_timestamp timestamp
,login_end_timestamp timestamp
,audioWait_start_timestamp timestamp
,audioWait_end_timestamp timestamp
,hallway_start_timestamp timestamp
,hallway_end_timestamp timestamp
,entrypoint_access_time double
,endpoint_access_time double
,panel_connect_time double
,audio_connect_time double
,install_time_endpoint double
,download_time_endpoint double
,install_time_launcher double
,download_time_launcher double
,join_time double
,process_data_timestamp timestamp
,source_date timestamp
,version string
,event_date timestamp
)
PARTITIONED BY (data_input_date string) 
stored as orc
location '${hiveconf:s3bucket}/${hiveconf:fact_path}/${hiveconf:join_session_fact}/'
TBLPROPERTIES (“orc.compress”=“snappy”);

请注意，当我在 Presto 中运行以下查询时，它工作正常：

select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
approx_percentile(cast(join_time as double),0.50) over (partition by 
cast(joinstart_ev_timestamp as date)) /1000 as jt,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date '2018-01-18' and date '2018-01-30'
and lower(product_name) LIKE 'gotowebinar%' 
and join_time > 0 and join_time <= 600000 and join_time is not null 
and audio_connect_time >= 0 
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2'

【问题讨论】：

标签： sql amazon-s3 amazon-redshift presto

【解决方案1】：

可能是 WITHIN GROUP。 AFAIK，不支持那些百分位函数。该错误可能是由于语法无法识别 WITHIN GROUP() 子句。

【讨论】：

嗨，马特，您的回复。我已经用运行正常的查询更新了上述问题。现在，您能帮我找到以下 Redshift 查询部分的相应 Presto 语法：“ PERCENTILE_DISC(0.02) WITHIN GROUP (ORDER BY cast(join_time as double)) over(partition by cast(joinstart_ev_timestamp as date))/ 1000 个迷你“

【解决方案2】：

我找到了到 Presto 的正确转换：

SELECT session_date,'min' as stats,mini as value,product,endpoint
from 
(select 
distinct cast(joinstart_ev_timestamp as date) as session_date, 
approx_percentile(cast(join_time as double),0.02) over (partition by 
cast(joinstart_ev_timestamp as date))/1000 as mini,
approx_percentile(cast(join_time as double),0.25) over (partition by 
cast(joinstart_ev_timestamp as date))/1000 as first_quartile,
approx_percentile(cast(join_time as double),0.50) over (partition by 
cast(joinstart_ev_timestamp as date))/1000 as jt,
approx_percentile(cast(join_time as double),0.75) over (partition by 
cast(joinstart_ev_timestamp as date))/1000 as third_quartile,
approx_percentile(cast(join_time as double),0.98) over (partition by 
cast(joinstart_ev_timestamp as date))/1000 as maxi,
product_name as product,
endpoint as endpoint
from datawarehouse.join_session_fact
where  
cast(joinstart_ev_timestamp as date)  between date_add('day', -16, now()) 
and  date_add('day', -1, now())
 and lower(product_name) LIKE 'gotowebinar%' 
 and join_time>0 and join_time <= 600000 and join_time is not null 
 and audio_connect_time >= 0  
and (entrypoint_access_time >= 0 or entrypoint_access_time is null)
and (panel_connect_time >= 0  or panel_connect_time is null) and version = 'V2')

【讨论】：