[Posted]: 2012-10-16 05:58:44
[Question]:
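I highly doubt I'm doing this in the most efficient way, which is why I tagged plpgsql here. I need to run this on 2 billion rows for a thousand measurement systems.
You have measurement systems that often report the previous value when they lose connectivity, and they lose connectivity a lot, usually in short bursts but sometimes for a long time. You need to aggregate, but when you do, you need to look at how long a value has been repeating and build various filters on that information. Say you are measuring a car's mpg, but it is stuck at 20 mpg for an hour and then moves to 20.1, and so on. You will want to evaluate the accuracy while the reading is stuck. You could also add some alternative rules that look for when the car is on the highway, and with window functions you can generate the 'state' of the car and have something to group on. Without further ado: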
--here's my data: you have different systems, the time of measurement, and the actual measurement
--the raw data also has whether or not it's a repeat (hence the included window function)
select * into temporary table cumulative_repeat_calculator_data
FROM
(
select
system_measured, time_of_measurement, measurement,
case when
measurement = lag(measurement,1) over (partition by system_measured order by time_of_measurement asc)
then 1 else 0 end as repeat
FROM
(
SELECT 5 as measurement, 1 as time_of_measurement, 1 as system_measured
UNION
SELECT 150 as measurement, 2 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 1 as system_measured
UNION
SELECT 5 as measurement, 1 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 2 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 3 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 4 as time_of_measurement, 2 as system_measured
UNION
SELECT 150 as measurement, 5 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 6 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 7 as time_of_measurement, 2 as system_measured
UNION
SELECT 5 as measurement, 8 as time_of_measurement, 2 as system_measured
) as data
) as data;
--unfortunately you can't have window functions within window functions, so I had to break it down into a subquery
--what we need is something to partition on, the 'state' of the system if you will, so I ran a running total of the nonrepeats
--this creates a column that stays the same while your data is repeating - aka something you can partition/group on
select * into temporary table cumulative_repeat_calculator_step_1
FROM
(
select
*,
sum(case when repeat = 0 then 1 else 0 end) over (partition by system_measured order by time_of_measurement asc) as cumlative_sum_of_nonrepeats_by_system
from cumulative_repeat_calculator_data
order by system_measured, time_of_measurement
) as data;
--finally, the query. I didn't bother showing my desired output, because this (finally) got it
--I wanted a sequential count of repeats that restarts when it stops repeating, and starts with the first repeat
--what you can do now is take the average measurement under some condition based on how long it was repeating, for example
select *,
case when repeat = 0 then 0
else
row_number() over (partition by cumlative_sum_of_nonrepeats_by_system, system_measured order by time_of_measurement) - 1
end as ordered_repeat
from cumulative_repeat_calculator_step_1
order by system_measured, time_of_measurement
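To make the target output explicit, the logic of the three steps above can be sketched outside the database as a single pass over rows sorted by system and time. This is only a reference sketch in Python, not the method used in this post; `Row`, `ROWS` and `ordered_repeats` are hypothetical names, and NULL measurements are not handled.

```python
from collections import namedtuple

# Hypothetical row type mirroring the columns of the sample data.
Row = namedtuple("Row", "system_measured time_of_measurement measurement")

# Same sample rows as the temporary table cumulative_repeat_calculator_data.
ROWS = [
    Row(1, 1, 5), Row(1, 2, 150), Row(1, 3, 5), Row(1, 4, 5),
    Row(2, 1, 5), Row(2, 2, 5), Row(2, 3, 5), Row(2, 4, 5),
    Row(2, 5, 150), Row(2, 6, 5), Row(2, 7, 5), Row(2, 8, 5),
]


def ordered_repeats(rows):
    """Yield (row, ordered_repeat) pairs in (system, time) order.

    The counter is 0 on the first row of a run and increases by 1 for
    every consecutive row of the same system with the same measurement.
    """
    prev_system = None
    prev_value = None
    count = 0
    ordered = sorted(rows, key=lambda r: (r.system_measured, r.time_of_measurement))
    for row in ordered:
        same_run = (row.system_measured == prev_system
                    and row.measurement == prev_value)
        count = count + 1 if same_run else 0
        yield row, count
        prev_system, prev_value = row.system_measured, row.measurement


if __name__ == "__main__":
    for row, n in ordered_repeats(ROWS):
        print(row.system_measured, row.time_of_measurement, row.measurement, n)
```

For the sample data, ordered_repeat comes out as 0, 0, 0, 1 for system 1 and 0, 1, 2, 3, 0, 0, 1, 2 for system 2, which is the result I am after.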
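So, what would you do differently to run this on a huge table, or what alternative tools would you use? I'm thinking of plpgsql because I suspect this needs to be done in the database or during the data insertion process, although I generally work with the data after it has been loaded. Is there any way to get this in one sweep without resorting to subqueries?
I have tested one alternative method, but it still relies on a subquery, and I think this is faster. For that method you create a 'starts and stops' table with start_timestamp, end_timestamp, system. Then you join it to the larger table, and if a row's timestamp falls between the two, you classify the row as being in that state, which is essentially an alternative to cumlative_sum_of_nonrepeats_by_system. But when you do this, you join on 1=1 for thousands of devices and thousands or millions of 'events'. Do you think that's a better way to go?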
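The 'starts and stops' idea can be illustrated roughly as follows. This is a sketch under assumptions, not the SQL I actually ran: `Run`, `build_runs` and `classify` are hypothetical names, and the nested loop deliberately mirrors the cost of joining every row against every interval.

```python
from collections import namedtuple

# Hypothetical types; the Run fields follow the start/stop table columns.
Row = namedtuple("Row", "system_measured time_of_measurement measurement")
Run = namedtuple("Run", "system start_timestamp end_timestamp")

ROWS = [
    Row(1, 1, 5), Row(1, 2, 150), Row(1, 3, 5), Row(1, 4, 5),
    Row(2, 1, 5), Row(2, 2, 5), Row(2, 3, 5), Row(2, 4, 5),
    Row(2, 5, 150), Row(2, 6, 5), Row(2, 7, 5), Row(2, 8, 5),
]


def build_runs(rows):
    """Collapse consecutive equal measurements per system into intervals."""
    runs = []
    prev = None
    ordered = sorted(rows, key=lambda r: (r.system_measured, r.time_of_measurement))
    for row in ordered:
        continues = (prev is not None
                     and prev.system_measured == row.system_measured
                     and prev.measurement == row.measurement)
        if continues:
            # Same state as the previous row: push the interval end forward.
            runs[-1] = runs[-1]._replace(end_timestamp=row.time_of_measurement)
        else:
            # New state: open an interval that starts and ends at this row.
            runs.append(Run(row.system_measured,
                            row.time_of_measurement,
                            row.time_of_measurement))
        prev = row
    return runs


def classify(rows, runs):
    """Range join: pair each row with the run containing its timestamp."""
    matched = []
    for row in rows:
        for run in runs:  # naive scan, comparable to a join on 1=1 plus a filter
            if (run.system == row.system_measured
                    and run.start_timestamp <= row.time_of_measurement <= run.end_timestamp):
                matched.append((row, run))
                break
    return matched


if __name__ == "__main__":
    for row, run in classify(ROWS, build_runs(ROWS)):
        print(row, run)
```

Each row's position within its run is then time_of_measurement relative to the other rows in the same interval, so the interval plays the same role as the running total of nonrepeats.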
[Comments]:
-
Consider posting the schema and a small sample of data, with some 'runs' of 'stuck' data and some normal data.
-
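"the raw data has whether or not it's a repeat": so we can take that as given, right? Please provide a table with data. It is unclear what is given. The desired result is defined, but hidden in your query. Provide it in plain text as well.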
Tags: sql performance postgresql plpgsql window-functions