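Aggregate window function and outer join: answers

【Question title】: Aggregate window function and outer join
【Posted】: 2022-01-15 19:49:06
【Question】:

I am trying to solve the following problem in a performance-oriented way. My current implementation involves ugly loops and is very slow.

Specifically, I have a table (transactions) containing timestamped orders of various items for each customer: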

timestamp	customer	item	volume
2000	Joe	A	100
2001	Joe	A	200
2001	Doe	A	100

In addition, I have a second table (valuations) showing the prices of the items:

timestamp	item	price
2000	A	1.1
2001	A	1.2
2002	A	1.3
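
To make the two tables above reproducible, a minimal setup script might look like the following. The column types are assumptions: they mirror the types used for the temporary table in Solution 1 (`int` timestamps, `varchar(30)` names, `decimal(10,1)` prices), and `volume int` is a guess, since the question does not show the real schema.

```sql
-- Assumed schema: the question does not show the real column types.
create table transactions (
  timestamp int not null,
  customer  varchar(30) not null,
  item      varchar(30) not null,
  volume    int not null
);

create table valuations (
  timestamp int not null,
  item      varchar(30) not null,
  price     decimal(10,1) not null
);

-- Sample rows copied from the question.
insert into transactions (timestamp, customer, item, volume) values
  (2000, 'Joe', 'A', 100),
  (2001, 'Joe', 'A', 200),
  (2001, 'Doe', 'A', 100);

insert into valuations (timestamp, item, price) values
  (2000, 'A', 1.1),
  (2001, 'A', 1.2),
  (2002, 'A', 1.3);
```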

Now I want to track the value (price * stock) of each customer's stock (cumulative volume) at the timestamps in the valuations table:

timestamp	customer	item	stock	value
2000	Joe	A	100	110
2001	Joe	A	300	360
2002	Joe	A	300	390
2001	Doe	A	100	120
2002	Doe	A	100	130

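Essentially, this would be some form of (right) join of transactions and valuations. The problem, however, is that I have to do the right join for each (customer, item) combination. In other words, for each (customer, item) I have to join against the complete set of timestamps.

My current (probably very inefficient) solution loops over the customers. For each customer, it computes the cumulative volume, right joins the valuations, and forward fills (using a last function) the columns that come from the transactions table: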

-- State transition function: keep the most recent non-null value.
CREATE OR REPLACE FUNCTION public.last_func(anyelement, anyelement)
 RETURNS anyelement
 LANGUAGE sql
 IMMUTABLE STRICT
AS $function$
select $2;
$function$
;

-- Aggregate built on last_func; assumed here, because the query below calls last(...).
CREATE AGGREGATE public.last(anyelement) (
    sfunc = public.last_func,
    stype = anyelement
);
    
select 
    valuations.timestamp,
    last(t.customer) over (partition by valuations.item order by valuations.timestamp) as customer,
    valuations.item,
    last(t.stock) over (partition by valuations.item order by valuations.timestamp) as stock,
    last(t.stock) over (partition by valuations.item order by valuations.timestamp) * valuations.price as value
from (select 
    timestamp,
    customer,
    item,
    volume as order_volume,
    sum(volume) over (partition by item order by item, timestamp) as stock
from 
    transactions
where customer = 'Joe') t
right join 
    valuations on t.timestamp = valuations.timestamp and t.item = valuations.item

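This seems rather inefficient and becomes very slow for a large number of customers. Does anyone know how to do this in one go? It would be great if you could help me here.

Thanks in advance and best regards

【Question comments】:

Please add the desired output.
The desired output is shown in the third table.

Tags: sql postgresql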

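【Solution 1】:

Just a suggestion, since I can't test this on a large amount of data.

But what if you used a temporary table that holds all the expected combinations of customers and valuations?

Then left join the transactions to it to calculate the rolling sum.

For example: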

create temporary table tmp_customer_valuations (
 timestamp int not null, 
 item varchar(30) not null, 
 customer varchar(30) not null, 
 price decimal(10,1) not null
);

-- One row per (customer, item) and valuation, starting at the customer's first order
insert into tmp_customer_valuations
(timestamp, item, price, customer)
select v.timestamp, v.item, v.price, c.customer
from valuations v
join (
  select item, customer, min(timestamp) as min_timestamp
  from transactions
  group by item, customer 
) c
  on c.item = v.item
 and c.min_timestamp <= v.timestamp;

create index idx_tmp_customer_valuations
on tmp_customer_valuations (timestamp, item);

select 
  tmp.timestamp
, tmp.customer
, tmp.item
--, tr.volume as order_volume,
, sum(coalesce(tr.volume, 0)) 
     over (partition by tmp.item, tmp.customer 
           order by tmp.timestamp) as stock
, tmp.price * sum(coalesce(tr.volume, 0)) 
     over (partition by tmp.item, tmp.customer 
           order by tmp.timestamp) as value
from tmp_customer_valuations tmp
left join transactions tr
  on tr.timestamp = tmp.timestamp 
 and tr.item = tmp.item
 and tr.customer = tmp.customer
order by
 tmp.customer desc,
 tmp.item,
 tmp.timestamp;

timestamp	customer	item	stock	value
2000	Joe	A	100	110.0
2001	Joe	A	300	360.0
2002	Joe	A	300	390.0
2001	Doe	A	100	120.0
2002	Doe	A	100	130.0

db<>fiddle here

(By the way, also check whether the tables could use additional indexes.)
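
As one possible illustration of that last remark, an index on the columns the left join compares could be tried. This is only a sketch: the index name is made up, the schema is the assumed one (it is created only if the table does not exist yet), and whether the planner actually uses the index depends on the data volume, so the plan should be checked with `explain`.

```sql
-- Assumed schema; skipped if the table already exists.
create table if not exists transactions (
  timestamp int not null,
  customer  varchar(30) not null,
  item      varchar(30) not null,
  volume    int not null
);

-- Hypothetical index matching the equality join on (item, customer, timestamp).
create index if not exists idx_transactions_item_customer_ts
  on transactions (item, customer, timestamp);

-- Inspect the plan to see whether the index is chosen for a lookup of this shape.
explain
select tr.volume
from transactions tr
where tr.item = 'A'
  and tr.customer = 'Joe'
  and tr.timestamp = 2001;
```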

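【Comments】:

This seems to run fast. But is it really necessary to create the temporary table? Couldn't the contents of tmp_customer_valuations be put directly into the final query, leaving it to postgres to work out whether an intermediate index is worth creating?
Not sure, but you could try putting the query used for the insert into a CTE or subquery and check how it performs. If that works fine, then I guess it's even better.
Just tried it and it works well: dbfiddle.uk/… Thanks a lot!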
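
The fiddle link in the last comment is truncated, so the exact query the commenter ran is not known. As a hedged sketch, the CTE variant discussed here could look like the following. The name `customer_valuations` is made up for illustration, and the question's sample data is inlined as CTEs (shadowing any real tables of the same name) so the statement runs on its own.

```sql
-- Sample data from the question, inlined so the sketch is self-contained.
with transactions(timestamp, customer, item, volume) as (
  values (2000, 'Joe', 'A', 100),
         (2001, 'Joe', 'A', 200),
         (2001, 'Doe', 'A', 100)
),
valuations(timestamp, item, price) as (
  values (2000, 'A', 1.1),
         (2001, 'A', 1.2),
         (2002, 'A', 1.3)
),
-- Same rows the temporary table held: every valuation from a
-- customer's first order of an item onward.
customer_valuations as (
  select v.timestamp, v.item, v.price, c.customer
  from valuations v
  join (
    select item, customer, min(timestamp) as min_timestamp
    from transactions
    group by item, customer
  ) c
    on c.item = v.item
   and c.min_timestamp <= v.timestamp
)
select
  cv.timestamp,
  cv.customer,
  cv.item,
  sum(coalesce(tr.volume, 0))
    over (partition by cv.item, cv.customer order by cv.timestamp) as stock,
  cv.price * sum(coalesce(tr.volume, 0))
    over (partition by cv.item, cv.customer order by cv.timestamp) as value
from customer_valuations cv
left join transactions tr
  on tr.timestamp = cv.timestamp
 and tr.item = cv.item
 and tr.customer = cv.customer
order by cv.customer desc, cv.item, cv.timestamp;
```

On the sample data this should return the same five rows as the temporary-table version. Without the temporary table there is no explicit index on the intermediate result, so comparing both plans on realistic data is worthwhile.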

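【Solution 2】:

This looks like a good case for a lateral join. It does not assume that the timestamps line up. I would guess that, in general, there could be no transactions at all between two valuations, or several of them. (I'm not even sure you need the outer join.)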

select v.*, t.customer, t.stock, t.stock * v.price as value
from valuations v left join lateral (
    -- latest running total per customer, up to this valuation's timestamp
    select distinct on (customer) customer,
        sum(volume) over (partition by customer, item order by timestamp) as stock
    from transactions t
    where t.item = v.item and t.timestamp <= v.timestamp
    order by customer, timestamp desc
) t on true
order by customer, timestamp;

https://dbfiddle.uk/?rdbms=postgres_10&fiddle=af82f52655dfc55029e430b7933cd899
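
Since only the total per customer up to each valuation is needed, the same idea can also be sketched with a plain `group by` inside the lateral subquery instead of `distinct on` plus a window function. This is a variation for illustration, not the answerer's query. It uses `cross join lateral`, so a valuation with no earlier transactions produces no row, and it inlines the question's sample data as CTEs so it runs on its own.

```sql
-- Sample data from the question, inlined so the sketch is self-contained.
with transactions(timestamp, customer, item, volume) as (
  values (2000, 'Joe', 'A', 100),
         (2001, 'Joe', 'A', 200),
         (2001, 'Doe', 'A', 100)
),
valuations(timestamp, item, price) as (
  values (2000, 'A', 1.1),
         (2001, 'A', 1.2),
         (2002, 'A', 1.3)
)
select v.timestamp, s.customer, v.item, s.stock, s.stock * v.price as value
from valuations v
cross join lateral (
  -- One row per customer: total volume of this item ordered
  -- at or before the valuation's timestamp.
  select t.customer, sum(t.volume) as stock
  from transactions t
  where t.item = v.item
    and t.timestamp <= v.timestamp
  group by t.customer
) s
order by s.customer desc, v.timestamp;
```

Because the subquery filters on `t.timestamp <= v.timestamp`, a transaction that falls between two valuation timestamps is still counted at the next valuation, which is the case the answer above is concerned with.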

【Comments】: