How to optimize SQL query with window functions: answers

【Question title】: How to optimize SQL query with window functions
【Posted】: 2015-11-22 03:13:26
【Question description】:

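This question is related to this one. I have a table containing power values for devices, and I need to calculate the power consumption over a given time span and return the 10 most power-consuming devices. I have generated 192 devices and 7742208 measurement records (40324 each). That is roughly the number of records the devices would produce in one month.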

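For this amount of data my current query takes over 40 seconds to execute, which is too slow, because the time span and the number of devices and measurements could be much higher. Should I try to solve this with a different approach than lag() OVER PARTITION, and what other optimizations can be made? I would really appreciate suggestions with code examples.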

PostgreSQL version 9.4

Query with example values:

SELECT
  t.device_id,
  sum(len_y*(extract(epoch from len_x))) AS total_consumption
FROM (
    SELECT
      m.id,
      m.device_id,
      m.power_total,
      m.created_at,
      m.power_total+lag(m.power_total) OVER (
        PARTITION BY device_id
        ORDER BY m.created_at
      ) AS len_y,
      m.created_at-lag(m.created_at) OVER (
        PARTITION BY device_id
        ORDER BY m.created_at
      ) AS len_x
    FROM
      measurements AS m
    WHERE m.created_at BETWEEN '2015-07-30 13:05:24.403552+00'::timestamp
      AND '2015-08-27 12:34:59.826837+00'::timestamp
) AS t
GROUP BY t.device_id
ORDER BY total_consumption DESC
LIMIT 10;
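
The subquery pairs each measurement with its predecessor on the same device, and the outer query sums (sum of the two power values) × (elapsed seconds). That is twice the trapezoidal-rule estimate of the area under the power curve. The constant factor does not change the ranking, but it matters if the absolute value is used. A minimal sketch of the same query with the averaging made explicit and the repeated window definition moved into a WINDOW clause. It assumes created_at is timestamp with time zone, as in the table info, so the literals are cast to timestamptz here. The unit of the result depends on the unit of power_total, which the question does not state.

```sql
SELECT
  t.device_id,
  -- average of two neighbouring power values times the elapsed seconds
  sum(t.len_y / 2.0 * extract(epoch FROM t.len_x)) AS total_consumption
FROM (
    SELECT
      m.device_id,
      m.power_total + lag(m.power_total) OVER w AS len_y,
      m.created_at - lag(m.created_at) OVER w AS len_x
    FROM measurements AS m
    WHERE m.created_at BETWEEN '2015-07-30 13:05:24.403552+00'::timestamptz
                           AND '2015-08-27 12:34:59.826837+00'::timestamptz
    -- one shared window definition for both lag() calls
    WINDOW w AS (PARTITION BY m.device_id ORDER BY m.created_at)
) AS t
GROUP BY t.device_id
ORDER BY total_consumption DESC
LIMIT 10;
```

The first row of each device has no predecessor, so both lag() calls return NULL for it and sum() ignores that row.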

Table info:

    Column    |           Type           |                         Modifiers
--------------+--------------------------+----------------------------------------------------------
 id           | integer                  | not null default nextval('measurements_id_seq'::regclass)
 created_at   | timestamp with time zone | default timezone('utc'::text, now())
 power_total  | real                     |
 device_id    | integer                  | not null
Indexes:
    "measurements_pkey" PRIMARY KEY, btree (id)
    "measurements_device_id_idx" btree (device_id)
    "measurements_created_at_idx" btree (created_at)
Foreign-key constraints:
    "measurements_device_id_fkey" FOREIGN KEY (device_id) REFERENCES devices(id)

Query plan:

Limit  (cost=1317403.25..1317403.27 rows=10 width=24) (actual time=41077.091..41077.094 rows=10 loops=1)
->  Sort  (cost=1317403.25..1317403.73 rows=192 width=24) (actual time=41077.089..41077.092 rows=10 loops=1)
Sort Key: (sum((((m.power_total + lag(m.power_total) OVER (?))) * date_part('epoch'::text, ((m.created_at - lag(m.created_at) OVER (?)))))))
Sort Method: top-N heapsort  Memory: 25kB
->  GroupAggregate  (cost=1041700.67..1317399.10 rows=192 width=24) (actual time=25361.013..41076.562 rows=192 loops=1)
Group Key: m.device_id
->  WindowAgg  (cost=1041700.67..1201314.44 rows=5804137 width=20) (actual time=25291.797..37839.727 rows=7742208 loops=1)
->  Sort  (cost=1041700.67..1056211.02 rows=5804137 width=20) (actual time=25291.746..30699.993 rows=7742208 loops=1)
Sort Key: m.device_id, m.created_at
Sort Method: external merge  Disk: 257344kB
->  Seq Scan on measurements m  (cost=0.00..151582.05 rows=5804137 width=20) (actual time=0.333..5112.851 rows=7742208 loops=1)
Filter: ((created_at >= '2015-07-30 13:05:24.403552'::timestamp without time zone) AND (created_at <= '2015-08-27 12:34:59.826837'::timestamp without time zone))

Planning time: 0.351 ms
Execution time: 41114.883 ms

Query to generate the test table and data:

CREATE TABLE measurements (
    id          serial primary key,
    device_id   integer,
    power_total real,
    created_at  timestamp
);

INSERT INTO measurements(
    device_id,
    created_at,
    power_total
  )
SELECT
  device_id,
  now() + (i * interval '1 minute'),
  random()*(50-1)+1
FROM (
  SELECT
    DISTINCT(device_id),
    generate_series(0,10) AS i
 FROM (
  SELECT
    generate_series(1,5) AS device_id
  ) AS dev_ids
) AS gen_table;
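
The script above produces only 5 devices with 11 measurements each. A minimal sketch for approximating the volume described in the question (192 devices, 40324 measurements each), assuming one measurement per minute as in the original script. The generate_series bounds are inferred from those figures and are not values confirmed by the asker.

```sql
-- 192 devices x 40324 measurements = 7742208 rows, one row per device per minute
INSERT INTO measurements (device_id, created_at, power_total)
SELECT
  d.device_id,
  now() + (s.i * interval '1 minute'),
  random() * (50 - 1) + 1          -- random power value between 1 and 50
FROM generate_series(1, 192) AS d(device_id)
CROSS JOIN generate_series(1, 40324) AS s(i);
```

Because the timestamps start at now(), the generated rows fall into the fixed date range of the example query only if that range is adjusted to match.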

【Question discussion】:

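How about a composite index on (device_id, created_at)? BTW, IMHO you should divide m.power_total+lag(m.power_total) by two before using it. (Or just take the average.)
+1 Best question I've seen in a long time. Well written, and the sample is just right. I created the sample database in a second. Now, what values should I put in the series to generate a database similar in size to your current one?
Your where condition doesn't remove any rows. Is that intentional? The sort is also done on disk: external merge Disk: 257344kB, which takes a long time (your execution plan lost its indentation, so it is a bit hard to read). If you increase work_mem for your session until the sort is done in memory, you should see better performance.
This is what I get when I create an index on (device_id, created_at): explain.depesz.com/s/7XSj
@a_horse_with_no_name Yes, that is intentional for this demo case. Thanks for the memory tip. Increasing work_mem from the default 4MB to 10MB got me down to 35s. I also tried higher values, but those only slowed execution down. With 820MB I was able to get Sort Method: quicksort Memory: 798633kB, with an execution time of 55 seconds.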
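
The two suggestions from these comments can be tried as in the following sketch. The index name and the work_mem value are illustrative assumptions, not settings taken from the thread, and as the asker reports, more memory does not necessarily make the query faster.

```sql
-- Composite index matching the window's PARTITION BY / ORDER BY columns
-- (index name is an assumption)
CREATE INDEX measurements_device_id_created_at_idx
    ON measurements (device_id, created_at);

-- Raise sort memory for the current session only; 64MB is just an example value
SET work_mem = '64MB';

-- Inspect whether the sort still spills to disk ("external merge")
-- or whether the planner reads the rows in index order instead
EXPLAIN (ANALYZE, BUFFERS)
SELECT
  device_id,
  lag(power_total) OVER (PARTITION BY device_id ORDER BY created_at)
FROM measurements;

-- Return to the configured default
RESET work_mem;
```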

Tags: sql postgresql optimization query-optimization

【Solution 1】:

I would try to move part of the calculation to the row insertion phase.

Add a new column:

alter table measurements add consumption real;

Update the column:

with m1 as (
    select
        id, power_total, created_at,
        lag(power_total) over (partition by device_id order by created_at) prev_power_total,
        lag(created_at) over (partition by device_id order by created_at) prev_created_at
    from measurements
    )
update measurements m2
set consumption = 
    (m1.power_total+ m1.prev_power_total)*
    extract(epoch from m1.created_at- m1.prev_created_at)
from m1
where m2.id = m1.id;

Create a trigger:

create or replace function before_insert_on_measurements()
returns trigger language plpgsql
as $$
declare
    rec record;
begin
    select power_total, created_at into rec
    from measurements
    where device_id = new.device_id
    order by created_at desc
    limit 1;
    new.consumption:= 
        (new.power_total+ rec.power_total)*
        extract(epoch from new.created_at- rec.created_at);
    return new;
end $$;

create trigger before_insert_on_measurements
before insert on measurements
for each row execute procedure before_insert_on_measurements();

Query:

select device_id, sum(consumption) total_consumption
from measurements
-- where conditions
group by 1
order by 1

【Discussion】:

Thanks! With this approach I was able to achieve an execution time of 9 seconds. BTW, it should order by the second column. =)
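
A sketch of the query with that correction applied, returning the ten largest consumers as the question asks. The date range is copied from the question and cast to timestamptz on the assumption that created_at has that type.

```sql
SELECT
  device_id,
  sum(consumption) AS total_consumption
FROM measurements
WHERE created_at BETWEEN '2015-07-30 13:05:24.403552+00'::timestamptz
                     AND '2015-08-27 12:34:59.826837+00'::timestamptz
GROUP BY device_id
ORDER BY total_consumption DESC   -- sort by the aggregate, not by device_id
LIMIT 10;
```

One difference from the original query: the precomputed consumption of the first row inside the range covers the interval back to the preceding measurement, which may lie before the start of the range, whereas the original query yields NULL for that row.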

【Solution 2】:

I think your problem lies elsewhere.

I created sample data with 8M rows (200 devices, 40000 measurements each)

and the response was very fast (2 seconds).

Postgres 9.3 - Core i5 / 3.2 GHz / 8 GB / SATA disk / Windows 7
I haven't created the index yet (that part is missing from your setup script).

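【Discussion】:

Did you make sure the where condition does not filter out any of the 8 million rows? Because that is what happens in the original query. If I run the sample with 8 million rows, it takes about 12 seconds (which is still faster than the original time).
@a_horse_with_no_name I just copied the select from the OP's question. Will check again.
@a_horse_with_no_name Why do you say it doesn't filter any rows? The inner select with WHERE m.created_at BETWEEN '2015-07-30 13:05:24.403552+00'::timestamp AND '2015-08-27 12:34:59.826837+00'::timestamp returns 5800 rows; the select without the where returns 8M records and takes 300 seconds.
In the original plan there is no Rows Removed by Filter: line below Seq Scan on measurements. That indicates the WHERE condition did not remove anything.
@a_horse_with_no_name Without the where, the whole query takes 22 seconds. Still less than 40 seconds. Will test the index now.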
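
The disagreement above comes down to how many rows the date range actually matches. The setup script derives timestamps from now(), so this presumably depends on when the test data was generated. A minimal sketch for checking it directly before comparing timings, using the literals from the question and assuming a timestamptz comparison:

```sql
SELECT
  count(*) AS total_rows,
  -- rows that the question's WHERE condition would keep
  sum(CASE WHEN created_at BETWEEN '2015-07-30 13:05:24.403552+00'::timestamptz
                               AND '2015-08-27 12:34:59.826837+00'::timestamptz
           THEN 1 ELSE 0 END) AS rows_in_range
FROM measurements;
```

If rows_in_range equals total_rows, the condition filters nothing and the timings are comparable with the plan in the question.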