【Question title】: Take first, second, third ... last value and selecting rows (Window function with filter and lag)
【Posted】: 2025-12-28 01:35:07
【Question description】:

I want to run a window function with a FILTER clause, for example:

LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC) AS "A_lag_1"

However, Postgres does not support this, and I cannot work out another approach. Details follow.

Challenge

Input tab_A:

+----+------+------+
| id | type | date |
+----+------+------+
|  1 | A    |   30 |
|  1 | A    |   25 |
|  1 | A    |   20 |
|  1 | B    |   29 |
|  1 | B    |   28 |
|  1 | B    |   21 |
|  1 | C    |   24 |
|  1 | C    |   22 |
+----+------+------+
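
To reproduce the examples, the input can be loaded with a setup script along these lines. The column types are an assumption, since the question does not state them; "date" is declared as an integer here only to match the sample values.

```sql
-- Setup sketch; column types are assumed, not taken from the question.
CREATE TABLE tab_A (
    id     integer,
    type   text,
    "date" integer   -- the sample uses small integers in place of real dates
);

INSERT INTO tab_A (id, type, "date") VALUES
    (1, 'A', 30),
    (1, 'A', 25),
    (1, 'A', 20),
    (1, 'B', 29),
    (1, 'B', 28),
    (1, 'B', 21),
    (1, 'C', 24),
    (1, 'C', 22);
```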

Desired output:

+----+------+------+---------+---------+---------+---------+---------+---------+
| id | type | date | A_lag_1 | A_lag_2 | B_lag_1 | B_lag_2 | C_lag_1 | C_lag_2 |
+----+------+------+---------+---------+---------+---------+---------+---------+
|  1 | A    |   30 |      25 |      20 |      29 |      28 |      24 |      22 |
|  1 | A    |   25 |      20 |         |      21 |         |      24 |      22 |
|  1 | A    |   20 |         |         |         |         |         |         |
|  1 | B    |   29 |      25 |      20 |      28 |      21 |      24 |      22 |
|  1 | B    |   28 |      25 |      20 |      21 |         |      24 |      22 |
|  1 | B    |   21 |      20 |         |         |         |         |         |
|  1 | C    |   24 |      20 |         |      21 |         |      22 |         |
|  1 | C    |   22 |      20 |         |      21 |         |         |         |
+----+------+------+---------+---------+---------+---------+---------+---------+

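In words:

  • For each row, select all rows that occurred before it (see the date column)
  • Then, for each type ('A', 'B', 'C'), put the most recent date into A_lag_1 and the second most recent (by date) value into A_lag_2 for type 'A', into B_lag_1 and B_lag_2 for 'B', and so on.

The example above is heavily simplified; my real use case has many more id values, more lag-column iterations (A_lag_X), and more types.

Possible solution

This challenge seems a good fit for a window function, since I want to keep the same number of rows as tab_A and append information that relates to each row but lies in its past.

So I constructed it with window functions (sqlfiddle):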

SELECT
  id, type, "date",
  LAG("date", 1) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_1",
  LAG("date", 2) FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "A_lag_2",
  LAG("date", 1) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_1",
  LAG("date", 2) FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "B_lag_2",
  LAG("date", 1) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_1",
  LAG("date", 2) FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC) AS "C_lag_2"
FROM tab_A

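However, I get the following error:

ERROR: FILTER is not implemented for non-aggregate window functions Position: 30

Although this error is mentioned in the documentation, I cannot work out another way to do it.

Any help would be greatly appreciated.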


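Other SO questions:

  • 1. This answer relies on an aggregate function such as max. However, that does not work when trying to retrieve the second-to-last row, the third-to-last row, and so on.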
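
To make that limitation concrete, here is a minimal sketch. An aggregate such as max does accept FILTER in a window call, so it can stand in for the first lag, but it has no way to reach the second or third most recent value. The frame clause is an addition for this illustration and is not taken from the linked answer: it restricts each row to the rows that follow it in descending "date" order, which are the earlier dates when dates are distinct.

```sql
SELECT
  id, type, "date",
  -- Largest earlier "date" of type A, i.e. the most recent one before this row
  max("date") FILTER (WHERE type = 'A') OVER (
    PARTITION BY id
    ORDER BY "date" DESC
    ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
  ) AS "A_lag_1"
FROM tab_A;
```

On the sample data this fills A_lag_1 (25 for the row with date 30, for instance), but there is no equivalent built-in aggregate for A_lag_2.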

【Question discussion】:

    Tags: sql postgresql window-functions postgresql-9.6


    【Solution 1】:

    Another possible solution, using a lateral join (fiddle):

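    SELECT
        a.id,
        a.type,
        a."date",
        c.nn_row,
        c.type,
        c."date" as "date_joined"
    FROM tab_A AS a
    LEFT JOIN LATERAL (
        SELECT
            type,
            "date",
            row_number() OVER (PARTITION BY id, type ORDER BY id ASC, "date" DESC) as nn_row
        FROM tab_A AS b
        -- Only earlier rows of the same id (the sample has a single id)
        WHERE b.id = a.id AND a."date" > b."date"
    ) AS c on true
    -- Filtering on c here drops rows that have no earlier match (e.g. A/20)
    WHERE c.nn_row <= 5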
    

    This creates a long table like this:

    +----+------+------+--------+------+-------------+
    | id | type | date | nn_row | type | date_joined |
    +----+------+------+--------+------+-------------+
    |  1 | A    |   30 |      1 | A    |          25 |
    |  1 | A    |   30 |      2 | A    |          20 |
    |  1 | A    |   30 |      1 | B    |          29 |
    |  1 | A    |   30 |      2 | B    |          28 |
    |  1 | A    |   30 |      3 | B    |          21 |
    |  1 | A    |   30 |      1 | C    |          24 |
    |  1 | A    |   30 |      2 | C    |          22 |
    |  1 | A    |   25 |      1 | A    |          20 |
    |  1 | A    |   25 |      1 | B    |          21 |
    |  1 | A    |   25 |      1 | C    |          24 |
    |  1 | A    |   25 |      2 | C    |          22 |
    |  1 | B    |   29 |      1 | A    |          25 |
    |  1 | B    |   29 |      2 | A    |          20 |
    |  1 | B    |   29 |      1 | B    |          28 |
    |  1 | B    |   29 |      2 | B    |          21 |
    |  1 | B    |   29 |      1 | C    |          24 |
    |  1 | B    |   29 |      2 | C    |          22 |
    |  1 | B    |   28 |      1 | A    |          25 |
    |  1 | B    |   28 |      2 | A    |          20 |
    |  1 | B    |   28 |      1 | B    |          21 |
    |  1 | B    |   28 |      1 | C    |          24 |
    |  1 | B    |   28 |      2 | C    |          22 |
    |  1 | B    |   21 |      1 | A    |          20 |
    |  1 | C    |   24 |      1 | A    |          20 |
    |  1 | C    |   24 |      1 | B    |          21 |
    |  1 | C    |   24 |      1 | C    |          22 |
    |  1 | C    |   22 |      1 | A    |          20 |
    |  1 | C    |   22 |      1 | B    |          21 |
    +----+------+------+--------+------+-------------+
    

    After that, you can pivot to the desired output.
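
The pivot step is not shown in the answer. One way it might look, as a sketch only, is conditional aggregation over the same lateral join. Two choices here are assumptions rather than the author's code: the rank limit is moved into the ON clause so that rows without any earlier match (such as A/20) are kept, and the join is restricted to rows of the same id.

```sql
SELECT
    a.id,
    a.type,
    a."date",
    -- Each FILTER picks at most one joined row, so max() just unwraps it
    max(c."date") FILTER (WHERE c.type = 'A' AND c.nn_row = 1) AS "A_lag_1",
    max(c."date") FILTER (WHERE c.type = 'A' AND c.nn_row = 2) AS "A_lag_2",
    max(c."date") FILTER (WHERE c.type = 'B' AND c.nn_row = 1) AS "B_lag_1",
    max(c."date") FILTER (WHERE c.type = 'B' AND c.nn_row = 2) AS "B_lag_2",
    max(c."date") FILTER (WHERE c.type = 'C' AND c.nn_row = 1) AS "C_lag_1",
    max(c."date") FILTER (WHERE c.type = 'C' AND c.nn_row = 2) AS "C_lag_2"
FROM tab_A AS a
LEFT JOIN LATERAL (
    SELECT
        b.type,
        b."date",
        row_number() OVER (PARTITION BY b.type ORDER BY b."date" DESC) AS nn_row
    FROM tab_A AS b
    WHERE b.id = a.id            -- assumption: lags stay within one id
      AND b."date" < a."date"
) AS c ON c.nn_row <= 2          -- condition in ON keeps rows with no history
GROUP BY a.id, a.type, a."date"
ORDER BY a.id, a.type, a."date" DESC;
```

Note that GROUP BY collapses input rows that share the same id, type and "date", so this sketch assumes that combination is unique. It still runs the lateral subquery once per row, so it does nothing about the resource problem described below.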

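    However, although this worked for me on a small sample, on the full table Postgres ran out of disk space (even though I had 50GB available):

    ERROR: could not write to hash-join temporary file: No space left on device

    I have posted this solution here because it may work for others with smaller tables.

    【Discussion】:

      【Solution 2】: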

      Since the FILTER clause does work for aggregate functions, I decided to write my own:

      ----- N = 1
      -- State transition function
      -- agg_state: the current state, el: new element
      create or replace function lag_agg_sfunc_1(agg_state point, el float)
          returns point
          immutable
          language plpgsql
          as $$
      declare
          i integer;
          stored_value float;
      begin
          i := agg_state[0];
          stored_value := agg_state[1];
      
          i := i + 1; -- First row i=1
          if i = 1 then
              stored_value := el;
          end if;
          return point(i, stored_value);
      end;
      $$;
      
      -- Final function
      --DROP FUNCTION lag_agg_ffunc_1(point) CASCADE;
      create or replace function lag_agg_ffunc_1(agg_state point)
          returns float
          immutable
          strict
          language plpgsql
          as $$
      begin
        return agg_state[1];
      end;
      $$;
      
      -- Aggregate function
      drop aggregate if exists lag_agg_1(double precision);
      create aggregate lag_agg_1 (float) (
          sfunc = lag_agg_sfunc_1,
          stype = point,
          finalfunc = lag_agg_ffunc_1,
          initcond = '(0,0)'
      );
      
      
      ----- N = 2
      -- State transition function
      -- agg_state: the current state, el: new element
      create or replace function lag_agg_sfunc_2(agg_state point, el float)
          returns point
          immutable
          language plpgsql
          as $$
      declare
          i integer;
          stored_value float;
      begin
          i := agg_state[0];
          stored_value := agg_state[1];
      
          i := i + 1; -- First row i=1
          if i = 2 then
              stored_value := el;
          end if;
          return point(i, stored_value);
      end;
      $$;
      
      -- Final function
      --DROP FUNCTION lag_agg_ffunc_2(point) CASCADE;
      create or replace function lag_agg_ffunc_2(agg_state point)
          returns float
          immutable
          strict
          language plpgsql
          as $$
      begin
        return agg_state[1];
      end;
      $$;
      
      -- Aggregate function
      drop aggregate if exists lag_agg_2(double precision);
      create aggregate lag_agg_2 (float) (
          sfunc = lag_agg_sfunc_2,
          stype = point,
          finalfunc = lag_agg_ffunc_2,
          initcond = '(0,0)'
      );
      

      You can use the aggregate functions lag_agg_1 and lag_agg_2 above together with the window expression from the original question:

      SELECT
        id, type, "date",
        NULLIF(lag_agg_1("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_1",
        NULLIF(lag_agg_2("date") FILTER (WHERE type='A') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "A_lag_2",
        NULLIF(lag_agg_1("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_1",
        NULLIF(lag_agg_2("date") FILTER (WHERE type='B') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "B_lag_2",
        NULLIF(lag_agg_1("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_1",
        NULLIF(lag_agg_2("date") FILTER (WHERE type='C') OVER (PARTITION BY id ORDER BY id ASC, "date" DESC ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING), 0) AS "C_lag_2"
      FROM tab_A
      ORDER BY id ASC, type, "date" DESC
      

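      Compared with the other options, it runs fairly quickly. Some things that could be improved:

      • I could not work out how to handle nulls properly, so the result is faked at the end by converting every 0 to NULL. This causes problems in some cases (a genuine value of 0 is also turned into NULL)
      • I simply copied and pasted the function for each lag_X, because I could not work out how to parameterize it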
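
One built-in alternative that might address both points is offered here as a sketch, not as part of the original answer. array_agg is an ordinary aggregate, so it accepts FILTER in a window call; collecting the earlier values of one type into an array and subscripting it gives the N-th lag without a custom function per N. A missing element comes back as NULL, so no 0 placeholder is needed. The named window w is introduced purely for this illustration, and the approach has not been benchmarked against the custom aggregates.

```sql
SELECT
  id, type, "date",
  -- The array holds earlier values of one type, most recent first
  (array_agg("date") FILTER (WHERE type = 'A') OVER w)[1] AS "A_lag_1",
  (array_agg("date") FILTER (WHERE type = 'A') OVER w)[2] AS "A_lag_2",
  (array_agg("date") FILTER (WHERE type = 'B') OVER w)[1] AS "B_lag_1",
  (array_agg("date") FILTER (WHERE type = 'B') OVER w)[2] AS "B_lag_2",
  (array_agg("date") FILTER (WHERE type = 'C') OVER w)[1] AS "C_lag_1",
  (array_agg("date") FILTER (WHERE type = 'C') OVER w)[2] AS "C_lag_2"
FROM tab_A
WINDOW w AS (
  PARTITION BY id
  ORDER BY "date" DESC
  ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING
)
ORDER BY id ASC, type, "date" DESC;
```

For the sample row with date 30 this yields 25, 20, 29, 28, 24, 22. If "date" itself can be NULL, adding AND "date" IS NOT NULL to each FILTER condition would keep those entries out of the arrays. Rows with equal dates are ordered arbitrarily within the frame, so ties would need an extra ORDER BY key.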

      Any help with the above would be greatly appreciated.

      【Discussion】:

        【Solution 3】:

        You can try the approach below.

        -- dateVAL plays the role of the "date" column from the question.
        -- Each subquery is correlated on id so that lags stay within one id.
        SELECT
        dt.* ,
        (SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.id = dt.id AND b.type = 'A' AND dt.A_lag_1 >  b.dateVAL  ) AS "A_lag_2",
        (SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.id = dt.id AND b.type = 'B' AND dt.B_lag_1 >  b.dateVAL  ) AS "B_lag_2" ,
        (SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.id = dt.id AND b.type = 'C' AND dt.C_lag_1 >  b.dateVAL  ) AS "C_lag_2"
        FROM
        (
        SELECT
          a.id, a.type, a.dateVAL,
         (SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.id = a.id AND b.type = 'A' AND a.dateVAL >  b.dateVAL  )  as A_lag_1,
         (SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.id = a.id AND b.type = 'B' AND a.dateVAL >  b.dateVAL  )  as B_lag_1,
         (SELECT MAX(b.dateVAL)  FROM tab_A  b WHERE b.id = a.id AND b.type = 'C' AND a.dateVAL >  b.dateVAL  )  as C_lag_1
        FROM tab_A a
        )   dt
        

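        Here is the Fiddle link. This may not be the most efficient approach.

        【Discussion】:

        • Thanks for your answer. I am testing now whether it works :)
        • This approach works on the sample, but takes far too long on the full table (I let it run for 20 hours and it never finished)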