【Question title】: Grouping Data by Changing Status Over Time
【Posted】: 2025-12-16 02:50:01
【Question】:

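I am trying to assign group numbers to distinct runs of rows in a data set whose values change over time. In my example, the fields that change are tran_seq, prog_id, deg_id, cur_id, and enroll_status. Whenever any of these fields differs from the previous row, I need a new group number. When the fields are the same as in the previous row, the group number should stay the same. When I try ROW_NUMBER(), RANK(), or DENSE_RANK(), the value increases within the same group (for example, the first 2 rows in the sample). I think I need ORDER BY start_date because this is temporal data.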

+----+----------+---------+--------+--------+---------------+------------+------------+---------+
|    | tran_seq | prog_id | deg_id | cur_id | enroll_status | start_date |  end_date  | desired |
+----+----------+---------+--------+--------+---------------+------------+------------+---------+
| 1  |    1     |   6     |   9    |   3    |     ENRL      | 2004-08-22 | 2004-12-11 |    1    |
| 2  |    1     |   6     |   9    |   3    |     ENRL      | 2006-01-10 | 2006-05-06 |    1    |
| 3  |    1     |   6     |   9    |   59   |     ENRL      | 2006-08-29 | 2006-12-16 |    2    |
| 4  |    2     |   12    |   23   |   45   |     ENRL      | 2014-01-21 | 2014-05-16 |    3    |
| 5  |    2     |   12    |   23   |   45   |     ENRL      | 2014-08-18 | 2014-12-05 |    3    |
| 6  |    2     |   12    |   23   |   45   |     LOAP      | 2015-01-20 | 2015-05-15 |    4    |
| 7  |    2     |   12    |   23   |   45   |     ENRL      | 2015-08-25 | 2015-12-11 |    5    |
| 8  |    2     |   12    |   23   |   45   |     LOAP      | 2016-01-12 | 2016-05-06 |    6    |
| 9  |    2     |   12    |   23   |   45   |     ENRL      | 2016-05-16 | 2016-08-05 |    7    |
| 10 |    2     |   12    |   23   |   45   |     LOAJ      | 2016-08-23 | 2016-12-02 |    8    |
| 11 |    2     |   12    |   23   |   45   |     ENRL      | 2017-01-18 | 2017-05-05 |    9    |
| 12 |    2     |   12    |   23   |   45   |     ENRL      | 2018-01-17 | 2018-05-11 |    9    |
+----+----------+---------+--------+--------+---------------+------------+------------+---------+

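Once I have the group numbers, I think I can group by them to get what I ultimately want: a timeline of the distinct statuses with start and end dates. For the sample data above, that would be: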

+---+----------+---------+--------+--------+---------------+------------+------------+
|   | tran_seq | prog_id | deg_id | cur_id | enroll_status | start_date |  end_date  |
+---+----------+---------+--------+--------+---------------+------------+------------+
| 1 |    1     |   6     |   9    |   3    |     ENRL      | 2004-08-22 | 2006-05-06 |
| 2 |    1     |   6     |   9    |   59   |     ENRL      | 2006-08-29 | 2006-12-16 |
| 3 |    2     |   12    |   23   |   45   |     ENRL      | 2014-01-21 | 2014-12-05 |
| 4 |    2     |   12    |   23   |   45   |     LOAP      | 2015-01-20 | 2015-05-15 |
| 5 |    2     |   12    |   23   |   45   |     ENRL      | 2015-08-25 | 2015-12-11 |
| 6 |    2     |   12    |   23   |   45   |     LOAP      | 2016-01-12 | 2016-05-06 |
| 7 |    2     |   12    |   23   |   45   |     ENRL      | 2016-05-16 | 2016-08-05 |
| 8 |    2     |   12    |   23   |   45   |     LOAJ      | 2016-08-23 | 2016-12-02 |
| 9 |    2     |   12    |   23   |   45   |     ENRL      | 2017-01-18 | 2018-05-11 |
+---+----------+---------+--------+--------+---------------+------------+------------+

【Comments on the question】:

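  • Have you tried using PARTITION BY in ROW_NUMBER()?
  • @FLICKER That would return exactly the opposite of what the OP is looking for.
  • @FLICKER Yes. I created a column using: [grp] = ROW_NUMBER() OVER (PARTITION BY tran_seq, prog_id, deg_id, cur_id, enroll_status ORDER BY start_date). The problem is that it assigns different numbers to the first 2 rows instead of the same number.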

Tags: sql-server sql-server-2019


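【Solution 1】:

This is a classic XY problem: you are asking about an intermediate step of one particular solution rather than about the problem itself.

However, since you included your overall end goal as an addendum, here is how you can reach it without that intermediate step: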

declare @t table(tran_seq int, prog_id int, deg_id int, cur_id int, enroll_status varchar(4), start_date date, end_date  date, desired int)
insert into @t values
 (1,6,9,3   ,'ENRL','2004-08-22','2004-12-11',1)
,(1,6,9,3   ,'ENRL','2006-01-10','2006-05-06',1)
,(1,6,9,59  ,'ENRL','2006-08-29','2006-12-16',2)
,(2,12,23,45,'ENRL','2014-01-21','2014-05-16',3)
,(2,12,23,45,'ENRL','2014-08-18','2014-12-05',3)
,(2,12,23,45,'LOAP','2015-01-20','2015-05-15',4)
,(2,12,23,45,'ENRL','2015-08-25','2015-12-11',5)
,(2,12,23,45,'LOAP','2016-01-12','2016-05-06',6)
,(2,12,23,45,'ENRL','2016-05-16','2016-08-05',7)
,(2,12,23,45,'LOAJ','2016-08-23','2016-12-02',8)
,(2,12,23,45,'ENRL','2017-01-18','2017-05-05',9)
,(2,12,23,45,'ENRL','2018-01-17','2018-05-11',9)
;

select tran_seq
      ,prog_id
      ,deg_id
      ,cur_id
      ,enroll_status
      ,min(start_date) as start_date
      ,max(end_date) as end_date
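from(select *
            -- Gaps-and-islands: the overall row number minus the row number within each
            -- combination of the five columns stays constant across consecutive matching rows.
           ,row_number() over (order by end_date)
          - row_number() over (partition by tran_seq,prog_id,deg_id,cur_id,enroll_status order by end_date) as grp
     from @t
    ) AS g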
group by tran_seq
        ,prog_id
        ,deg_id
        ,cur_id
        ,enroll_status
        ,grp
order by start_date;

Output

+----------+---------+--------+--------+---------------+------------+------------+
| tran_seq | prog_id | deg_id | cur_id | enroll_status | start_date |  end_date  |
+----------+---------+--------+--------+---------------+------------+------------+
|        1 |       6 |      9 |      3 | ENRL          | 2004-08-22 | 2006-05-06 |
|        1 |       6 |      9 |     59 | ENRL          | 2006-08-29 | 2006-12-16 |
|        2 |      12 |     23 |     45 | ENRL          | 2014-01-21 | 2014-12-05 |
|        2 |      12 |     23 |     45 | LOAP          | 2015-01-20 | 2015-05-15 |
|        2 |      12 |     23 |     45 | ENRL          | 2015-08-25 | 2015-12-11 |
|        2 |      12 |     23 |     45 | LOAP          | 2016-01-12 | 2016-05-06 |
|        2 |      12 |     23 |     45 | ENRL          | 2016-05-16 | 2016-08-05 |
|        2 |      12 |     23 |     45 | LOAJ          | 2016-08-23 | 2016-12-02 |
|        2 |      12 |     23 |     45 | ENRL          | 2017-01-18 | 2018-05-11 |
+----------+---------+--------+--------+---------------+------------+------------+
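To see why the difference of the two row numbers works, it may help to look at the intermediate values row by row. The following is a minimal sketch that rebuilds the sample data (without the desired column) and exposes both counters; the column aliases rn_all and rn_part are illustrative names, not part of the original query.

```sql
-- Sample data from the question, re-declared so this batch runs on its own.
declare @t table(tran_seq int, prog_id int, deg_id int, cur_id int,
                 enroll_status varchar(4), start_date date, end_date date);
insert into @t values
 (1,6,9,3   ,'ENRL','2004-08-22','2004-12-11')
,(1,6,9,3   ,'ENRL','2006-01-10','2006-05-06')
,(1,6,9,59  ,'ENRL','2006-08-29','2006-12-16')
,(2,12,23,45,'ENRL','2014-01-21','2014-05-16')
,(2,12,23,45,'ENRL','2014-08-18','2014-12-05')
,(2,12,23,45,'LOAP','2015-01-20','2015-05-15')
,(2,12,23,45,'ENRL','2015-08-25','2015-12-11')
,(2,12,23,45,'LOAP','2016-01-12','2016-05-06')
,(2,12,23,45,'ENRL','2016-05-16','2016-08-05')
,(2,12,23,45,'LOAJ','2016-08-23','2016-12-02')
,(2,12,23,45,'ENRL','2017-01-18','2017-05-05')
,(2,12,23,45,'ENRL','2018-01-17','2018-05-11');

select cur_id
      ,enroll_status
      ,start_date
       -- position of the row in the whole timeline
      ,row_number() over (order by end_date) as rn_all
       -- position of the row among rows with the same five values
      ,row_number() over (partition by tran_seq,prog_id,deg_id,cur_id,enroll_status
                          order by end_date) as rn_part
       -- both counters advance together while the values repeat, so the difference is constant
      ,row_number() over (order by end_date)
     - row_number() over (partition by tran_seq,prog_id,deg_id,cur_id,enroll_status
                          order by end_date) as grp
from @t
order by end_date;
```

For this data, grp comes out as 0, 0, 2, 3, 3, 5, 4, 6, 5, 9, 6, 6 in end_date order. The value is neither sequential nor unique on its own: 5 appears for both the 2015 LOAP row and the 2016-05-16 ENRL row. That is why the outer query groups by grp together with the five columns rather than by grp alone.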

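【Comments】:

  • @iamdave Thank you very much for your answer. I knew the outer query and asked my question the way I did because I was having trouble with the subquery part. I saw Gordon Linoff's answer here *.com/questions/30814089/grouping-on-status-change and used it to filter my larger data set down to this. ROW_NUMBER() - DENSE_RANK() is great, thank you!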
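If the sequential group number itself is needed (the desired column in the question), one possible approach is to flag each row that differs from its predecessor with LAG() and then take a running sum of the flags. This is a sketch of an alternative technique, not the method used in the answer above; the names is_change and grp_no are illustrative, and it assumes the five compared columns are never NULL and that start_date is unique.

```sql
-- Sample data from the question, re-declared so this batch runs on its own.
declare @t table(tran_seq int, prog_id int, deg_id int, cur_id int,
                 enroll_status varchar(4), start_date date, end_date date);
insert into @t values
 (1,6,9,3   ,'ENRL','2004-08-22','2004-12-11')
,(1,6,9,3   ,'ENRL','2006-01-10','2006-05-06')
,(1,6,9,59  ,'ENRL','2006-08-29','2006-12-16')
,(2,12,23,45,'ENRL','2014-01-21','2014-05-16')
,(2,12,23,45,'ENRL','2014-08-18','2014-12-05')
,(2,12,23,45,'LOAP','2015-01-20','2015-05-15')
,(2,12,23,45,'ENRL','2015-08-25','2015-12-11')
,(2,12,23,45,'LOAP','2016-01-12','2016-05-06')
,(2,12,23,45,'ENRL','2016-05-16','2016-08-05')
,(2,12,23,45,'LOAJ','2016-08-23','2016-12-02')
,(2,12,23,45,'ENRL','2017-01-18','2017-05-05')
,(2,12,23,45,'ENRL','2018-01-17','2018-05-11');

select s.*
       -- running total of the change flags = sequential group number
      ,sum(s.is_change) over (order by s.start_date rows unbounded preceding) as grp_no
from(select *
            -- 0 when all five values match the previous row, otherwise 1.
            -- On the first row LAG() returns NULL, the comparison is not true, and the flag is 1.
           ,case when lag(tran_seq)      over (order by start_date) = tran_seq
                  and lag(prog_id)       over (order by start_date) = prog_id
                  and lag(deg_id)        over (order by start_date) = deg_id
                  and lag(cur_id)        over (order by start_date) = cur_id
                  and lag(enroll_status) over (order by start_date) = enroll_status
                 then 0 else 1 end as is_change
     from @t
    ) as s
order by s.start_date;
```

On the sample rows, grp_no comes out as 1, 1, 2, 3, 3, 4, 5, 6, 7, 8, 9, 9, matching the desired column. If any of the compared columns can be NULL, the equality tests would treat every such row as a change, so the comparison would need to be adapted.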