递归更新开始结束日期以避免重叠答案

【问题标题】：Recursive update of start-end date to avoid overlaps递归更新开始结束日期以避免重叠
【发布时间】：2023-03-19 17:31:01
【问题描述】：

我正在处理 MySQL 中有关意大利工作合同的庞大数据库（行数约为 2000 万）。我的核心表的每一行代表与特定雇主的工人签订的特定合同。为了重建每个工人的工作历史，我在导入过程中对表进行索引时，按工人的识别码和每份合同的开始日期对工人进行了排序。然后，每一行都有自己的渐进式 ID，但同时，我为每一行添加了两个字段，一个引用前一个 ID，另一个引用下一个。仅当前一个或后一个 ID 引用同一个工作人员时，这两个字段才有效地不为空。

我制作了一个小例子来说明我的数据看起来像 here（或者，在下面的脚本中，我创建了一个可重现的小例子）。

工人的历史可能是什么样子

最后应该如何改变

我当前的任务是计算我桌子上每个人的有效工作天数。尽管如此，数据无疑具有巨大重叠的特点。毕竟，每个人可能有几个重叠的合同。例如，一份于 2010 年 1 月 1 日开始并于 2012 年 1 月 1 日结束的合同可能紧随其后在 2012 年 1 月 1 日之前结束的其他几份较短的合同之后开始。因此，如果我计算这个人有效工作的天数，我可能会重复计算。出于这个原因，我想通过更改合同的结束日期来重新安排合同，以获得后续的不重叠合同。唯一可能的重叠可能是一天。我已经制作了一个图形示例，说明个人的工作历史可能是什么样的，以及我想如何在以下两张图片中重新排列它。

由于我无法修改每个合同/行的开始日期，我想根据之前的合同修改每个合同的结束日期。

我按照以下步骤工作：

如果上一个合同的结束日期大于当前合同（每一行）的结束日期，我修改了结束日期，使其等于上一个合同的结束日期。
由于我不知道有多少合同实际上是重叠的（如果每个合同都与前一个和下一个类似，但过去可能有一个重叠的合同），我决定以最大数量迭代这个过程个人可能在我的表中的合同。通过这个过程，我将重叠时间大大延长到这种重叠不再发生的情况。例如，示例中合同 n.3 的结束日期将延长至合同 n.4、n.5 和 n.6。在此迭代过程结束时，它们都将具有相同的结束日期，即今天 12。
完成此程序后，我修改了每份合同的结束日期，如果有重叠，则将其设置为与下一份合同的开始日期相同。

您可以在下面找到我用于此过程的代码。

-- My example table (data_example.csv on GitHub)

drop table if exists mytable;
create table mytable
( 
id INT, 
WORKER_ID INT not null,
EMPLOYER_ID INT not null,
dt_start date not null, -- Contract start date
dt_end date, -- Contract end date
id_prev INT, -- ID of previous contract
dt_start_prev date, -- Start date of previous contract
dt_end_prev date, -- End date of previous contract
id_next INT, -- ID of next contract
dt_start_next date, -- Start date of next contract
dt_end_next date, -- End date of next contract
primary key(id)
);

insert into mytable 
(id, WORKER_ID, EMPLOYER_ID, dt_start, dt_end, 
id_prev, dt_start_prev, dt_end_prev, 
id_next, dt_start_next, dt_end_next)
values
    (1, 5157, 3384722, '2012-01-01', '2012-01-03', NULL, NULL, NULL, 2, '2012-01-02', '2012-01-04'),
    (2, 5157, 3384722, '2012-01-02', '2012-01-04', 1, '2012-01-01', '2012-01-03', 3, '2012-01-04', '2012-01-12'),
    (3, 5157, 96120, '2012-01-04', '2012-01-12', 2, '2012-01-02', '2012-01-04', 4, '2012-01-07', '2012-01-08'),
    (4, 5157, 3384722, '2012-01-07', '2012-01-08', 3, '2012-01-04', '2012-01-12', 5, '2012-01-08', '2012-01-10'),
    (5, 5157, 3384722, '2012-01-08', '2012-01-10', 4, '2012-01-07', '2012-01-08', 6, '2012-01-10', '2012-01-11'),
    (6, 5157, 3954093, '2012-01-10', '2012-01-11', 5, '2012-01-08', '2012-01-10', 7, '2012-01-12', '2012-01-15'),
    (7, 5157, 3384722, '2012-01-12', '2012-01-15', 6, '2012-01-10', '2012-01-11', 8, '2012-01-14', '2012-01-16'),
    (8, 5157, 3954093, '2012-01-14', '2012-01-16', 7, '2012-01-12', '2012-01-15', 9, '2012-01-14', '2012-01-14'),
    (9, 5157, 3384722, '2012-01-14', '2012-01-14', 8, '2012-01-14', '2012-01-16', 10, '2012-01-14', '2012-01-20'),
    (10, 5157, 96120, '2012-01-14', '2012-01-20', 9, '2012-01-14', '2012-01-14', NULL, NULL, NULL),
    (11, 5990, 1940957, '2012-01-01', '2012-01-30', NULL, NULL, NULL, 12, '2012-02-01', '2012-02-15'),
    (12, 5990, 4822105, '2012-02-01', '2012-02-15', 11, '2012-01-01', '2012-01-30', 13, '2012-02-10', '2012-02-10'),
    (13, 5990, 1940957, '2012-02-10', '2012-02-10', 12, '2012-02-01', '2012-02-15', 14, '2012-02-16', '2012-02-20'),
    (14, 5990, 1940957, '2012-02-16', '2012-02-20', 13, '2012-02-10', '2012-02-10', 15, '2012-02-17', '2012-02-28'),
    (15, 5990, 4822105, '2012-02-17', '2012-02-28', 14, '2012-02-16', '2012-02-20', NULL, NULL, NULL);

-- The following table counts the number of contracts for each individual
-- I will use it the determine the maximum number of contract per worker

drop table if exists max_act;
create table max_act 
as select WORKER_ID, count(*) n 
from mytable 
group by WORKER_ID;

set SQL_SAFE_UPDATES = 0;

-- Here I create the procedure

drop procedure if exists doiterate;
delimiter //

create procedure doiterate()
begin
  declare total INT unsigned DEFAULT 0;

  -- The number of iterations is equal to the maximum value in the table 'max_act'

  while total <= (select MAX(n) from max_act) do

  -- If the end date of the previous contract is greater than the end of the current contract
  -- the procedure sets the end date equal to the end date of the previous contract

  update mytable a
    set a.dt_end =
        case 
        when a.dt_end is NOT null and a.dt_end_prev > a.dt_end then a.dt_end_prev
        else a.dt_end end
    ;
    
  -- Here I update in each row the end date of the previous contract

  update mytable a
  left outer join mytable p on a.id_prev = p.id
    set a.dt_end_prev =
        case 
        when a.dt_end_prev is NOT null and a.dt_end_prev != p.dt_end then p.dt_end
        else a.dt_end_prev end
    ;
    
    set total = total + 1;
  end while;
end//  

delimiter ;

CALL doiterate(); 

-- Here I set the end date of each contract equal to the beginning of the next one if there is overlapping

update mytable a
    set a.dt_end =
        case 
        when a.dt_end is NOT null and a.dt_start_next < a.dt_end then a.dt_start_next
        else a.dt_end end
    ;

set SQL_SAFE_UPDATES = 1;

但是，我认为这个程序几乎是最佳的。我估计要花几天时间才能结束。如果有人能给我一些关于如何处理这个问题的提示，我将不胜感激。提前谢谢你。

【问题讨论】：

@Strawberry，你是对的，我在做图形示例时出错了。我已经相应地更新了这个问题。非常感谢您的关注。
提供一个最小的、完整的、可重现的例子。我怀疑很多人会为下载您的 csv 文件而烦恼。相反，提供 CREATE 和 INSERT 语句和/或 sql fiddle；大概 10 行就足够了，每个行只有一个合同 ID、开始日期和结束日期
每个工人每个合同，选择结束日期为LEAST(end_date_from_this_row, start_date_from_next_row) - 下一行的开始日期可以由LEAD(startdate) OVER(PARTITION BY worker id ORDER BY startdate)给出
感谢@CaiusJard 的建议。我尝试使用LAG 函数修改第二个update。但是，该过程需要更长的时间。因此，我通过只考虑至少有一份重叠合同的工人（我以前没有想到它是愚蠢的）来减小表格的大小mytable，并且我只运行了这个“子样本”的程序。因此，通过大幅减少行数，现在整个过程应该是可行的。
如果需要，您可以发布自己的解决方案作为答案。它可以帮助未来的访问者

标签： mysql optimization procedure

【解决方案1】：

正如在一条评论中所述，我尝试使用 LAG() 和 LEAD() 函数按时间顺序连接个人的所有合同。然而，这个过程——也许是我的错——结果变得更慢了。因此，我只是决定只对那些实际上有至少两个重叠合同的工人运行该程序，这可能不是最好的解决方案（当然不是在编码方面），但至少我能够执行该程序（我大概花了 1 天半的时间）。

-- Here I am identifying contracts with an overlapping previous contract  
alter table mytable add column flag_overlap INT default 0;
update mytable set flag_overlap = 1 where dt_end is NOT null and dt_end_prev > dt_end;

-- Creating a table with only those workers with at least two overlapping contracts
drop table if exists mytable_id;
create table mytable_id as select WORKER_ID 
from mytable where flag_overlap = 1 
group by WORKER_ID;

-- This is my table of interests with all the contracts for those workers identified in the previous step
drop table if exists mytable_mod;
create table mytable_mod
as select *
from mytable a
inner join mytable_id b on a.WORKER_ID = b.WORKER_ID 
order by WORKER_ID , dt_start;
alter table mytable_mod add unique index idx_ord_id(id);

-- The rest of the code is the same as the one posted in this question, 
-- simply I referred to the table 'mytable_mod' and no longer to 'mytable'. 

-- [...]

-- At the end I updated the 'revised' end date of my original table 'mytable'

UPDATE mytable a
left outer join mytable_mod b on a.ord_all = b.ord_all
set
    a.dt_end = b.dt_end ,
    a.dt_end_next = b.dt_end_next ,
    a.dt_end_prev = b.dt_end_prev 
    ;

【讨论】：