Flattening intersecting timespans: answers

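【Question title】: Flattening intersecting timespans
【Posted】: 2010-11-01 03:31:17
【Question description】:

I have a large amount of data containing start and stop times for a given ID, and I need to flatten all intersecting and adjacent timespans into one combined timespan. The sample data posted below all belongs to the same ID, so I have not listed the ID.

To make things a bit clearer, take a look at the sample data for 03.06.2009:

The following timespans are overlapping or contiguous and need to be merged into one timespan: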

05:54:48 - 10:00:13
09:26:45 - 09:59:40

The resulting timespan runs from 05:54:48 to 10:00:13. Since there is a gap between 10:00:13 and 10:12:50, we also have the following timespans:

10:12:50 - 10:27:25
10:13:12 - 11:14:56
10:27:25 - 10:27:31
10:27:39 - 13:53:38
11:14:56 - 11:15:03
11:15:30 - 14:02:14
13:53:38 - 13:53:43
14:02:14 - 14:02:31

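These result in one merged timespan from 10:12:50 to 14:02:31, since they are overlapping or adjacent.

Below you will find the sample data and the flattened data I need. The Duration column is just informative.

Any solution (SQL or not) is appreciated.

EDIT: Since there are lots of different and interesting solutions, I am refining my original question by adding constraints, to see the "best" solution (if there is one) bubble up:

I am getting the data via ODBC from another system. I have no way to change the table layout or add indexes.
The data is indexed only by the date column (the time part is not indexed).
There are about 2.5k rows per day.
The estimated usage pattern of the data is roughly as follows:
- Most of the time (say 90%) the user will query just one or two days (2.5k - 5k rows)
- Sometimes (9%) the range will be up to a month (~75k rows)
- Rarely (1%) the range will be up to a year (~900k rows)
The query should be fast for the typical case and should not "last forever" for the rare case.
Querying a year's worth of data takes about 5 minutes (plain select without joins).

Within these constraints, what would be the best solution? I am afraid that most of the solutions will be horribly slow, since they join on the combination of date and time, which is not an indexed field in my case.

Would you do all the merging on the client or on the server side? Would you first create an optimized temp table and use one of the proposed solutions with that table? I have not had time to test the solutions yet, but I will keep you informed about which one works best for me.

Sample data: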

Date       | Start    | Stop
-----------+----------+---------
02.06.2009 | 05:55:28 | 09:58:27
02.06.2009 | 10:15:19 | 13:58:24
02.06.2009 | 13:58:24 | 13:58:43
03.06.2009 | 05:54:48 | 10:00:13
03.06.2009 | 09:26:45 | 09:59:40
03.06.2009 | 10:12:50 | 10:27:25
03.06.2009 | 10:13:12 | 11:14:56
03.06.2009 | 10:27:25 | 10:27:31
03.06.2009 | 10:27:39 | 13:53:38
03.06.2009 | 11:14:56 | 11:15:03
03.06.2009 | 11:15:30 | 14:02:14
03.06.2009 | 13:53:38 | 13:53:43
03.06.2009 | 14:02:14 | 14:02:31
04.06.2009 | 05:48:27 | 09:58:59
04.06.2009 | 06:00:00 | 09:59:07
04.06.2009 | 10:15:52 | 13:54:52
04.06.2009 | 10:16:01 | 13:24:20
04.06.2009 | 13:24:20 | 13:24:24
04.06.2009 | 13:24:32 | 14:00:39
04.06.2009 | 13:54:52 | 13:54:58
04.06.2009 | 14:00:39 | 14:00:49
05.06.2009 | 05:53:58 | 09:59:12
05.06.2009 | 10:16:05 | 13:59:08
05.06.2009 | 13:59:08 | 13:59:16
06.06.2009 | 06:04:00 | 10:00:00
06.06.2009 | 10:16:54 | 10:18:40
06.06.2009 | 10:18:40 | 10:18:45
06.06.2009 | 10:23:00 | 13:57:00
06.06.2009 | 10:23:48 | 13:57:54
06.06.2009 | 13:57:21 | 13:57:38
06.06.2009 | 13:57:54 | 13:57:58
07.06.2009 | 21:59:30 | 01:58:49
07.06.2009 | 22:12:16 | 01:58:39
07.06.2009 | 22:12:25 | 01:58:28
08.06.2009 | 02:10:33 | 05:56:11
08.06.2009 | 02:10:43 | 05:56:23
08.06.2009 | 02:10:49 | 05:55:59
08.06.2009 | 05:55:59 | 05:56:01
08.06.2009 | 05:56:11 | 05:56:14
08.06.2009 | 05:56:23 | 05:56:27

Flattened result:

Date       | Start    | Stop     | Duration
-----------+----------+----------+---------
02.06.2009 | 05:55:28 | 09:58:27 | 04:02:59
02.06.2009 | 10:15:19 | 13:58:43 | 03:43:24
03.06.2009 | 05:54:48 | 10:00:13 | 04:05:25
03.06.2009 | 10:12:50 | 14:02:31 | 03:49:41
04.06.2009 | 05:48:27 | 09:59:07 | 04:10:40
04.06.2009 | 10:15:52 | 14:00:49 | 03:44:57
05.06.2009 | 05:53:58 | 09:59:12 | 04:05:14
05.06.2009 | 10:16:05 | 13:59:16 | 03:43:11
06.06.2009 | 06:04:00 | 10:00:00 | 03:56:00
06.06.2009 | 10:16:54 | 10:18:45 | 00:01:51
06.06.2009 | 10:23:00 | 13:57:58 | 03:34:58
07.06.2009 | 21:59:30 | 01:58:49 | 03:59:19
08.06.2009 | 02:10:33 | 05:56:27 | 03:45:54
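
As a sanity check on the Duration column, here is a minimal Python sketch that turns one `Date | Start | Stop` row into a pair of datetimes. It assumes, as the 07.06.2009 rows suggest, that a Stop earlier than Start means the span ends on the following day. The function names `parse_span` and `duration` are made up for illustration.

```python
from datetime import datetime, timedelta

ROW_FORMAT = "%d.%m.%Y %H:%M:%S"

def parse_span(date_str, start_str, stop_str):
    """Return (start, stop) datetimes for one row of the sample table."""
    start = datetime.strptime(f"{date_str} {start_str}", ROW_FORMAT)
    stop = datetime.strptime(f"{date_str} {stop_str}", ROW_FORMAT)
    # Assumption: a stop time earlier than the start time crossed midnight.
    if stop < start:
        stop += timedelta(days=1)
    return start, stop

def duration(date_str, start_str, stop_str):
    """Length of the span as a timedelta."""
    start, stop = parse_span(date_str, start_str, stop_str)
    return stop - start

print(duration("07.06.2009", "21:59:30", "01:58:49"))  # 3:59:19
```

Normalising the rows to full datetimes first keeps the midnight special case out of whichever merge logic is applied afterwards.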

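【Question discussion】:

Can you tell whether there are ever more than 24 hours between a start and a stop time? Or is that not an issue in your data set?
@Ed: The timespans are mostly within one shift, i.e. from 06:00 to 14:00, 14:00 to 22:00 and 22:00 to 06:00. As you can see, they usually start a little earlier (e.g. 5:55) and end a little later.
On the one hand you write "about 2.5k rows per day", on the other "the timespans are mostly within one shift" of 8 hours. So how long are these intervals you want to join, typically?
@Matt: What I meant is that merging the intersecting and adjacent timespans results in timespans that are usually within one shift.
To answer Ed's initial question: No, there will never be more than 24 hours between a start and a stop time.

Tags: sql sql-server sql-server-2005 algorithm datetime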

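【Solution 1】:

Here is a SQL-only solution. I used DATETIME for the columns. Storing the time separately is a mistake in my opinion, as you will have problems when the times run past midnight. You can adjust this to handle that situation if you need to. The solution also assumes that the start and end times are NOT NULL. Again, you can adjust as needed if that is not the case.

The general gist of the solution is to get all of the start times that do not overlap with any other spans, get all of the end times that do not overlap with any spans, then match the two together.

The results match your expected results except in one case, where a manual check suggests your expected output is wrong. On the 6th there should be a span that ends at 2009-06-06 10:18:45.000.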

SELECT
     ST.start_time,
     ET.end_time
FROM
(
     SELECT
          T1.start_time
     FROM
          dbo.Test_Time_Spans T1
     LEFT OUTER JOIN dbo.Test_Time_Spans T2 ON
          T2.start_time < T1.start_time AND
          T2.end_time >= T1.start_time
     WHERE
          T2.start_time IS NULL
) AS ST
INNER JOIN
(
     SELECT
          T3.end_time
     FROM
          dbo.Test_Time_Spans T3
     LEFT OUTER JOIN dbo.Test_Time_Spans T4 ON
          T4.end_time > T3.end_time AND
          T4.start_time <= T3.end_time
     WHERE
          T4.start_time IS NULL
) AS ET ON
     ET.end_time > ST.start_time
LEFT OUTER JOIN
(
     SELECT
          T5.end_time
     FROM
          dbo.Test_Time_Spans T5
     LEFT OUTER JOIN dbo.Test_Time_Spans T6 ON
          T6.end_time > T5.end_time AND
          T6.start_time <= T5.end_time
     WHERE
          T6.start_time IS NULL
) AS ET2 ON
     ET2.end_time > ST.start_time AND
     ET2.end_time < ET.end_time
WHERE
     ET2.end_time IS NULL
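
For readers who find the three-way join hard to follow, the same idea can be sketched in Python. This is only an illustration of the logic, not a replacement for the query: `flatten_set_based` and the helper `t` are hypothetical names, spans are assumed to have start < end, and the sets play the role of a DISTINCT, so two records sharing a start or end time do not produce duplicate output.

```python
from datetime import datetime

def flatten_set_based(spans):
    """spans: list of (start, end) pairs with start < end."""
    # Starts not covered by a span that began strictly earlier (the ST subquery).
    starts = sorted({s for s, _ in spans
                     if not any(s2 < s <= e2 for s2, e2 in spans)})
    # Ends not covered by a span that finishes strictly later (the ET subquery).
    ends = sorted({e for _, e in spans
                   if not any(s2 <= e < e2 for s2, e2 in spans)})
    # Pair each start with the smallest surviving end after it (the ET2 anti-join).
    return [(s, min(e for e in ends if e > s)) for s in starts]

def t(hms):
    # Hypothetical helper: a time of day on 03.06.2009 from the question's sample.
    return datetime.strptime("2009-06-03 " + hms, "%Y-%m-%d %H:%M:%S")

sample = [(t("05:54:48"), t("10:00:13")), (t("09:26:45"), t("09:59:40")),
          (t("10:12:50"), t("10:27:25")), (t("10:13:12"), t("11:14:56")),
          (t("10:27:25"), t("10:27:31"))]
merged = flatten_set_based(sample)
# merged holds two spans: 05:54:48 - 10:00:13 and 10:12:50 - 11:14:56
```

Like the query, this compares every span with every other span, so it is quadratic in the number of rows and better suited to explaining the approach than to the year-long ranges mentioned in the question.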

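【Discussion】:

You are right, I missed the span from 10:16:54 to 10:18:45. I have corrected the expected result accordingly. I am getting the data via ODBC from another system and cannot change the underlying table format.
Thanks for the great solution. It works for me too, except when I have two records with exactly the same end time. I set any NULL end time to the current timestamp to get the calculation I want, but then I get duplicate results from this query. I could just use DISTINCT, but I was wondering whether there is an easy way to modify it to account for duplicate start/end times. Also, does this handle the case where there is only one record?

【Solution 2】:

In MySQL: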

SELECT  grouper, MIN(start) AS group_start, MAX(end) AS group_end
FROM    (
        SELECT  start,
                end,
                @r := @r + (@edate < start) AS grouper,
                @edate := GREATEST(end, CAST(@edate AS DATETIME))
        FROM    (
                SELECT  @r := 0,
                        @edate := CAST('0000-01-01' AS DATETIME)
                ) vars,
                (
                SELECT  rn_date + INTERVAL TIME_TO_SEC(rn_start) SECOND AS start,
                        rn_date + INTERVAL TIME_TO_SEC(rn_end) SECOND + INTERVAL (rn_start > rn_end) DAY AS end
                FROM    t_ranges
                ) q
        ORDER BY
                start
        ) q
GROUP BY
        grouper
ORDER BY
        group_start
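
The running-variable trick in this query may be easier to see outside SQL. The following Python sketch mimics it under the assumption that the rows have already been converted to full datetimes (what the inner `q` subquery does): `running_end` stands in for `@edate`, `grouper` for `@r`, and the function name is invented for this example.

```python
from datetime import datetime
from itertools import groupby

def flatten_with_grouper(spans):
    """spans: iterable of (start, end) pairs; returns merged (start, end) pairs."""
    labelled = []
    grouper = 0
    running_end = None  # plays the role of @edate
    for start, end in sorted(spans):  # ORDER BY start
        # A new group begins when this span starts after everything seen so far.
        if running_end is not None and running_end < start:
            grouper += 1
        running_end = end if running_end is None else max(running_end, end)
        labelled.append((grouper, start, end))
    # Equivalent of GROUP BY grouper with MIN(start) and MAX(end).
    result = []
    for _, rows in groupby(labelled, key=lambda row: row[0]):
        rows = list(rows)
        result.append((min(r[1] for r in rows), max(r[2] for r in rows)))
    return result

night_shift = [
    (datetime(2009, 6, 7, 21, 59, 30), datetime(2009, 6, 8, 1, 58, 49)),
    (datetime(2009, 6, 7, 22, 12, 16), datetime(2009, 6, 8, 1, 58, 39)),
    (datetime(2009, 6, 8, 2, 10, 33), datetime(2009, 6, 8, 5, 56, 11)),
]
groups = flatten_with_grouper(night_shift)
# groups holds two spans: 07.06 21:59:30 - 08.06 01:58:49 and 08.06 02:10:33 - 05:56:11
```

Because the group label only changes when a gap appears, one sorted pass is enough; no span ever has to be compared with more than the running maximum end.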

The following article in my blog describes the same solution for SQL Server:

Flattening timespans: SQL Server

Here is the function that does it:

DROP FUNCTION fn_spans
GO
CREATE FUNCTION fn_spans(@p_from DATETIME, @p_till DATETIME)
RETURNS @t TABLE
        (
        q_start DATETIME NOT NULL,
        q_end DATETIME NOT NULL
        )
AS
BEGIN
        DECLARE @qs DATETIME
        DECLARE @qe DATETIME
        DECLARE @ms DATETIME
        DECLARE @me DATETIME
        DECLARE cr_span CURSOR FAST_FORWARD
        FOR
        SELECT  s_date + s_start AS q_start,
                s_date + s_stop + CASE WHEN s_start < s_stop THEN 0 ELSE 1 END AS q_end
        FROM    t_span
        WHERE   s_date BETWEEN @p_from - 1 AND @p_till
                AND s_date + s_start >= @p_from
                AND s_date + s_stop <= @p_till
        ORDER BY
                q_start
        OPEN    cr_span
        FETCH   NEXT
        FROM    cr_span
        INTO    @qs, @qe
        SET @ms = @qs
        SET @me = @qe
        WHILE @@FETCH_STATUS = 0
        BEGIN
                FETCH   NEXT
                FROM    cr_span
                INTO    @qs, @qe
                IF @qs > @me
                BEGIN
                        INSERT
                        INTO    @t
                        VALUES (@ms, @me)
                        SET @ms = @qs
                END
                SET @me = CASE WHEN @qe > @me THEN @qe ELSE @me END
        END
        IF @ms IS NOT NULL 
        BEGIN
                INSERT
                INTO    @t
                VALUES  (@ms, @me)
        END
        CLOSE   cr_span
        DEALLOCATE cr_span
        RETURN
END

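Since SQL Server lacks an easy way to refer to previously selected rows in a result set, this is one of the rare cases where a cursor in SQL Server works faster than a set-based solution.

Tested on 1,440,000 rows, it takes 24 seconds for the full set and is almost instant for a range of a day or two.

Note the additional condition in the SELECT query: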

s_date BETWEEN @p_from - 1 AND @p_till

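It may seem redundant, but it is actually a coarse filter that makes your index on s_date usable.

【Discussion】:

@David: Whenever two adjacent timespans do not intersect, it increments grouper, so all intersecting timespans end up in one group. It then returns the MIN and MAX dates for each group.

【Solution 3】:

On a similar question here: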

Min effective and termdate for contiguous dates

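FWIW I up-voted the one that recommended Joe Celko's SQL For Smarties, Third Edition -- repeat: the THIRD edition (2005) -- which discusses various approaches, set-based and procedural.

【Discussion】:

Thanks for the hint, especially the book tip. It is a must-read for me :)

【Solution 4】:

Assuming you:

have some sort of simple custom Date object that stores a start date/time and an end date/time
get the rows back in sorted order (by start date/time) as a list, L, of these Dates
want to create a list of flattened Dates, F

Do the following: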

first = first row in L
flat_date.start = first.start, flat_date.end = first.end
For each remaining row in L:
    if row.start <= flat_date.end: // overlapping, adjacent or contained: extend the current timespan
        flat_date.end = max(flat_date.end, row.end)
    else: // gap: ending a timespan and starting a new one
        add flat_date to F
        flat_date.start = row.start, flat_date.end = row.end
add flat_date to F // adding the last timespan to the flattened list
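
A minimal Python sketch of this single pass, using plain `(start, end)` tuples instead of a custom Date object and assuming the rows are already sorted by start. The merge test here is `start <= current end`, so spans that merely touch the current one, or lie entirely inside it, are folded in as well. The name `flatten` is illustrative.

```python
from datetime import datetime

def flatten(rows):
    """rows: (start, end) pairs already sorted by start. Returns merged pairs."""
    flat = []
    for start, end in rows:
        if flat and start <= flat[-1][1]:
            # Overlapping, adjacent or contained: extend the current timespan.
            flat[-1][1] = max(flat[-1][1], end)
        else:
            # Gap (or very first row): start a new timespan.
            flat.append([start, end])
    return [tuple(span) for span in flat]

def t(hms):
    # Hypothetical helper: a time of day on 06.06.2009 from the question's sample.
    return datetime.strptime("2009-06-06 " + hms, "%Y-%m-%d %H:%M:%S")

rows = [(t("06:04:00"), t("10:00:00")), (t("10:16:54"), t("10:18:40")),
        (t("10:18:40"), t("10:18:45")), (t("10:23:00"), t("13:57:00")),
        (t("10:23:48"), t("13:57:54")), (t("13:57:21"), t("13:57:38")),
        (t("13:57:54"), t("13:57:58"))]
flattened = flatten(rows)
# flattened holds three spans, matching the 06.06.2009 rows of the expected result:
# 06:04:00 - 10:00:00, 10:16:54 - 10:18:45, 10:23:00 - 13:57:58
```

Since the pass keeps only the current span in memory, it can run on the client while rows stream in, which fits the constraint that only the date column is indexed.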

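【Discussion】:

Thanks, that looks promising. I will give it a try.

【Solution 5】:

Here is a recursive CTE solution, but I took the liberty of assigning a date and time to each column rather than pulling the date out separately. This helps avoid some messy special-case code. If you must store the date separately, I would use a CTE or a view to make it look like two datetime columns and go with this approach.

Create test data: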

create table t1 (d1 datetime, d2 datetime)

insert t1 (d1,d2)
    select           '2009-06-03 10:00:00', '2009-06-03 14:00:00'
    union all select '2009-06-03 13:55:00', '2009-06-03 18:00:00'
    union all select '2009-06-03 17:55:00', '2009-06-03 23:00:00'
    union all select '2009-06-03 22:55:00', '2009-06-04 03:00:00'

    union all select '2009-06-04 03:05:00', '2009-06-04 07:00:00'

    union all select '2009-06-04 07:05:00', '2009-06-04 10:00:00'
    union all select '2009-06-04 09:55:00', '2009-06-04 14:00:00'

Recursive CTE:

;with dateRanges (ancestorD1, parentD1, d2, iter) as
(
--anchor is first level of collapse
    select
        d1 as ancestorD1,
        d1 as parentD1,
        d2,
        cast(0 as int) as iter
    from t1

--recurse as long as there is another range to fold in
    union all select
        tLeft.ancestorD1,
        tRight.d1 as parentD1,
        tRight.d2,
        iter + 1  as iter
    from dateRanges as tLeft join t1 as tRight
        --join condition is that the t1 row can be consumed by the recursive row
        on tLeft.d2 between tRight.d1 and tRight.d2
            --exclude identical rows
            and not (tLeft.parentD1 = tRight.d1 and tLeft.d2 = tRight.d2)
)
select
    ranges1.*
from dateRanges as ranges1
where not exists (
    select 1
    from dateRanges as ranges2
    where ranges1.ancestorD1 between ranges2.ancestorD1 and ranges2.d2
        and ranges1.d2 between ranges2.ancestorD1 and ranges2.d2
        and ranges2.iter > ranges1.iter
)

This gives the output:

ancestorD1              parentD1                d2                      iter
----------------------- ----------------------- ----------------------- -----------
2009-06-04 03:05:00.000 2009-06-04 03:05:00.000 2009-06-04 07:00:00.000 0
2009-06-04 07:05:00.000 2009-06-04 09:55:00.000 2009-06-04 14:00:00.000 1
2009-06-03 10:00:00.000 2009-06-03 22:55:00.000 2009-06-04 03:00:00.000 3

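【Discussion】:

Wow.. I need to get my head around this approach.. give me a little time :)
Why not just use the demo data, for convenience?
I tried it with the original data, but it did not work as required. I cannot paste the adapted code together with the sample data because it is too long for a comment. :(

【Solution 6】:

To help answer the question, here is the sample data given in the question, using a table variable as in Hainstech's answer: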

declare @T1 table (d1 datetime, d2 datetime)

insert @T1 (d1,d2)
select           '02 June 2009 05:55:28','02 June 2009 09:58:27'
union all select '02 June 2009 10:15:19','02 June 2009 13:58:24'
union all select '02 June 2009 13:58:24','02 June 2009 13:58:43'
union all select '03 June 2009 05:54:48','03 June 2009 10:00:13'
union all select '03 June 2009 09:26:45','03 June 2009 09:59:40'
union all select '03 June 2009 10:12:50','03 June 2009 10:27:25'
union all select '03 June 2009 10:13:12','03 June 2009 11:14:56'
union all select '03 June 2009 10:27:25','03 June 2009 10:27:31'
union all select '03 June 2009 10:27:39','03 June 2009 13:53:38'
union all select '03 June 2009 11:14:56','03 June 2009 11:15:03'
union all select '03 June 2009 11:15:30','03 June 2009 14:02:14'
union all select '03 June 2009 13:53:38','03 June 2009 13:53:43'
union all select '03 June 2009 14:02:14','03 June 2009 14:02:31'
union all select '04 June 2009 05:48:27','04 June 2009 09:58:59'
union all select '04 June 2009 06:00:00','04 June 2009 09:59:07'
union all select '04 June 2009 10:15:52','04 June 2009 13:54:52'
union all select '04 June 2009 10:16:01','04 June 2009 13:24:20'
union all select '04 June 2009 13:24:20','04 June 2009 13:24:24'
union all select '04 June 2009 13:24:32','04 June 2009 14:00:39'
union all select '04 June 2009 13:54:52','04 June 2009 13:54:58'
union all select '04 June 2009 14:00:39','04 June 2009 14:00:49'
union all select '05 June 2009 05:53:58','05 June 2009 09:59:12'
union all select '05 June 2009 10:16:05','05 June 2009 13:59:08'
union all select '05 June 2009 13:59:08','05 June 2009 13:59:16'
union all select '06 June 2009 06:04:00','06 June 2009 10:00:00'
union all select '06 June 2009 10:16:54','06 June 2009 10:18:40'
union all select '06 June 2009 10:18:40','06 June 2009 10:18:45'
union all select '06 June 2009 10:23:00','06 June 2009 13:57:00'
union all select '06 June 2009 10:23:48','06 June 2009 13:57:54'
union all select '06 June 2009 13:57:21','06 June 2009 13:57:38'
union all select '06 June 2009 13:57:54','06 June 2009 13:57:58'
union all select '07 June 2009 21:59:30','08 June 2009 01:58:49'
union all select '07 June 2009 22:12:16','08 June 2009 01:58:39'
union all select '07 June 2009 22:12:25','08 June 2009 01:58:28'
union all select '08 June 2009 02:10:33','08 June 2009 05:56:11'
union all select '08 June 2009 02:10:43','08 June 2009 05:56:23'
union all select '08 June 2009 02:10:49','08 June 2009 05:55:59'
union all select '08 June 2009 05:55:59','08 June 2009 05:56:01'
union all select '08 June 2009 05:56:11','08 June 2009 05:56:14'
union all select '08 June 2009 05:56:23','08 June 2009 05:56:27'

【Discussion】:

【Solution 7】:

Extending MahlerFive's answer, I wrote a quick extension for DateTools. So far it has passed all my tests.

extension DTTimePeriodCollection {

    func flatten() {

        self.sortByStartAscending()

        guard let periods = self.periods() else { return }
        if periods.count < 1 { return }

        var flattenedPeriods = [DTTimePeriod]()
        let flatdate = DTTimePeriod()

        for period in periods {

            guard let periodStart = period.StartDate, let periodEnd = period.EndDate else { continue }

            if !flatdate.hasStartDate() { flatdate.StartDate = periodStart }
            if !flatdate.hasEndDate() { flatdate.EndDate = periodEnd }

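            if periodStart.isEarlierThanOrEqualTo(flatdate.EndDate) {

                // overlapping, adjacent or contained: extend only if this period ends later
                if periodEnd.isGreaterThanOrEqualTo(flatdate.EndDate) { flatdate.EndDate = periodEnd }

            } else {

                // gap: close the current flattened period and start a new one
                flattenedPeriods.append(flatdate.copy())
                flatdate.StartDate = periodStart
                flatdate.EndDate = periodEnd
            }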
        }

        flattenedPeriods.append(flatdate.copy())

        // delete all periods
        for _ in 0..<periods.count { self.removeTimePeriodAtIndex(0) }

        // add flattened periods to self
        for flat in flattenedPeriods { self.addTimePeriod(flat) }
    }
}

【Discussion】: