【Question Title】: PostgreSQL query to delete records with overlapping times while preserving the earliest?
【Posted】: 2017-12-19 05:49:24
【Question】:

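I'm trying to figure out a way to delete records with overlapping times, but I haven't been able to find a simple and elegant way to delete all but one of the overlapping records. This question is similar to this one, but with a few differences. Our table looks like this: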

╔════╤═══════════════════════════════════════╤══════════════════════════════════════╤════════╤═════════╗
║ id │ start_time                            │ end_time                             │ bar    │ baz     ║
╠════╪═══════════════════════════════════════╪══════════════════════════════════════╪════════╪═════════╣
║ 0  │ Mon, 18 Dec 2017 16:08:33 UTC +00:00  │ Mon, 18 Dec 2017 17:08:33 UTC +00:00 │ "ham"  │ "eggs"  ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 1  │ Mon, 18 Dec 2017 16:08:32 UTC +00:00  │ Mon, 18 Dec 2017 17:08:32 UTC +00:00 │ "ham"  │ "eggs"  ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 2  │ Mon, 18 Dec 2017 16:08:31 UTC +00:00  │ Mon, 18 Dec 2017 17:08:31 UTC +00:00 │ "spam" │ "bacon" ║
╟────┼───────────────────────────────────────┼──────────────────────────────────────┼────────┼─────────╢
║ 3  │ Mon, 18 Dec 2017 16:08:30 UTC +00:00  │ Mon, 18 Dec 2017 17:08:30 UTC +00:00 │ "ham"  │ "eggs"  ║
╚════╧═══════════════════════════════════════╧══════════════════════════════════════╧════════╧═════════╝

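In the example above, all the records have overlapping times, where overlapping just means that the time range defined by a record's start_time and end_time (inclusive) covers or extends over part of another record's range. For this problem, however, we are interested only in records that both overlap in time and have matching bar and baz columns (rows 0, 1 and 3 above). Once those records are found, we'd like to delete all but the earliest, leaving only records 2 and 3: record 2 has no other record with matching bar and baz columns, while record 3 does and has the earliest start and end times.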
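To make that definition concrete, here is a minimal sketch that lists the overlapping pairs in the sample rows. It assumes the times are plain timestamp values and inlines the table as a CTE named foos, so it can be run on its own; the real table will differ.

```sql
-- Sketch only: sample rows from the table above, inlined as a CTE.
with foos (id, start_time, end_time, bar, baz) as (
  values
    (0, timestamp '2017-12-18 16:08:33', timestamp '2017-12-18 17:08:33', 'ham',  'eggs'),
    (1, timestamp '2017-12-18 16:08:32', timestamp '2017-12-18 17:08:32', 'ham',  'eggs'),
    (2, timestamp '2017-12-18 16:08:31', timestamp '2017-12-18 17:08:31', 'spam', 'bacon'),
    (3, timestamp '2017-12-18 16:08:30', timestamp '2017-12-18 17:08:30', 'ham',  'eggs')
)
select
  foo_one.id,
  foo_two.id as overlaps_id
from foos foo_one
join foos foo_two
  on foo_one.id < foo_two.id          -- report each pair once
 and foo_one.bar = foo_two.bar
 and foo_one.baz = foo_two.baz
 -- '[]' makes both bounds inclusive; && is the range overlap operator
 and tsrange(foo_one.start_time, foo_one.end_time, '[]')
     && tsrange(foo_two.start_time, foo_two.end_time, '[]')
order by foo_one.id, foo_two.id;
```

With the four sample rows this returns the pairs (0, 1), (0, 3) and (1, 3). Row 2 never appears because no other row shares its bar and baz values.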

Here is what I have so far:

  delete from foos where id in (
    select
      foo_one.id
    from
      foos foo_one
    where
      user_id = 42
      and exists (
        select
          1
        from
          foos foo_two
        where
          tsrange(foo_two.start_time::timestamp, foo_two.end_time::timestamp, '[]') &&
            tsrange(foo_one.start_time::timestamp, foo_one.end_time::timestamp, '[]')
          and
            foo_one.bar = foo_two.bar
          and
            foo_one.baz = foo_two.baz
          and
            user_id = 42
          and
            foo_one.id != foo_two.id
      )
  );

Thanks for reading!

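Update: I found a solution that works for me. Basically, I can apply the window function row_number() over partitions of the table grouped by the bar and baz fields, and then add a WHERE clause to the DELETE statement that excludes the first entry (the one with the smallest id).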

  delete from foos where id in (
    select id from (
      select
          foo_one.id,
          row_number() over(partition by
                              bar,
                              baz
                            order by id asc)
        from
          foos foo_one
        where
          user_id = 42
          and exists (
            select
              *
            from
              foos foo_two
            where
              tsrange(foo_two.start_time::timestamp,
                        foo_two.end_time::timestamp,
                        '[]') &&
                tsrange(foo_one.start_time::timestamp,
                        foo_one.end_time::timestamp,
                        '[]')
              and
                foo_one.id != foo_two.id
          )
    ) foos where row_number <> 1
  );
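One thing worth noting about the query above: it keeps the row with the smallest id, and in the sample table the earliest row (id 3) happens to have the largest id. If "earliest" is meant to be the smallest start_time, a variant along these lines might be closer to the stated goal. This is only a sketch under that assumption; the alias names ranked and rn are made up here, and the bar/baz match is also repeated inside the EXISTS test.

```sql
-- Sketch only: same shape as the query above, but ranked by start_time.
-- Assumes foos(id, user_id, start_time, end_time, bar, baz) as in the question.
delete from foos
where id in (
  select id
  from (
    select
      foo_one.id,
      row_number() over (
        partition by foo_one.bar, foo_one.baz
        order by foo_one.start_time asc, foo_one.id asc  -- id only breaks ties
      ) as rn
    from foos foo_one
    where foo_one.user_id = 42
      and exists (
        select 1
        from foos foo_two
        where foo_two.user_id = 42
          and foo_two.id <> foo_one.id
          and foo_two.bar = foo_one.bar
          and foo_two.baz = foo_one.baz
          and tsrange(foo_two.start_time::timestamp, foo_two.end_time::timestamp, '[]')
              && tsrange(foo_one.start_time::timestamp, foo_one.end_time::timestamp, '[]')
      )
  ) ranked
  where rn <> 1  -- keep the earliest row of each (bar, baz) partition
);
```

Against the sample rows (assuming they all belong to user 42) this would delete ids 0 and 1 and leave ids 2 and 3. Like the original, it ranks per (bar, baz) partition, so if one partition contains two separate clusters of overlapping rows, only the earliest row of the whole partition survives.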

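【Comments】:

  • Please edit your question and add some sample data and the expected output based on that data. Formatted text please, no screen shots.
  • I'm curious why this was tagged ruby-on-rails.
  • Because it's a RoR project, and I didn't want people to get tripped up by the Ruby-style string interpolation in the query above.
  • I see. But you've already masked everything else with foos. So why not mask the string interpolation as well and make it a purely PostgreSQL question?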

Tags: sql postgresql optimization


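【Solution 1】:

First, a note: you really should provide more information. I understand that you may not want to show some of the real columns from your business, but that makes it hard to understand what you are trying to do.

Still, I'll offer some tips on the subject. I hope they help you and anyone else who runs into a similar problem.

  1. You need to define clearly what counts as an overlap. It can mean many different things to different people.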

Look at these events:

<--a-->
    <---- b ---->
        <---- c ---->
          <-- d -->
            <---- e ---->
    <------- f -------->
                  <--- g --->

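If you define overlap the way Google does, "extend over so as to cover partly", then "b", "d", "e" and "f" partially overlap event "c". If you define overlaps as fully covering an event, then "c" overlaps "d", and "f" overlaps "b", "c" and "d".

  2. Deleting groups can be a problem. In the previous case, what should we do? Should we delete "b", "c" and "d" and keep only "f"? Should we sum their values? Maybe take the average? This is a decision to make column by column, and the meaning of each column matters a lot. So I can't help you with "bar" and "baz".

  3. So, to guess at what you really want, I'm creating a similar events table with id, begin, end and user_id: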

    create table events (
      id integer,
      user_id integer,
      start_time timestamp,
      end_time timestamp,
      name varchar(100)
    );
    

I'm adding sample values:

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 1, 1000, timestamp('2017-10-09 01:00:00'),timestamp('2017-10-09 04:00:00'), 'a' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 2, 1000, timestamp('2017-10-09 03:00:00'),timestamp('2017-10-09 15:00:00'), 'b' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 3, 1000, timestamp('2017-10-09 07:00:00'),timestamp('2017-10-09 19:00:00'), 'c' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 4, 1000, timestamp('2017-10-09 09:00:00'),timestamp('2017-10-09 17:00:00'), 'd' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 5, 1000, timestamp('2017-10-09 17:00:00'),timestamp('2017-10-09 23:00:00'), 'e' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 6, 1000, timestamp('2017-10-09 02:30:00'),timestamp('2017-10-09 22:00:00'), 'f' );

    insert into events
    ( id, user_id, start_time, end_time, name ) values
    ( 7, 1000, timestamp('2017-10-09 17:30:00'),timestamp('2017-10-10 02:00:00'), 'g' );
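The inserts above use the timestamp('...') function-call form, which is MySQL syntax and, as far as I know, is rejected by PostgreSQL. For anyone following along on PostgreSQL, a rough equivalent of the same setup could look like the sketch below. It uses typed timestamp '...' literals and a single multi-row insert; the data is unchanged.

```sql
-- PostgreSQL sketch of the same setup; the values are copied from above.
create table if not exists events (
  id integer,
  user_id integer,
  start_time timestamp,
  end_time timestamp,
  name varchar(100)
);

insert into events (id, user_id, start_time, end_time, name) values
  (1, 1000, timestamp '2017-10-09 01:00:00', timestamp '2017-10-09 04:00:00', 'a'),
  (2, 1000, timestamp '2017-10-09 03:00:00', timestamp '2017-10-09 15:00:00', 'b'),
  (3, 1000, timestamp '2017-10-09 07:00:00', timestamp '2017-10-09 19:00:00', 'c'),
  (4, 1000, timestamp '2017-10-09 09:00:00', timestamp '2017-10-09 17:00:00', 'd'),
  (5, 1000, timestamp '2017-10-09 17:00:00', timestamp '2017-10-09 23:00:00', 'e'),
  (6, 1000, timestamp '2017-10-09 02:30:00', timestamp '2017-10-09 22:00:00', 'f'),
  (7, 1000, timestamp '2017-10-09 17:30:00', timestamp '2017-10-10 02:00:00', 'g');
```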

Now we can play with some nice queries:

List all events that fully overlap another event:

select 
  # EVENT NAME
  event_1.name as event_name,
  # LIST EVENTS THAT THE EVENT OVERLAPS
  GROUP_CONCAT(event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
on
  event_1.user_id = event_2.user_id
and
  event_1.id != event_2.id
and
(
    # START AFTER THE EVENT ONE
    event_2.start_time >= event_1.start_time and
    #  ENDS BEFORE THE EVENT ONE
    event_2.end_time   <= event_1.end_time
)
  group by 
event_1.name

Result:

+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| c          | d              |
| f          | b,d,c          |
+------------+----------------+
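On PostgreSQL the same "fully covers" check can be sketched with range types. The version below is an assumption-laden translation rather than the query that produced the result above: it swaps GROUP_CONCAT for string_agg, uses the range containment operator @>, and inlines the sample events as a CTE so it runs without any table.

```sql
-- Sketch only: sample events inlined so the query is self-contained.
with events (id, user_id, start_time, end_time, name) as (
  values
    (1, 1000, timestamp '2017-10-09 01:00', timestamp '2017-10-09 04:00', 'a'),
    (2, 1000, timestamp '2017-10-09 03:00', timestamp '2017-10-09 15:00', 'b'),
    (3, 1000, timestamp '2017-10-09 07:00', timestamp '2017-10-09 19:00', 'c'),
    (4, 1000, timestamp '2017-10-09 09:00', timestamp '2017-10-09 17:00', 'd'),
    (5, 1000, timestamp '2017-10-09 17:00', timestamp '2017-10-09 23:00', 'e'),
    (6, 1000, timestamp '2017-10-09 02:30', timestamp '2017-10-09 22:00', 'f'),
    (7, 1000, timestamp '2017-10-09 17:30', timestamp '2017-10-10 02:00', 'g')
)
select
  event_1.name as event_name,
  -- string_agg is PostgreSQL's counterpart to MySQL's GROUP_CONCAT
  string_agg(event_2.name, ',' order by event_2.name) as overlaps_names
from events as event_1
join events as event_2
  on event_1.user_id = event_2.user_id
 and event_1.id <> event_2.id
 -- @> is true when the left range contains the whole right range
 and tsrange(event_1.start_time, event_1.end_time, '[]')
     @> tsrange(event_2.start_time, event_2.end_time, '[]')
group by event_1.name
order by event_1.name;
```

This should report the same two rows as the result above, c with d and f with b, c and d, except that the names come out in alphabetical order because of the order by inside string_agg.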

To detect partial overlaps, you need something like this:

select 
  # EVENT NAME
  event_1.name as event_name,
  # LIST EVENTS THAT THE EVENT OVERLAPS
  GROUP_CONCAT(event_2.name) as overlaps_names
from events as event_1
inner join events as event_2
on
  event_1.user_id = event_2.user_id
and
  event_1.id != event_2.id
and
(
  (
    # EVENT TWO STARTS AT OR AFTER THE START OF EVENT ONE
    event_2.start_time >= event_1.start_time and
    # AND STARTS AT OR BEFORE THE END OF EVENT ONE
    event_2.start_time <= event_1.end_time
   ) or
  (
    # EVENT TWO ENDS AT OR AFTER THE START OF EVENT ONE
    event_2.end_time >= event_1.start_time and
    # AND ENDS AT OR BEFORE THE END OF EVENT ONE
    event_2.end_time <= event_1.end_time
   )
)
  group by 
event_1.name

Result:

+------------+----------------+
| event_name | overlaps_names |
+------------+----------------+
| a          | b,f            |
| b          | c,d,a          |
| c          | b,d,e,g        |
| d          | b,e            |
| e          | f,g,d,c        |
| f          | a,g,b,d,c,e    |
| g          | c,e,f          |
+------------+----------------+
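Note that the two-branch test above only matches when event_2 starts or ends inside event_1, so an event_2 that completely surrounds event_1 is left out (f is not listed for b or c, for example). If every kind of overlap should count, one possible PostgreSQL sketch is the range overlap operator &&, which is symmetric. As before, the CTE and string_agg are stand-ins chosen for this illustration.

```sql
-- Sketch only: sample events inlined so the query is self-contained.
with events (id, user_id, start_time, end_time, name) as (
  values
    (1, 1000, timestamp '2017-10-09 01:00', timestamp '2017-10-09 04:00', 'a'),
    (2, 1000, timestamp '2017-10-09 03:00', timestamp '2017-10-09 15:00', 'b'),
    (3, 1000, timestamp '2017-10-09 07:00', timestamp '2017-10-09 19:00', 'c'),
    (4, 1000, timestamp '2017-10-09 09:00', timestamp '2017-10-09 17:00', 'd'),
    (5, 1000, timestamp '2017-10-09 17:00', timestamp '2017-10-09 23:00', 'e'),
    (6, 1000, timestamp '2017-10-09 02:30', timestamp '2017-10-09 22:00', 'f'),
    (7, 1000, timestamp '2017-10-09 17:30', timestamp '2017-10-10 02:00', 'g')
)
select
  event_1.name as event_name,
  string_agg(event_2.name, ',' order by event_2.name) as overlaps_names
from events as event_1
join events as event_2
  on event_1.user_id = event_2.user_id
 and event_1.id <> event_2.id
 -- && is true when the two ranges share at least one point;
 -- '[]' keeps both bounds inclusive, so touching ranges also match
 and tsrange(event_1.start_time, event_1.end_time, '[]')
     && tsrange(event_2.start_time, event_2.end_time, '[]')
group by event_1.name
order by event_1.name;
```

The plain-comparison form of the same test is event_2.start_time <= event_1.end_time and event_2.end_time >= event_1.start_time, which also works on MySQL.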

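Of course, I'm using "group by" to make the output easier to read. It can also be useful if you want to sum or average the overlapping data to update your parent data before deleting. The "group_concat" function may not exist in Postgres, or it may have a different name. A "standard SQL" version you can test with is: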

select 
  # EVENT NAME
  event_1.name as event_name,
  # LIST EVENTS THAT THE EVENT OVERLAPS
  event_2.name as overlaps_name
from events as event_1
inner join events as event_2
on
  event_1.user_id = event_2.user_id
and
  event_1.id != event_2.id
and
(
    # START AFTER THE EVENT ONE
    event_2.start_time >= event_1.start_time and
    #  ENDS BEFORE THE EVENT ONE
    event_2.end_time   <= event_1.end_time
)

Result:

+------------+---------------+
| event_name | overlaps_name |
+------------+---------------+
| f          | b             |
| f          | c             |
| c          | d             |
| f          | d             |
+------------+---------------+

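If you are going to try some math, keep in mind the risk of adding the values of "c" and "d" to "b" and then adding their values again to "f", which makes "f" wrong.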

// should be
new f = old f + b + old c + d
new c = old c + b + d // unnecessary if you are going to delete it

// very common mistake
new c = old c + b + d // unnecessary but not wrong yet
new f = new c + b + d = ( old c + b + d ) + b + d // wrong!!
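One way to sidestep that mistake is to compute every new total in a single SELECT, so that each sum reads only the original values. The sketch below is purely illustrative: the amount column and its numbers are invented, since the events table has no value column, and it uses the "fully covers" definition via the PostgreSQL @> operator.

```sql
-- Hypothetical data: "amount" is an invented column used only for illustration.
with events (name, start_time, end_time, amount) as (
  values
    ('b', timestamp '2017-10-09 03:00', timestamp '2017-10-09 15:00', 10),
    ('c', timestamp '2017-10-09 07:00', timestamp '2017-10-09 19:00', 20),
    ('d', timestamp '2017-10-09 09:00', timestamp '2017-10-09 17:00', 30),
    ('f', timestamp '2017-10-09 02:30', timestamp '2017-10-09 22:00', 40)
)
select
  parent.name,
  -- every child.amount here is an original value, never an updated one
  parent.amount + coalesce(sum(child.amount), 0) as new_amount
from events as parent
left join events as child
  on child.name <> parent.name
 and tsrange(parent.start_time, parent.end_time, '[]')
     @> tsrange(child.start_time, child.end_time, '[]')
group by parent.name, parent.amount
order by parent.name;
```

With these made-up numbers, f comes out as 40 + 10 + 20 + 30 = 100 and c as 20 + 30 = 50, because c only fully covers d. Nothing is counted twice, since no row is updated before another row's sum is taken.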

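You can use this URL http://sqlfiddle.com/#!9/1d2455/19 to test all of these queries online and to create your own queries against the same database. Keep in mind, though, that it is MySQL, not PostgreSQL. It is still good for testing standard SQL.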

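【Comments】:

  • There is a thread on StackOverflow about converting group_concat to Postgres: stackoverflow.com/questions/2560946/…. It looks simple.
  • Thanks for the reply! I didn't end up going this route, but it's an interesting approach.
  • This didn't work for me! For what it's worth, I did upvote it.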