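【Question title】: SQL Challenge/Puzzle: How to merge nested ranges?
【Posted】: 2017-02-22 13:53:30
【Question description】:
  • This challenge is based on a real-life use case involving IP ranges.
  • The solution I present is based on the stack trace challenge I posted earlier. Each range start is treated as a PUSH operation and each range end + 1 is treated as a POP operation.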

The challenge

We have a data set of ranges, where each range has a starting point, an ending point and a value.

create table ranges
(
    range_start     int         not null
   ,range_end       int         not null
   ,range_val       char(1)     not null
)
;

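A range can contain another range or follow another range, but it cannot be equal to another range or intersect with another range.

These are valid relations between ranges: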

(1)           (2)           (3)           (4)
---------     ---------     ---------     -------- -----------
---                 ---        ---

These relations are not valid:

(5)                (6)
-------        --------       
-------              --------
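
The six cases above can be checked mechanically. The following is a minimal Python sketch, not part of the original challenge (which asks for pure SQL); the names `relation` and `is_valid_set` are made up for illustration, and ranges are assumed to be inclusive `(start, end)` pairs.

```python
from itertools import combinations

def relation(a, b):
    """Classify two inclusive (start, end) ranges."""
    (s1, e1), (s2, e2) = a, b
    if (s1, e1) == (s2, e2):
        return "equal"        # case (5): not allowed
    if e1 < s2 or e2 < s1:
        return "follows"      # case (4): the ranges do not touch
    if (s1 <= s2 and e2 <= e1) or (s2 <= s1 and e1 <= e2):
        return "contains"     # cases (1)-(3): one range nests inside the other
    return "intersects"       # case (6): partial overlap, not allowed

def is_valid_set(ranges):
    """True when every pair of ranges either nests or is disjoint."""
    return all(relation(a, b) in ("contains", "follows")
               for a, b in combinations(ranges, 2))
```

For example, `relation((1, 9), (7, 9))` falls under case (2) and is reported as `"contains"`, while `relation((1, 7), (5, 12))` is reported as `"intersects"`.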

Our initial ranges, when presented graphically, might look something like this (the letter represents range_val):

AAAAAAAA  BBCCCCCCC
 DDE   F   GGGGG
   H       IIII
             J

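The goal is to take the initial set of ranges and create a new set according to the following rule:

A contained range overrides the corresponding sub-range of the containing range.

The requested result, when presented graphically, might look something like this: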

ADDHAAAF  BIIJIGCCC

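Requirements

  • The solution should be a single SQL query (sub-queries are fine).
  • The use of T-SQL, PL/SQL etc. is not allowed.
  • The use of UDFs (user-defined functions) is not allowed.

Data sample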

AAAAAAAAAAAAAAAAAAAAAAAAAAAA  BBBB    CCCCCCCCCCCCCCCCCCCCCCCCC
DDDE  FFFFFFFF    GGGGGGGGG               HHHHHHHH    IIIIIII
JJ      KKKLLL       MM NN                              OOOOO
            P                                              QQ

insert into ranges (range_start,range_end,range_val) values (1  ,28 ,'A');
insert into ranges (range_start,range_end,range_val) values (31 ,34 ,'B');
insert into ranges (range_start,range_end,range_val) values (39 ,63 ,'C');
insert into ranges (range_start,range_end,range_val) values (1  ,3  ,'D');
insert into ranges (range_start,range_end,range_val) values (4  ,4  ,'E');
insert into ranges (range_start,range_end,range_val) values (7  ,14 ,'F');
insert into ranges (range_start,range_end,range_val) values (19 ,27 ,'G');
insert into ranges (range_start,range_end,range_val) values (43 ,50 ,'H');
insert into ranges (range_start,range_end,range_val) values (55 ,61 ,'I');
insert into ranges (range_start,range_end,range_val) values (1  ,2  ,'J');
insert into ranges (range_start,range_end,range_val) values (9  ,11 ,'K');
insert into ranges (range_start,range_end,range_val) values (12 ,14 ,'L');
insert into ranges (range_start,range_end,range_val) values (22 ,23 ,'M');
insert into ranges (range_start,range_end,range_val) values (25 ,26 ,'N');
insert into ranges (range_start,range_end,range_val) values (57 ,61 ,'O');
insert into ranges (range_start,range_end,range_val) values (13 ,13 ,'P');
insert into ranges (range_start,range_end,range_val) values (60 ,61 ,'Q');

Requested results

(Nulls are presented here as empty spaces)

JJDEAAFFKKKLPLAAAAGGGMMGNNGA  BBBB    CCCCHHHHHHHHCCCCIIOOOQQCC

range_start range_end range_val
----------- --------- ---------
1           2          J
3           3          D
4           4          E
5           6          A
7           8          F
9           11         K
12          12         L
13          13         P
14          14         L
15          18         A
19          21         G
22          23         M
24          24         G
25          26         N
27          27         G
28          28         A
29          30         
31          34         B
35          38         
39          42         C
43          50         H
51          54         C
55          56         I
57          59         O
60          61         Q
62          63         C

Optional addition of a final row:

64

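【Question comments】:

  • Please edit your question to include only the relevant tags.
  • @ZoharPeled, tags removed.
  • Any suggested tags other than SQL?
  • Yes, whatever RDBMS you are actually using.
  • All of the ones I tagged: Teradata, Oracle, SQL Server, PostgreSQL and Hive. We had a similar discussion on another post and, as you can see, in addition to vendor-specific solutions I have submitted a generic solution for all of these databases. stackoverflow.com/a/39941615/6336479

Tags: sql sql-server oracle hive teradata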


【Solution 1】:

Oracle solution:

with l as ( select level lvl from dual connect by level < 66 ),
     r as ( select range_start r1, range_end r2, range_val v, 
                    range_end - range_start + 1 cnt 
              from ranges ),
     t1 as (select distinct lvl, 
                   nvl(max(v) keep (dense_rank first order by cnt) 
                              over (partition by lvl), '*' ) m
              from l left join r on lvl between r1 and r2 ),
     t2 as (select lvl, m, case when lag(m) over (order by lvl) <> m then 0 else 1 end mrk 
              from t1),
     t3 as (select lvl, m, lvl - sum(mrk) over (order by lvl) grp from t2)
select min(lvl) r1, max(lvl) r2, nullif(min(m), '*') val
  from t3 group by grp order by r1

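The output is as requested. My English is far from good, so it is hard to explain, but let's try:

  • l - number generator,
  • r - data from ranges, with the length of each range computed as cnt,
  • t1 - finds, for each lvl, the value of the shortest range covering it,
  • t2 - adds a marker telling whether a new range starts at this lvl,
  • t3 - adds the column that is then used for grouping the data.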
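
The steps listed above can also be followed outside SQL. This is a minimal Python sketch of the same idea, not the author's code: `merge_by_points`, `lo` and `hi` are names assumed for the example, and ranges are inclusive `(start, end, val)` tuples.

```python
from itertools import groupby

def merge_by_points(ranges, lo, hi):
    """Return (start, end, val) rows for the points lo..hi; val is None in gaps."""
    def value_at(p):
        # t1: among the ranges covering point p, the shortest one wins
        covering = [(e - s, v) for s, e, v in ranges if s <= p <= e]
        return min(covering)[1] if covering else None

    # l: one row per point, like the number generator
    points = [(p, value_at(p)) for p in range(lo, hi + 1)]

    # t2/t3: collapse each run of consecutive points sharing a value
    out = []
    for val, run in groupby(points, key=lambda pv: pv[1]):
        run = list(run)
        out.append((run[0][0], run[-1][0], val))
    return out
```

Like the query, the sketch visits every point between `lo` and `hi`, so its cost grows with the width of the ranges rather than with their number.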

【Discussion】:

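  • Thanks for your effort and for taking the challenge. Your solution does return the right results, but its basic concept of creating every point in the whole range is clearly inefficient. If my data sample contained only a single range, but a very wide one, e.g. (0,4294967295,'A'), that would be the end of it. I would like to encourage you to find another solution using a different approach. Thanks again :-)
  • A comment about t1: in my view an aggregation is more appropriate here. How about select lvl ,nvl(max(v) keep (dense_rank first order by cnt), '*' ) m from l left join r on lvl between r1 and r2 group by lvl ; ?
  • A comment about t1 and the code that follows it: finding the consecutive values takes only 2 steps (t1 + an additional step). Please take a look at this suggestion: t1 as ( select lvl ,max(v) keep (dense_rank first order by cnt) m ,row_number () over (partition by max(v) keep (dense_rank first order by cnt) order by lvl) as rn from l left join r on lvl between r1 and r2 group by lvl) select min(lvl) range_start,max(lvl) range_end,m range_val from t1 group by m,lvl - rn order by range_start ;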
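
The last suggestion groups by m and lvl - rn: within one value, rn grows by 1 per row, so lvl - rn stays constant exactly while the points are consecutive. A minimal Python sketch of that trick follows; the function name `islands` and the `(lvl, val)` input shape are assumptions made for illustration.

```python
from collections import defaultdict

def islands(points):
    """points: iterable of (lvl, val) with unique lvl. Returns sorted (start, end, val) runs."""
    rn = defaultdict(int)        # row_number() over (partition by val order by lvl)
    groups = defaultdict(list)   # key plays the role of "group by m, lvl - rn"
    for lvl, val in sorted(points, key=lambda p: p[0]):
        rn[val] += 1
        groups[(val, lvl - rn[val])].append(lvl)
    return sorted(((min(g), max(g), val) for (val, _), g in groups.items()),
                  key=lambda row: row[0])
```

With the points `(1,'A') (2,'A') (3,'B') (4,'A') (5,'A')`, the two runs of A get the differences 0 and 1 and therefore land in separate groups.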
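【Solution 2】:
  • The solution is based on the stack trace challenge I posted earlier. Each range start is treated as a PUSH operation and each range end + 1 is treated as a POP operation.
  • Performance-wise, you may notice how the 2 inner analytic functions use the same window and are therefore executed in a single step.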
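
The PUSH/POP idea can be illustrated procedurally. The sketch below is an assumption-laden Python rendering, not a translation of the queries: it keeps an explicit stack, whereas the SQL only tracks the stack depth and picks the last non-null value per depth. The name `merge_nested` is hypothetical, and the input is assumed to satisfy the nesting rules of the question.

```python
def merge_nested(ranges):
    """ranges: inclusive (start, end, val) tuples that only nest or follow each other.
    Returns (start, end, val) rows; val is None for gaps between ranges."""
    events = []
    for s, e, v in ranges:
        events.append((s, 1, -e, v))         # PUSH at range start; wider range first
        events.append((e + 1, 0, 0, None))   # POP at range end + 1; sorts before a PUSH
    events.sort(key=lambda ev: ev[:3])

    stack, out = [], []
    for i, (pos, is_push, _, val) in enumerate(events):
        if is_push:
            stack.append(val)
        else:
            stack.pop()                      # the innermost open range always ends first
        # the current segment runs up to the next event position, exclusive
        nxt = events[i + 1][0] if i + 1 < len(events) else None
        if nxt is not None and nxt - 1 >= pos:
            out.append((pos, nxt - 1, stack[-1] if stack else None))
    return out
```

The work is driven by 2 events per range plus one sort, so a single very wide range costs no more than a narrow one. The condition `nxt - 1 >= pos` plays the role of the `new_range_end >= new_range_start` filter in the queries.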

Teradata

select      new_range_start
           ,new_range_end

           ,last_value (range_val ignore nulls) over 
            (
                partition by    stack_depth
                order by        new_range_start ,range_start ,range_end desc 
                rows            unbounded preceding
            )                                                                   as new_range_val

from       (select      new_range_start
                       ,range_val
                       ,range_start
                       ,range_end

                       ,sum (case when range_val is null then -1 else 1 end) over 
                        (
                            order by    new_range_start, range_start ,range_end desc  
                            rows        unbounded preceding
                        )                                                                   as stack_depth

                       ,min (new_range_start) over
                        (
                            order by    new_range_start ,range_start ,range_end desc
                            rows        between 1 following and 1 following

                        ) - 1                                                               as new_range_end

            from        (           select range_start     ,range_start ,range_end ,range_val              from ranges
                        union all   select range_end   + 1 ,range_start ,range_end ,cast (null as char(1)) from ranges
                        )
                        r (new_range_start,range_start,range_end,range_val)
            )
            r

qualify     new_range_end >= new_range_start

order by    new_range_start
;

Oracle

select      new_range_start
           ,new_range_end
           ,new_range_val                       

from       (select      new_range_start
                       ,new_range_end

                       ,last_value (range_val ignore nulls) over 
                        (
                            partition by    stack_depth
                            order by        new_range_start ,range_start ,range_end desc 
                            rows            unbounded preceding
                        )                                                                   as new_range_val


            from       (select      new_range_start
                                   ,range_start
                                   ,range_end
                                   ,range_val

                                   ,sum (case when range_val is null then -1 else 1 end) over 
                                    (
                                        order by    new_range_start, range_start ,range_end desc  
                                        rows        unbounded preceding
                                    )                                                                as stack_depth

                                   ,lead (new_range_start) over
                                    (
                                        order by    new_range_start, range_start ,range_end desc 
                                    ) - 1                                                            as new_range_end

                        from        (           select range_start     as new_range_start ,range_start ,range_end ,range_val              from ranges
                                    union all   select range_end   + 1                    ,range_start ,range_end ,cast (null as char(1)) from ranges
                                    )
                                    r 
                        )
                        r
            )
            r

where       new_range_end >= new_range_start

order by    new_range_start
;

SQL Server / PostgreSQL / Hive

select      *

from       (select      new_range_start
                       ,new_range_end
                       ,min (range_val) over
                        (
                            partition by    stack_depth,new_range_val_group_id
                        )                                                       as new_range_val                       

            from       (select      new_range_start
                                   ,new_range_end
                                   ,range_val
                                   ,stack_depth

                                   ,count (range_val) over 
                                    (
                                        partition by    stack_depth
                                        order by        new_range_start ,range_start ,range_end desc 
                                        rows            unbounded preceding
                                    )                                                                   as new_range_val_group_id


                        from       (select      new_range_start
                                               ,range_start
                                               ,range_end
                                               ,range_val

                                               ,sum (case when range_val is null then -1 else 1 end) over 
                                                (
                                                    order by    new_range_start, range_start ,range_end desc  
                                                    rows        unbounded preceding
                                                )                                                                as stack_depth

                                               ,lead (new_range_start) over
                                                (
                                                    order by    new_range_start, range_start ,range_end desc 
                                                ) - 1                                                            as new_range_end

                                    from        (           select range_start     as new_range_start ,range_start ,range_end ,range_val                           from ranges
                                                union all   select range_end   + 1 as new_range_start ,range_start ,range_end ,cast (null as char(1)) as range_val from ranges
                                                )
                                                r 
                                    )
                                    r
                        )
                        r
            )
            r

where       new_range_end >= new_range_start

order by    new_range_start
;

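【Discussion】:

  • The problem looks similar to Packing Intervals, but I could not figure out how to get the value rather than a simple stack depth count. Very nice.
  • @VladimirBaranov: the problem is actually the same as sqlmag.com/sql-server/packing-intervals-priorities
  • @dnoeth, I took a look at the link and saw that there are no restrictions on the relations between the ranges, which makes the problem I presented a special case of that one (with much cleaner solutions).
【Solution 3】:

Oracle solution 2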

 WITH borders AS /*get all borders of interval*/ 
  (SELECT DISTINCT DECODE(is_end, 0, range_start, range_end) AS border 
                  ,is_end 
   FROM   ranges r, 
          (SELECT 0 AS is_end FROM dual UNION ALL 
           SELECT 1 AS is_end FROM dual)), 
 interv AS  /*get all intervals*/ 
  (SELECT border + is_end AS beg_int 
         ,lead(border) over(ORDER BY border, is_end ) 
           - lead(DECODE(is_end, 0, 1, 0)) over(ORDER BY border, is_end) AS end_int 
   FROM   borders 
   ORDER  BY 1) 
 SELECT i.beg_int 
       ,i.end_int 
       ,(SELECT MAX(r.range_val) keep (dense_rank FIRST ORDER BY r.range_end - r.range_start) 
       FROM ranges r 
       WHERE i.beg_int >= r.range_start AND i.end_int <= r.range_end) AS range_val   
 FROM   interv i 
 WHERE  beg_int <= end_int OR end_int IS NULL 
 ORDER  BY i.beg_int; 

Added a solution without a self join. EDIT: fixed a defect.

 WITH intervals AS 
  (SELECT DECODE(is_end, -1, range_val, NULL) AS range_val 
         ,DECODE(is_end, -1, range_start, range_end) AS border 
         ,is_end 
         ,- (SUM(is_end) over(ORDER BY DECODE(is_end, -1, range_start, range_end), is_end, (range_end - range_start) * is_end)) AS poss 
         ,(range_end - range_start) * is_end AS ord2 
   FROM   ranges r 
         ,(SELECT -1 AS is_end FROM   dual UNION ALL 
           SELECT 1  AS is_end FROM   dual)), 
 range_stack AS 
  (SELECT border + DECODE(is_end, 1, 1, 0) AS begin_int 
         ,lead(border) over(ORDER BY border, is_end, ord2) 
           + DECODE(lead(is_end) over(ORDER BY border, is_end, ord2), 1, 0, -1) AS end_int 
         ,last_value(range_val ignore NULLS) over(PARTITION BY poss ORDER BY border, is_end, ord2) AS range_val 
   FROM   intervals) 
 SELECT begin_int 
       ,end_int 
       ,range_val 
 FROM   range_stack 
 WHERE  end_int >= begin_int 
        OR end_int IS NULL;

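【Discussion】:

  • Hi Michael :-) This is definitely the right direction and nicely coded; performance-wise, however, it is a problem if we're dealing with a large number of ranges.
  • The left join is needed for the empty intervals. What is wrong with the left join? One of the possible plans: Oracle builds all the intervals at once and sorts them, then sorts the ranges from the table and does a merge join with the intervals. If there is a lot of data, this would be an efficient way.
  • The issue is not the LEFT but the JOIN, which takes us to O(n^2) complexity. We can demonstrate it with this data sample: create table range (range_start,range_end,range_val) as with t (n) as (select level from dual connect by level … The input is small, only 10K rows, but the JOIN generates 100M rows. On my laptop the execution time is about 2 minutes. The explain plan shows the use of MERGE JOIN. If we use an initial set of 100K rows, the join will generate 100G rows...
  • OK, I see. I can rewrite it to use a subquery. But in your example the first interval would be compared with 2 ranges, the second with 3, and so on. The count would grow up to the central interval (-1,1) and then decrease. I guess the total count would be lower than what you wrote, but it would still be a lot.
  • There also seems to be a bug. You can see it using the following data sample: create table ranges (range_start,range_end,range_val) as with t (n) as (select level from dual connect by level …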
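
The quadratic growth discussed in the comments above can be reproduced on a small scale. The data sample in the comment is cut off, so the input below is a guess: n fully nested ranges (-k, k), which would fit the "central interval (-1,1)" mentioned in the reply. The function names are hypothetical, and the sketch only counts the rows a join between elementary intervals and their containing ranges would produce.

```python
def elementary_intervals(ranges):
    """Cut the covered span at every range start and every range end + 1."""
    cuts = sorted({s for s, _ in ranges} | {e + 1 for _, e in ranges})
    return [(a, b - 1) for a, b in zip(cuts, cuts[1:])]

def join_row_count(ranges):
    """Rows produced by joining each interval to every range that contains it."""
    return sum(1
               for lo, hi in elementary_intervals(ranges)
               for s, e in ranges
               if s <= lo and hi <= e)

def nested(n):
    """Assumed worst case: n ranges, each strictly inside the next one."""
    return [(-k, k) for k in range(1, n + 1)]
```

Under this assumption `join_row_count(nested(n))` equals n * n, which agrees with the 10K rows and 100M joined rows quoted in the comment.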