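This isn't trivial, but it can be done in a single query.
The hard part is combining a set of intervals into the largest possible contiguous intervals. A solution is described in detail in this post.
To get the result you want, you now need to:
- Use the query given in the link to compute the largest possible contiguous intervals for each value of col1.
With your sample values, the result would be:
col1 lower_bound upper_bound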
a 20 60
b 12 15
b 31 50
c 100 300
- Associate each row of your table with one of these large intervals. Each row can fall into only one such interval, so an INNER JOIN will do:
SELECT combine.*, large_intervals.lower_bound, large_intervals.upper_bound
FROM combine
INNER JOIN (<query from step 1>) AS large_intervals
    ON large_intervals.col1 = combine.col1
    AND large_intervals.lower_bound <= combine.col2
    AND large_intervals.upper_bound >= combine.col3
You get:
col1 col2 col3 col4 col5 lower_bound upper_bound
a 45 50 1 0 20 60
a 50 60 6 0 20 60
a 20 45 0 5 20 60
b 31 50 0 1 31 50
b 12 15 5 0 12 15
c 100 200 3 2 100 300
c 150 300 1 2 100 300
- The rest is easy: just group by col1, lower_bound and upper_bound:
SELECT
    col1,
    lower_bound AS col2,
    upper_bound AS col3,
    MAX(col4) AS col4,
    MAX(col5) AS col5
FROM (<query from step 2>) AS decorated_table
GROUP BY col1, lower_bound, upper_bound
This gives you the result you want.
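The three steps above can be sketched outside the database to check the logic. The Python below is only an illustration, not the SQL method itself: it uses a plain sort-and-sweep for step 1, and the function names `merge_intervals` and `combine_rows` are hypothetical. The sample rows are assumed to be the ones from the example.

```python
from collections import defaultdict

# Sample rows (col1, col2, col3, col4, col5); the second 'a' row uses upper
# bound 60 so that it fits inside the merged interval 20-60.
rows = [
    ("a", 45, 50, 1, 0),
    ("a", 50, 60, 6, 0),
    ("a", 20, 45, 0, 5),
    ("b", 31, 50, 0, 1),
    ("b", 12, 15, 5, 0),
    ("c", 100, 200, 3, 2),
    ("c", 150, 300, 1, 2),
]


def merge_intervals(rows):
    """Step 1: largest contiguous intervals per col1 (sort, then sweep)."""
    by_key = defaultdict(list)
    for col1, lo, hi, *_ in rows:
        by_key[col1].append((lo, hi))
    merged = []
    for col1, intervals in sorted(by_key.items()):
        intervals.sort()
        cur_lo, cur_hi = intervals[0]
        for lo, hi in intervals[1:]:
            if lo <= cur_hi:              # overlaps or touches: extend
                cur_hi = max(cur_hi, hi)
            else:                         # gap: close the current interval
                merged.append((col1, cur_lo, cur_hi))
                cur_lo, cur_hi = lo, hi
        merged.append((col1, cur_lo, cur_hi))
    return merged


def combine_rows(rows):
    """Steps 2 and 3: join each row to its interval, then take MAX per group."""
    large = merge_intervals(rows)
    groups = {}
    for col1, col2, col3, col4, col5 in rows:
        for key in large:
            k1, lo, hi = key
            # Same condition as the INNER JOIN in step 2.
            if k1 == col1 and lo <= col2 and hi >= col3:
                best4, best5 = groups.get(key, (col4, col5))
                groups[key] = (max(best4, col4), max(best5, col5))
    return [key + vals for key, vals in sorted(groups.items())]


for row in combine_rows(rows):
    print(row)
```

With these rows, `merge_intervals` yields the four large intervals listed in step 1, and `combine_rows` yields one aggregated row per interval.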
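Back to the hard part: the post mentioned above presents a solution for PostgreSQL. MySQL has no range types, but the solution can be adapted. For example, instead of lower(range), use the lower bound col2 directly. The solution also relies on window functions, namely lag and lead, but MySQL supports them with the same syntax (since version 8.0), so that is not a problem. Note also that they use COALESCE(upper(range), 'infinity') to guard against unbounded ranges. Since your ranges are finite, you don't need to worry about that and can use the upper bound, col3, directly. Here is the adaptation: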
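WITH a AS (
    -- Running maximum of the upper bound, in order of the lower bound
    SELECT
        col2,
        col3,
        col2 AS lower_bound,
        MAX(col3) OVER (ORDER BY col2, col3) AS upper_bound
    FROM combine
)
, b AS (
    -- step is 1 when a gap separates this row from the previous ones, NULL otherwise
    SELECT *, (LAG(upper_bound) OVER (ORDER BY col2, col3) < lower_bound) OR NULL AS step
    FROM a
)
, c AS (
    -- COUNT ignores NULLs, so the running count only increases at each gap
    SELECT *, COUNT(step) OVER (ORDER BY col2, col3) AS grp
    FROM b
)
SELECT
    MIN(lower_bound) AS lower_bound,
    MAX(upper_bound) AS upper_bound  -- RANGE is a reserved word in MySQL, so it is not used as an alias
FROM c
GROUP BY grp
ORDER BY 1;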
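To see how the three CTEs cooperate, here is a small Python trace that mirrors them one by one. It is a sketch for illustration only: the function name `islands` and the input list are hypothetical, and `None` stands in for SQL's NULL.

```python
# Hypothetical input: (col2, col3) pairs, treated as a single group.
intervals = [(45, 50), (50, 60), (20, 45), (100, 200), (150, 300)]


def islands(intervals):
    """Mirror CTEs a, b and c, then the final GROUP BY grp."""
    ordered = sorted(intervals)  # ORDER BY col2, col3

    # CTE a: running maximum of the upper bound.
    running_max, a = None, []
    for lower, upper in ordered:
        running_max = upper if running_max is None else max(running_max, upper)
        a.append((lower, running_max))

    # CTE b: step is True when the previous running maximum lies below this
    # lower bound, and None otherwise (the effect of "... OR NULL" in SQL).
    b = []
    for i, (lower, upper_bound) in enumerate(a):
        prev = a[i - 1][1] if i > 0 else None  # LAG(upper_bound)
        step = True if prev is not None and prev < lower else None
        b.append((lower, upper_bound, step))

    # CTE c: running count of non-NULL steps gives the group number.
    grp, c = 0, []
    for lower, upper_bound, step in b:
        if step is not None:
            grp += 1
        c.append((lower, upper_bound, grp))

    # Final SELECT: MIN(lower_bound) and MAX(upper_bound) per grp.
    merged = {}
    for lower, upper_bound, g in c:
        lo, hi = merged.get(g, (lower, upper_bound))
        merged[g] = (min(lo, lower), max(hi, upper_bound))
    return [merged[g] for g in sorted(merged)]


print(islands(intervals))
```

Because the comparison is strict, intervals that merely touch (such as 45-50 and 50-60) stay in the same group; only a real gap starts a new one.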
This works for a single group. If you want the ranges per col1, you can adapt it like this:
WITH a AS (
    SELECT
        col1,
        col2,
        col3,
        col2 AS lower_bound,
        MAX(col3) OVER (PARTITION BY col1 ORDER BY col2, col3) AS upper_bound
    FROM combine
)
, b AS (
    SELECT *, (LAG(upper_bound) OVER (PARTITION BY col1 ORDER BY col2, col3) < lower_bound) OR NULL AS step
    FROM a
)
, c AS (
    SELECT *, COUNT(step) OVER (PARTITION BY col1 ORDER BY col2, col3) AS grp
    FROM b
)
SELECT
    col1,  -- included so each interval can be matched to its group
    MIN(lower_bound) AS lower_bound,
    MAX(upper_bound) AS upper_bound  -- RANGE is a reserved word in MySQL, so it is not used as an alias
FROM c
GROUP BY col1, grp
ORDER BY col1, lower_bound;
Putting everything together, we get the following (tested on the sample you provided), which returns exactly the output you expect:
WITH a AS (
    SELECT
        col1,
        col2,
        col3,
        col2 AS lower_bound,
        MAX(col3) OVER (PARTITION BY col1 ORDER BY col2, col3) AS upper_bound
    FROM combine
)
, b AS (
    SELECT *, (LAG(upper_bound) OVER (PARTITION BY col1 ORDER BY col2, col3) < lower_bound) OR NULL AS step
    FROM a
)
, c AS (
    SELECT *, COUNT(step) OVER (PARTITION BY col1 ORDER BY col2, col3) AS grp
    FROM b
)
, large_intervals AS (
    SELECT
        col1,
        MIN(lower_bound) AS lower_bound,
        MAX(upper_bound) AS upper_bound
    FROM c
    GROUP BY col1, grp
)
, combine_with_large_interval AS (
    SELECT
        combine.*,
        large_intervals.lower_bound,
        large_intervals.upper_bound
    FROM combine
    INNER JOIN large_intervals
        ON large_intervals.col1 = combine.col1
        AND large_intervals.lower_bound <= combine.col2
        AND large_intervals.upper_bound >= combine.col3
)
SELECT
    col1,
    lower_bound AS col2,
    upper_bound AS col3,
    MAX(col4) AS col4,
    MAX(col5) AS col5
FROM combine_with_large_interval
GROUP BY col1, lower_bound, upper_bound
ORDER BY col1, col2, col3;
Voilà!