【Posted】: 2020-11-25 05:04:06
【Question】:
Suppose we have a table of candidate performance:
CREATE TABLE IF NOT EXISTS candidates AS
WITH RECURSIVE candidates(team, score) AS (
SELECT RANDOM() % 1000, RANDOM() % 1000000
UNION
SELECT RANDOM() % 1000, RANDOM() % 1000000
FROM candidates
LIMIT 1000000
)
SELECT team, score
FROM candidates;
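Our goal is to output a list of the 1000 teams together with the total score of the candidates on each team. However, if a team's total score is not in the top half, it should be replaced with zero. I came up with two ways to do this:
- Using EXISTS, which takes Run Time: real 30.653 user 30.635649 sys 0.008798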
WITH top_teams_verbose(top_team, total_score) AS (
SELECT team, SUM(score)
FROM candidates
GROUP BY team
ORDER BY 2 DESC
LIMIT 500
)
SELECT team, SUM(score) * EXISTS(SELECT 1 FROM top_teams_verbose WHERE team = top_team)
FROM candidates
GROUP BY team;
Query plan
QUERY PLAN
|--SCAN TABLE candidates
|--USE TEMP B-TREE FOR GROUP BY
`--CORRELATED SCALAR SUBQUERY 2
|--CO-ROUTINE 1
| |--SCAN TABLE candidates
| |--USE TEMP B-TREE FOR GROUP BY
| `--USE TEMP B-TREE FOR ORDER BY
`--SCAN SUBQUERY 1
- Using IN, which takes Run Time: real 0.045 user 0.041872 sys 0.002999
WITH top_teams_verbose(top_team, total_score) AS (
SELECT team, SUM(score)
FROM candidates
GROUP BY team
ORDER BY 2 DESC
LIMIT 500
),
top_teams AS (
SELECT top_team
FROM top_teams_verbose
)
SELECT team, SUM(score) * (team IN top_teams)
FROM candidates
GROUP BY team;
Query plan
QUERY PLAN
|--SCAN TABLE candidates
|--USE TEMP B-TREE FOR GROUP BY
`--LIST SUBQUERY 3
|--CO-ROUTINE 1
| |--SCAN TABLE candidates
| |--USE TEMP B-TREE FOR GROUP BY
| `--USE TEMP B-TREE FOR ORDER BY
`--SCAN SUBQUERY 1
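Why is this the case? Perhaps EXISTS is executed for every row, while IN is used like an aggregate function? I looked at the query plans, and the only difference (CORRELATED SCALAR SUBQUERY vs. LIST SUBQUERY) is too abstract to be informative.
I am using SQLite3 version 3.31.1 2020-01-27 19:55:54 3bfa9cc97da10598521b342961df8f5f68c7388fa117345eeb516eaa837bb4d6 on RHEL 7.
【Comments】: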
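For anyone who wants to reproduce the comparison, here is a minimal sketch using Python's built-in sqlite3 module. It is not the original benchmark: the table is a tiny deterministic stand-in (40 teams, LIMIT 20 instead of 1000 teams and LIMIT 500), and the helper names make_db, run and plan are made up for illustration. It only checks that the two queries return the same rows and prints their plans; it does not try to reproduce the timings.

```python
import sqlite3

# Shared CTE from the question, with LIMIT scaled down to half of 40 teams.
TOP_TEAMS_CTE = """
WITH top_teams_verbose(top_team, total_score) AS (
    SELECT team, SUM(score)
    FROM candidates
    GROUP BY team
    ORDER BY 2 DESC
    LIMIT 20
)
"""

# Variant 1: correlated EXISTS subquery (references the outer "team").
EXISTS_QUERY = TOP_TEAMS_CTE + """
SELECT team,
       SUM(score) * EXISTS(SELECT 1 FROM top_teams_verbose WHERE team = top_team)
FROM candidates
GROUP BY team
"""

# Variant 2: uncorrelated IN against a single-column CTE.
IN_QUERY = TOP_TEAMS_CTE + """,
top_teams AS (
    SELECT top_team
    FROM top_teams_verbose
)
SELECT team, SUM(score) * (team IN top_teams)
FROM candidates
GROUP BY team
"""


def make_db(n_teams=40, per_team=5):
    """Build a small in-memory table (assumption: not the poster's random data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE candidates(team INTEGER, score INTEGER)")
    # Team t gets scores 10*t, 10*t+1, ... so every team total is distinct.
    rows = [(t, 10 * t + k) for t in range(n_teams) for k in range(per_team)]
    conn.executemany("INSERT INTO candidates VALUES (?, ?)", rows)
    return conn


def run(conn, sql):
    """Return the query result as a sorted list of (team, score) tuples."""
    return sorted(conn.execute(sql).fetchall())


def plan(conn, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for the statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


if __name__ == "__main__":
    conn = make_db()
    print("same result:", run(conn, EXISTS_QUERY) == run(conn, IN_QUERY))
    for label, sql in (("EXISTS", EXISTS_QUERY), ("IN", IN_QUERY)):
        print(label)
        for detail in plan(conn, sql):
            print("   ", detail)
```

The plan text depends on the SQLite version linked into Python, so it may not match the output shown above for 3.31.1. The point to look for is whether the subquery is reported as correlated (re-evaluated per outer row) or as a list subquery (built once and probed).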
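- Pretty sure your guess is correct: the first one executes the EXISTS query once per row, while the other only has to compute the list of matching rows once and then looks up an entry in it for each row.
- An index on candidates(team) would help both a lot, by the way.
- Hi @Shawn, actually an index on candidates(team) makes the query 5 times slower (even after running ANALYZE;), while a covering index on candidates(team, score) does help. See gist.github.com/nalzok/174c2fe365fb8729a4392aef63348fe0 for my benchmark script and its output on three different platforms.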
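As a rough illustration of the covering-index suggestion from the comments, the sketch below creates an index on candidates(team, score) and prints the plan of the per-team aggregation before and after. The table contents, the index name candidates_team_score and the helper names are assumptions made for this example; the actual measurements are in the linked gist, not here.

```python
import sqlite3

# The aggregation both variants in the question rely on.
TOTALS = "SELECT team, SUM(score) FROM candidates GROUP BY team"


def make_db(n_teams=40, per_team=5):
    """Build a small in-memory table (assumption: stand-in for the real data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE candidates(team INTEGER, score INTEGER)")
    rows = [(t, 10 * t + k) for t in range(n_teams) for k in range(per_team)]
    conn.executemany("INSERT INTO candidates VALUES (?, ?)", rows)
    return conn


def plan(conn, sql):
    """Return the 'detail' column of EXPLAIN QUERY PLAN for the statement."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]


def add_covering_index(conn):
    """Index both columns the query reads, so the table itself need not be visited."""
    # Hypothetical index name; any name works.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS candidates_team_score "
        "ON candidates(team, score)"
    )
    conn.execute("ANALYZE")  # refresh planner statistics, as in the comment


if __name__ == "__main__":
    conn = make_db()
    print("before:", plan(conn, TOTALS))
    add_covering_index(conn)
    print("after: ", plan(conn, TOTALS))
```

An index on (team, score) is already sorted by team and contains every column the aggregation needs, which is the usual reason a covering index can avoid both the temporary b-tree for GROUP BY and the lookups into the table. Whether the planner actually picks it, and how much it helps, should be checked on the real data.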
Tags: sql sqlite optimization query-optimization aggregate-functions