【发布时间】:2015-10-13 14:01:18
【问题描述】:
我想将 IP 路由表信息加入 IP whois 信息。我正在使用 Amazon 的 RDS,这意味着我不能使用 Postgres ip4r 扩展,因此我使用 int8range 类型来表示 IP 地址范围,并使用 gist 索引。
我的表格如下所示:
=> \d routing_details
Table "public.routing_details"
Column | Type | Modifiers
----------+-----------+-----------
asn | text |
netblock | text |
range | int8range |
Indexes:
"idx_routing_details_netblock" btree (netblock)
"idx_routing_details_range" gist (range)
=> \d netblock_details
Table "public.netblock_details"
Column | Type | Modifiers
------------+-----------+-----------
range | int8range |
name | text |
country | text |
source | text |
Indexes:
"idx_netblock_details_range" gist (range)
完整的 routing_details 表包含不到 60 万行,而 netblock_details 包含大约 825 万行。两个表中有重叠的范围,但对于 routing_details 表中的每个范围,我想从 netblock_details 表中获得单个最佳(最小)匹配。
我想出了 2 个不同的查询,我认为它们会返回准确的数据,一个使用窗口函数,一个使用 DISTINCT ON:
EXPLAIN SELECT DISTINCT ON (r.netblock) *
FROM routing_details r JOIN netblock_details n ON r.range <@ n.range
ORDER BY r.netblock, upper(n.range) - lower(n.range);
QUERY PLAN
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Unique (cost=118452809778.47..118477166326.22 rows=581300 width=91)
Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, r.netblock, ((upper(n.range) - lower(n.range)))
-> Sort (cost=118452809778.47..118464988052.34 rows=4871309551 width=91)
Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, r.netblock, ((upper(n.range) - lower(n.range)))
Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
-> Nested Loop (cost=0.00..115920727265.53 rows=4871309551 width=91)
Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, r.netblock, (upper(n.range) - lower(n.range))
Join Filter: (r.range <@ n.range)
-> Seq Scan on public.routing_details r (cost=0.00..11458.96 rows=592496 width=43)
Output: r.asn, r.netblock, r.range
-> Materialize (cost=0.00..277082.12 rows=8221675 width=48)
Output: n.range, n.name, n.country
-> Seq Scan on public.netblock_details n (cost=0.00..163712.75 rows=8221675 width=48)
Output: n.range, n.name, n.country
(14 rows) -> Seq Scan on netblock_details n (cost=0.00..163712.75 rows=8221675 width=48)
EXPLAIN VERBOSE SELECT * FROM (
SELECT *, ROW_NUMBER() OVER (PARTITION BY r.range ORDER BY UPPER(n.range) - LOWER(n.range)) AS rank
FROM routing_details r JOIN netblock_details n ON r.range <@ n.range
) a WHERE rank = 1 ORDER BY netblock;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------
Sort (cost=118620775630.16..118620836521.53 rows=24356548 width=99)
Output: a.asn, a.netblock, a.range, a.range_1, a.name, a.country, a.rank
Sort Key: a.netblock
-> Subquery Scan on a (cost=118416274956.83..118611127338.87 rows=24356548 width=99)
Output: a.asn, a.netblock, a.range, a.range_1, a.name, a.country, a.rank
Filter: (a.rank = 1)
-> WindowAgg (cost=118416274956.83..118550235969.49 rows=4871309551 width=91)
Output: r.asn, r.netblock, r.range, n.range, n.name, n.country, row_number() OVER (?), ((upper(n.range) - lower(n.range))), r.range
-> Sort (cost=118416274956.83..118428453230.71 rows=4871309551 width=91)
Output: ((upper(n.range) - lower(n.range))), r.range, r.asn, r.netblock, n.range, n.name, n.country
Sort Key: r.range, ((upper(n.range) - lower(n.range)))
-> Nested Loop (cost=0.00..115884192443.90 rows=4871309551 width=91)
Output: (upper(n.range) - lower(n.range)), r.range, r.asn, r.netblock, n.range, n.name, n.country
Join Filter: (r.range <@ n.range)
-> Seq Scan on public.routing_details r (cost=0.00..11458.96 rows=592496 width=43)
Output: r.asn, r.netblock, r.range
-> Materialize (cost=0.00..277082.12 rows=8221675 width=48)
Output: n.range, n.name, n.country
-> Seq Scan on public.netblock_details n (cost=0.00..163712.75 rows=8221675 width=48)
Output: n.range, n.name, n.country
(20 rows)
DISTINCT ON 似乎更有效,所以我继续使用那个。当我对完整数据集运行查询时,大约等待 24 小时后出现磁盘空间不足错误。我创建了一个 routing_details_small 表,其中包含完整 routing_details 表的 N 行的子集,以尝试了解发生了什么。
N=1000
=> EXPLAIN ANALYZE SELECT DISTINCT ON (r.netblock) *
-> FROM routing_details_small r JOIN netblock_details n ON r.range <@ n.range
-> ORDER BY r.netblock, upper(n.range) - lower(n.range);
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=4411888.68..4453012.20 rows=999 width=90) (actual time=124.094..133.720 rows=999 loops=1)
-> Sort (cost=4411888.68..4432450.44 rows=8224705 width=90) (actual time=124.091..128.560 rows=4172 loops=1)
Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
Sort Method: external sort Disk: 608kB
-> Nested Loop (cost=0.41..1780498.29 rows=8224705 width=90) (actual time=0.080..101.518 rows=4172 loops=1)
-> Seq Scan on routing_details_small r (cost=0.00..20.00 rows=1000 width=42) (actual time=0.007..1.037 rows=1000 loops=1)
-> Index Scan using idx_netblock_details_range on netblock_details n (cost=0.41..1307.55 rows=41124 width=48) (actual time=0.063..0.089 rows=4 loops=1000)
Index Cond: (r.range <@ range)
Total runtime: 134.999 ms
(9 rows)
N=100000
=> EXPLAIN ANALYZE SELECT DISTINCT ON (r.netblock) *
-> FROM routing_details_small r JOIN netblock_details n ON r.range <@ n.range
-> ORDER BY r.netblock, upper(n.range) - lower(n.range);
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=654922588.98..659034941.48 rows=200 width=144) (actual time=28252.677..29487.380 rows=98992 loops=1)
-> Sort (cost=654922588.98..656978765.23 rows=822470500 width=144) (actual time=28252.673..28926.703 rows=454856 loops=1)
Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
Sort Method: external merge Disk: 64488kB
-> Nested Loop (cost=0.41..119890431.75 rows=822470500 width=144) (actual time=0.079..24951.038 rows=454856 loops=1)
-> Seq Scan on routing_details_small r (cost=0.00..1935.00 rows=100000 width=96) (actual time=0.007..110.457 rows=100000 loops=1)
-> Index Scan using idx_netblock_details_range on netblock_details n (cost=0.41..725.96 rows=41124 width=48) (actual time=0.067..0.235 rows=5 loops=100000)
Index Cond: (r.range <@ range)
Total runtime: 29596.667 ms
(9 rows)
N=250000
=> EXPLAIN ANALYZE SELECT DISTINCT ON (r.netblock) *
-> FROM routing_details_small r JOIN netblock_details n ON r.range <@ n.range
-> ORDER BY r.netblock, upper(n.range) - lower(n.range);
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=1651822953.55..1662103834.80 rows=200 width=144) (actual time=185835.443..190143.266 rows=247655 loops=1)
-> Sort (cost=1651822953.55..1656963394.18 rows=2056176250 width=144) (actual time=185835.439..188779.279 rows=1103850 loops=1)
Sort Key: r.netblock, ((upper(n.range) - lower(n.range)))
Sort Method: external merge Disk: 155288kB
-> Nested Loop (cost=0.28..300651962.46 rows=2056176250 width=144) (actual time=19.325..177403.913 rows=1103850 loops=1)
-> Seq Scan on netblock_details n (cost=0.00..163743.05 rows=8224705 width=48) (actual time=0.007..8160.346 rows=8224705 loops=1)
-> Index Scan using idx_routing_details_small_range on routing_details_small r (cost=0.28..22.16 rows=1250 width=96) (actual time=0.018..0.018 rows=0 loops=8224705)
Index Cond: (range <@ n.range)
Total runtime: 190413.912 ms
(9 rows)
针对具有 600k 行的完整表,查询在大约 24 小时后失败,并出现有关磁盘空间不足的错误,这可能是由外部合并步骤引起的。所以这个查询在一个小的 routing_details 表上运行得很好而且很快,但是扩展性很差。
关于如何改进我的查询的建议,或者甚至我可以进行的架构更改,以便此查询能够在整个数据集上有效地工作?
【问题讨论】:
-
您在填充这些表后是否运行了
ANALYZE routing_details; ANALYZE netblock_details?另外:当您(暂时)省略ORDER BY时,查询计划会发生什么? -
"external sort Disk: 608kB",你对work_mem的设置是什么?您需要更多内存以避免磁盘排序。 SET work_mem TO '....';
-
@joop 我没有,但只是再次尝试并得到相同的结果。当我省略订单时,这里是查询计划gist.github.com/coderholic/92ae27c405cd25e45fa6 - 它很快,但结果不正确
-
@FrankHeikens 使用 RDS 默认值 1MB。更改为 100MB,这使得前几个示例查询更快,但仍然没有足够的内存用于完整数据集
-
我的猜测是
, upper(n.range) - lower(n.range);被视为一组函数,使其无法被外部查询索引。尝试在内部查询中计算它可能有助于/避免排序。 (避免外部查询中的SELECT *也有帮助)
标签: sql postgresql amazon-web-services amazon-rds