【发布时间】:2018-10-14 01:37:13
【问题描述】:
我的任务是在 100 维空间中快速找到最近的邻居。 所以我创建了一个测试表:
create extension cube;
create table vectors (id serial, vector cube);
insert into vectors select id, cube(ARRAY[round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000)]) from generate_series(1, 25000000) id;
搜索请求:
explain analyze SELECT * FROM vectors ORDER BY vector <-> '(705, 501, 321, 345, 591, 58, 229, 420, 341, 628, 84, 476, 700, 71, 815, 616, 45, 686, 886, 102, 378, 172, 263, 538, 665, 553, 475, 845, 540, 963, 893, 209, 479, 357, 914, 70, 415, 142, 490, 756, 770, 574, 232, 470, 645, 47, 86, 690, 733, 972, 792, 112, 144, 55, 650, 810, 608, 125, 655, 148, 88, 548, 357, 567, 905, 271, 637, 320, 413, 128, 76, 183, 702, 308, 653, 347, 355, 739, 37, 88, 711, 829, 200, 856, 884, 850, 665, 493, 975, 320, 641, 63, 869, 998, 630, 774, 269, 268, 94, 682)'::cube LIMIT 10;
如果没有索引,查找最近邻居的请求大约需要 30 秒。
现在我们将创建一个索引:
CREATE INDEX vectors_vector_idx ON vectors USING GIST (vector);
重复搜索请求:
explain analyze SELECT * FROM vectors ORDER BY vector <-> '(705, 501, 321, 345, 591, 58, 229, 420, 341, 628, 84, 476, 700, 71, 815, 616, 45, 686, 886, 102, 378, 172, 263, 538, 665, 553, 475, 845, 540, 963, 893, 209, 479, 357, 914, 70, 415, 142, 490, 756, 770, 574, 232, 470, 645, 47, 86, 690, 733, 972, 792, 112, 144, 55, 650, 810, 608, 125, 655, 148, 88, 548, 357, 567, 905, 271, 637, 320, 413, 128, 76, 183, 702, 308, 653, 347, 355, 739, 37, 88, 711, 829, 200, 856, 884, 850, 665, 493, 975, 320, 641, 63, 869, 998, 630, 774, 269, 268, 94, 682)'::cube LIMIT 10;
Limit (cost=0.55..55.59 rows=10 width=820) (actual time=894342.029..1454440.760 rows=10 loops=1)
-> Index Scan using vectors_vector_idx0 on vectors (cost=0.55..137606356.86 rows=24999816 width=820) (actual time=894342.027..1454440.754 rows=10 loops=1)
Order By: (vector <-> '(705, 501, 321, 345, 591, 58, 229, 420, 341, 628, 84, 476, 700, 71, 815, 616, 45, 686, 886, 102, 378, 172, 263, 538, 665, 553, 475, 845, 540, 963, 893, 209, 479, 357, 914, 70, 415, 142, 490, 756, 770, 574, 232, 470, 645, 47, 86, 690, 733, 972, 792, 112, 144, 55, 650, 810, 608, 125, 655, 148, 88, 548, 357, 567, 905, 271, 637, 320, 413, 128, 76, 183, 702, 308, 653, 347, 355, 739, 37, 88, 711, 829, 200, 856, 884, 850, 665, 493, 975, 320, 641, 63, 869, 998, 630, 774, 269, 268, 94, 682)'::cube)
Planning time: 0.131 ms
Execution time: 1454440.849 ms
(5 rows)
现在查询执行大约 20 分钟。 如何通过索引加快搜索速度?
【问题讨论】:
-
你也可以显示
EXPLAINs 的输出吗? -
我添加了输出
-
统计数据完全关闭。运行
vacuum analyze vectors;,然后再次检查计划 -
我已经完成了真空,但我会再试一次
-
重试没有帮助。我尝试使用具有 100 万行的表。真空有帮助。而用 2500 万是行不通的。也许我需要一些设置来设置数据库?
标签: postgresql multidimensional-array indexing cube