【问题标题】:postgresql index 100 dimensional and 25 million rows tablepostgresql 索引 100 维和 2500 万行的表
【发布时间】:2018-10-14 01:37:13
【问题描述】:

我的任务是在 100 维空间中快速找到最近的邻居。 所以我创建了一个测试表:

create extension cube;
create table vectors (id serial, vector cube);
insert into vectors select id, cube(ARRAY[round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000), round(random()*1000)]) from generate_series(1, 25000000) id;

搜索请求:

explain analyze SELECT * FROM vectors ORDER BY vector <-> '(705, 501, 321, 345, 591, 58, 229, 420, 341, 628, 84, 476, 700, 71, 815, 616, 45, 686, 886, 102, 378, 172, 263, 538, 665, 553, 475, 845, 540, 963, 893, 209, 479, 357, 914, 70, 415, 142, 490, 756, 770, 574, 232, 470, 645, 47, 86, 690, 733, 972, 792, 112, 144, 55, 650, 810, 608, 125, 655, 148, 88, 548, 357, 567, 905, 271, 637, 320, 413, 128, 76, 183, 702, 308, 653, 347, 355, 739, 37, 88, 711, 829, 200, 856, 884, 850, 665, 493, 975, 320, 641, 63, 869, 998, 630, 774, 269, 268, 94, 682)'::cube LIMIT 10;

如果没有索引,查找最近邻居的请求大约需要 30 秒。

现在我们将创建一个索引:

CREATE INDEX vectors_vector_idx ON vectors USING GIST (vector);

重复搜索请求:

explain analyze SELECT * FROM vectors ORDER BY vector <-> '(705, 501, 321, 345, 591, 58, 229, 420, 341, 628, 84, 476, 700, 71, 815, 616, 45, 686, 886, 102, 378, 172, 263, 538, 665, 553, 475, 845, 540, 963, 893, 209, 479, 357, 914, 70, 415, 142, 490, 756, 770, 574, 232, 470, 645, 47, 86, 690, 733, 972, 792, 112, 144, 55, 650, 810, 608, 125, 655, 148, 88, 548, 357, 567, 905, 271, 637, 320, 413, 128, 76, 183, 702, 308, 653, 347, 355, 739, 37, 88, 711, 829, 200, 856, 884, 850, 665, 493, 975, 320, 641, 63, 869, 998, 630, 774, 269, 268, 94, 682)'::cube LIMIT 10;
Limit  (cost=0.55..55.59 rows=10 width=820) (actual time=894342.029..1454440.760 rows=10 loops=1)
->  Index Scan using vectors_vector_idx0 on vectors  (cost=0.55..137606356.86 rows=24999816 width=820) (actual time=894342.027..1454440.754 rows=10 loops=1)
     Order By: (vector <-> '(705, 501, 321, 345, 591, 58, 229, 420, 341, 628, 84, 476, 700, 71, 815, 616, 45, 686, 886, 102, 378, 172, 263, 538, 665, 553, 475, 845, 540, 963, 893, 209, 479, 357, 914, 70, 415, 142, 490, 756, 770, 574, 232, 470, 645, 47, 86, 690, 733, 972, 792, 112, 144, 55, 650, 810, 608, 125, 655, 148, 88, 548, 357, 567, 905, 271, 637, 320, 413, 128, 76, 183, 702, 308, 653, 347, 355, 739, 37, 88, 711, 829, 200, 856, 884, 850, 665, 493, 975, 320, 641, 63, 869, 998, 630, 774, 269, 268, 94, 682)'::cube)
 Planning time: 0.131 ms
 Execution time: 1454440.849 ms
(5 rows)

现在查询执行大约 20 分钟。 如何通过索引加快搜索速度?

【问题讨论】:

  • 你也可以显示EXPLAINs 的输出吗?
  • 我添加了输出
  • 统计数据完全关闭。运行vacuum analyze vectors;,然后再次检查计划
  • 我已经完成了真空,但我会再试一次
  • 重试没有帮助。我尝试使用具有 100 万行的表。真空有帮助。而用 2500 万是行不通的。也许我需要一些设置来设置数据库?

标签: postgresql multidimensional-array indexing cube


【解决方案1】:

问题与此任务的少量 RAM (64 GB) 有关。看起来该表已完全加载到 RAM 中,然后进行了搜索。使用索引,表重 100 GB。

【讨论】:

  • 感谢分享。 EXPLAIN (BUFFERS) 对确定类似的事情很有用,不是吗?
  • Laurenz Albe,解释(缓冲区)对我没有多大帮助。通过查看请求执行期间填充 RAM 的时间表,我能够理解这一点。
  • 如果问题是由小的shared_buffers引起的,你应该不仅看到“blocks hit”,还应该看到“blocks read”甚至“blockswritten”。
猜你喜欢
  • 1970-01-01
  • 2011-04-22
  • 1970-01-01
  • 2022-11-15
  • 2015-07-20
  • 1970-01-01
  • 1970-01-01
  • 1970-01-01
  • 2021-08-28
相关资源
最近更新 更多