【Posted】: 2022-01-04 21:26:53
【Question】:
I have a table with roughly 7 million records. The table has first_name and last_name columns, and I want to search it using the levenshtein() distance function.
select levenshtein('JOHN', first_name) as fn_distance,
levenshtein('DOE', last_name) as ln_distance,
id,
first_name as "firstName",
last_name as "lastName"
from person
where first_name is not null
and last_name is not null
and levenshtein('JOHN', first_name) <= 2
and levenshtein('DOE', last_name) <= 2
order by 1, 2
limit 50;
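The search above is slow (4 - 5 seconds). What can I do to improve its performance? Should I create indexes on the two columns, or do something else?
After I added the indexes below: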
create index first_name_idx on person using gin (first_name gin_trgm_ops);
create index last_name_idx on person using gin(last_name gin_trgm_ops);
the query now takes about 11 seconds. :(
New query:
select similarity('JOHN', first_name) as fnsimilarity,
similarity('DOW', last_name) as lnsimilarity,
first_name as "firstName",
last_name as "lastName",
npi
from person
where first_name is not null
and last_name is not null
and similarity('JOHN', first_name) >= 0.2
and similarity('DOW', last_name) >= 0.2
order by 1 desc, 2 desc, npi
limit 50;
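A note on the new query: a pg_trgm GIN index can support the % similarity operator (as well as LIKE, ILIKE and regular-expression matches), but not a comparison on the result of the similarity() function, so a condition such as similarity('JOHN', first_name) >= 0.2 cannot use the indexes created above. The following is a minimal sketch of the same query rewritten with %, assuming the pg_trgm extension and the two indexes from the question are in place. The 0.2 value simply mirrors the cutoff in the question. Whether this is actually faster on this data is not something the post confirms; a threshold this low can match a large share of 7 million rows, so the plan is worth checking.

```sql
-- Sketch only: assumes pg_trgm is installed and the GIN indexes
-- first_name_idx / last_name_idx from the question exist.
-- The threshold applies to every % comparison in the current session.
set pg_trgm.similarity_threshold = 0.2;

-- explain (analyze, buffers) shows whether the trigram indexes are used.
explain (analyze, buffers)
select similarity('JOHN', first_name) as fnsimilarity,
       similarity('DOW', last_name) as lnsimilarity,
       first_name as "firstName",
       last_name as "lastName",
       npi
from person
where first_name % 'JOHN'   -- index-supported form of the similarity filter
  and last_name % 'DOW'
order by 1 desc, 2 desc, npi
limit 50;
```

If the plan still shows a sequential scan, or the bitmap index scans return a very large number of rows, raising the threshold is the usual lever, at the cost of fewer fuzzy matches.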
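【Comments】:
- The levenshtein() function takes two input string arguments. One can correspond to a column of the person table, either first_name or last_name; the other is the value you want to compute the distance to. That value is not a constant, nor is it a parameter known when the index is computed on insert or update events on the person table. So I don't see how this function could be used in an index.
- You could consider the full-text search capabilities, which can rely on indexes; see the manual.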
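Since an index cannot hold levenshtein() results for a search term that is not known in advance, one option that keeps the original results is to make each per-row computation cheaper. The fuzzystrmatch extension, which provides levenshtein(), also provides levenshtein_less_equal(source, target, max_d), an accelerated variant for cases where only small distances matter. It returns the exact distance when that distance is at most max_d, and otherwise some value greater than max_d. The sketch below applies it to the first query from the question; it still scans every row, and how much it helps with short names is an assumption to verify, not a measured result.

```sql
-- Sketch only: assumes the fuzzystrmatch extension is installed.
-- Rows that pass the filter have distance <= 2, so the values in the
-- select list are exact distances and the ordering matches the original query.
select levenshtein_less_equal('JOHN', first_name, 2) as fn_distance,
       levenshtein_less_equal('DOE', last_name, 2) as ln_distance,
       id,
       first_name as "firstName",
       last_name as "lastName"
from person
where first_name is not null
  and last_name is not null
  and levenshtein_less_equal('JOHN', first_name, 2) <= 2
  and levenshtein_less_equal('DOE', last_name, 2) <= 2
order by 1, 2
limit 50;
```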
Tags: postgresql postgresql-10 postgresql-performance