postgres: fulltext search: fastest way to find duplicate text rows? (Answers)

【Question title】: postgres: fulltext search: fastest way to find duplicate text rows?
【Posted】: 2019-10-06 07:32:11
【Question】:

:-)

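What is the fastest way to find duplicate text in a table, i.e., rows whose text in one column occurs at least twice in the whole table? The table contains more than 160 million rows.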

I have a table with the following columns: id, maintext, and maintext_token, the latter created using to_tsvector(maintext). In addition, I created a GIN index on maintext_token, i.e., create index idx_maintext_tokens on tablename using gin(maintext_token);

Currently, I am using the following, but it takes quite a long time:

select maintext, count(maintext)
from ccnc
group by maintext
having count(maintext)>1
order by maintext;

I also tried the same thing, but comparing on the maintext_token column instead of maintext:

select maintext_token, count(maintext_token)
from ccnc
group by maintext_token
having count(maintext_token)>1
order by maintext_token;

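Both queries seem to run for a very long time, although I expected at least the second one to be much faster, since postgres could use the index for the comparison.

Thanks in advance for any insights! Cheers :)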

【Comments on the question】:

Tags: postgresql

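【Solution 1】:

You said you want to test for equality, so you probably want to hash the text and then search on the hash. You can do this with a hash index, or you can index a hash of the text. I recently got some help on a related question; you can find the details and a comparison here: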

Searching on expression indexes
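
To make the "index a hash of the text" idea concrete, here is a minimal sketch. It assumes the ccnc table and maintext column from the question; the index name idx_ccnc_maintext_md5 and the choice of md5() as the hash are my own assumptions, not something the linked post prescribes. Note also that a GIN index on a tsvector serves match queries (@@) rather than grouping or sorting, and that to_tsvector normalizes the text (stemming, dropping stop words), so equal tsvectors do not imply equal texts.

```sql
-- Assumption: table ccnc(id, maintext, ...) as in the question.
-- The index name is hypothetical; md5() is a built-in PostgreSQL function.
CREATE INDEX idx_ccnc_maintext_md5 ON ccnc (md5(maintext));

-- Group on the 32-character hash instead of the full text.
-- Two different texts could in principle share a hash, so treat
-- the result as candidates for exact duplicates.
SELECT md5(maintext) AS maintext_hash, count(*) AS copies
FROM ccnc
GROUP BY md5(maintext)
HAVING count(*) > 1;

-- Look up the rows behind one specific text via the expression index.
-- The extra equality check on maintext rules out hash collisions.
SELECT id, maintext
FROM ccnc
WHERE md5(maintext) = md5('some text')
  AND maintext = 'some text';
```

Whether the planner actually reads the index for the grouping query, or scans the table and hashes on the fly, depends on the data and settings, so it is worth checking with EXPLAIN before relying on it. The lookup query at the end is the case where an expression index (or a hash index on maintext) most clearly helps.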

【Comments】: