PostgreSQL distinct count: improving performance (answers)

【Question title】: PostgreSQL distinct count: improve performance
【Posted】: 2025-11-23 03:15:01
【Question description】:

I have a database on Google Cloud SQL. It contains a simple table that looks like this:

url_id user_id

url_id is a string containing an integer, and user_id is a 14-character string. I have an index on url_id:

CREATE INDEX index_test ON table1 (url_id);

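The query I want to run counts the distinct user_id values (the column the query below appears to call afficheW) whose url_id is not in a given list of ids. This is how I do it: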

 SET work_mem='4GB';
 select count(*) from (select distinct afficheW from table1 where url_id != '1880' and url_id != '2022' and url_id != '1963' and url_id != '11' and url_id != '32893' and url_id != '19' ) t ;

Result:

 count  
---------
 1242298
(1 row)

Time: 2118,917 ms

The table contains 1.8 million rows. Is there a way to make this kind of query faster?

【Question comments】:

Please show the explain (analyze, buffers) output for your current query.
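
For reference, a sketch of how that plan could be captured for the query in the question. Note that EXPLAIN with the ANALYZE option actually executes the statement, so the reported timings come from a real run.

```sql
-- Runs the query and reports the actual plan, timings, and buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM (
    SELECT DISTINCT afficheW
    FROM table1
    WHERE url_id != '1880' AND url_id != '2022' AND url_id != '1963'
      AND url_id != '11' AND url_id != '32893' AND url_id != '19'
) t;
```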

Tags: sql postgresql count

【Solution 1】:

Try writing it like this:

select count(distinct afficheW)
from table1
where url_id not in (1880, 2022, 1963, 11, 32893, 19);

(This assumes url_id really is a number, not a string.)

Then add an index on table1(url_id, affichew).
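
A minimal sketch of that index, assuming the table and column names from the question. The index name is made up for illustration, and the ANALYZE step is an addition here, not part of the answer.

```sql
-- Hypothetical index name; columns in the order suggested above.
CREATE INDEX table1_url_id_affichew_idx ON table1 (url_id, afficheW);

-- Refresh planner statistics after creating the index.
ANALYZE table1;
```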

That said, counting more than a million distinct items from a table in two seconds is not that bad.

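【Comments】:

【Solution 2】:

Unless your WHERE condition eliminates most of the rows and you can use a partial index, the most promising index would be on (affichew, url_id). That way the query can use an index-only scan, filter on url_id without visiting the table, and read the rows out in the right order to apply uniqueness to them, with no need to sort or hash.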
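
A sketch of the index described above, with a hypothetical index name. An index-only scan can skip the table only for pages marked all-visible in the visibility map, so vacuuming first is usually worthwhile; that step is an addition here, not something the answer states.

```sql
-- Hypothetical index name; afficheW leads so rows come out in distinct order.
CREATE INDEX table1_affichew_url_id_idx ON table1 (afficheW, url_id);

-- Update the visibility map and planner statistics.
VACUUM (ANALYZE) table1;
```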

Also, I found that writing it as not in is slightly faster than using a list of ANDed != conditions.
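
With url_id stored as a string, as in the question, the not in list should use quoted literals, since PostgreSQL has no operator for comparing a text column with bare integers. A sketch of the question's query in that form:

```sql
-- Same filter as the question, written as NOT IN with string literals.
SELECT count(*)
FROM (
    SELECT DISTINCT afficheW
    FROM table1
    WHERE url_id NOT IN ('1880', '2022', '1963', '11', '32893', '19')
) t;
```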

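【Comments】:

What query (or queries) would you suggest so that an index-only scan can be used? Also, should the url_id column be changed to an integer (rather than a string containing an integer) to improve speed?
The query you showed. However, the planner may not choose the index-only scan on its own; I had to set enable_hashagg = off to get the slightly faster index-only scan to be used. And yes, if the value is an integer, it should be stored as an integer.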
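
The two points in this reply could be tried roughly as follows. SET LOCAL confines the planner setting to one transaction, which is a choice made here for safety and not something the comment specifies. The type change assumes every url_id value really is a valid integer; it rewrites the table and holds an exclusive lock while it runs.

```sql
-- Discourage hash aggregation for this transaction only, then inspect the plan.
BEGIN;
SET LOCAL enable_hashagg = off;
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM (
    SELECT DISTINCT afficheW
    FROM table1
    WHERE url_id NOT IN ('1880', '2022', '1963', '11', '32893', '19')
) t;
COMMIT;

-- Convert url_id from a string to an integer (fails if any value is not numeric).
ALTER TABLE table1
    ALTER COLUMN url_id TYPE integer USING url_id::integer;
```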

【Solution 3】:

Another approach is to use group by instead of distinct:

select
    afficheW
    , count(*)
from
    table1
where
    url_id not in (1880, 2022, 1963, 11, 32893, 19)
group by afficheW;
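
This query returns one row per afficheW value along with its row count, so the number of distinct values is the number of rows it returns. A sketch of getting that single number by wrapping it in an outer count; quoted literals are used here because they work whether url_id is a string or an integer column.

```sql
-- Count the groups instead of listing them.
SELECT count(*)
FROM (
    SELECT afficheW
    FROM table1
    WHERE url_id NOT IN ('1880', '2022', '1963', '11', '32893', '19')
    GROUP BY afficheW
) t;
```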

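In this case you will most likely need to create a multicolumn index on afficheW and url_id (as suggested and explained by @jjanes and @GordonLinoff). I think url_id should be the first column in this multicolumn index, since you have an explicit condition on it in the where clause.

If the performance of this query is critical, you could use a partial index on afficheW, restricted to the rows whose url_id satisfies your where clause.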
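
A sketch of such a partial index, with a hypothetical index name. The planner uses a partial index only when it can prove that the query's where clause implies the index predicate, so the safest approach is to write the two identically.

```sql
-- Hypothetical index name; only rows passing the filter are indexed.
CREATE INDEX table1_affichew_partial_idx
    ON table1 (afficheW)
    WHERE url_id NOT IN ('1880', '2022', '1963', '11', '32893', '19');
```

The trade-off is that this index serves only this particular list of excluded ids; a different list would need a different index.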

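Like @GordonLinoff, I also assume url_id is numeric (or should be numeric, to save disk space and improve performance), and I also use not in (...) as a more readable way of writing multiple != conditions.

See also:

Column ordering in multicolumn indexes (with benchmarks): Multicolumn index and performance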

【Comments】:

【Solution 4】:

You could try running just a single-level distinct count query here:

select count(distinct afficheW)
from table1
where url_id != '1880' and url_id != '2022' and url_id != '1963' and
      url_id != '11' and url_id != '32893' and url_id != '19';

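This at least avoids the explicit outer count query, which does not need to be there.

【Comments】:

This might backfire. count(distinct ...) has never been taught how to use hash aggregation or parallel query.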