Optimize Select Distinct

【Question title】: Optimize Select Distinct
【Posted】: 2014-04-12 01:36:44
【Question description】:

Here is the problem.

Some sample data:

cats=# select * from cats limit 8;
   id    | color |  breed  
---------+-------+---------
 4380929 | grey  | persian
 4380930 | grey  | siese
 4380931 | white | persian
 4380932 | white | siamese
 4380933 | grey  | persian
 4380934 | grey  | siese
 4380935 | white | persian
 4380936 | white | siamese
(8 rows)

Here is how to build the database:

psql postgres postgres -c "CREATE DATABASE cats;"
psql cats postgres -c 'CREATE SEQUENCE cat_id_seq;'
psql cats postgres -c "CREATE TABLE cats (id BIGINT NOT NULL default nextval('cat_id_seq'), color text, breed text);"
bash -c 'for i in `seq 1 1000000` ; do echo -e "white\tpersian\nwhite\tsiamese\ngrey\tpersian\ngrey\tsiese"; done;' > /tmp/cats.sql
psql cats postgres -c "COPY cats (color, breed) FROM '/tmp/cats.sql'"

Here is the query:

psql cats postgres -c "select distinct((color,breed)) from cats;"

Running this query gives me the following plan and timing:

 Unique  (cost=783138.21..805138.22 rows=6 width=12) (actual time=69816.259..81338.631 rows=5 loops=1)
   ->  Sort  (cost=783138.21..794138.22 rows=4400001 width=12) (actual time=69816.258..80412.546 rows=4400001 loops=1)
     Sort Key: (ROW(color, breed))
     Sort Method: external merge  Disk: 189456kB
     ->  Seq Scan on cats  (cost=0.00..72026.01 rows=4400001 width=12) (actual time=0.013..846.713 rows=4400001 loops=1)
 Total runtime: 81363.373 ms
(6 rows)

Output:

(grey,persian)
(grey,siamese)
(grey,siese)
(white,persian)
(white,siamese)
(5 rows)

Do you know how I can make this faster?

The following works, but only for a single attribute, not for two as in my case: http://zogovic.com/post/44856908222/optimizing-postgresql-query-for-distinct-values

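I think I need an index on (color, breed), and then:

create temporary table TEMP (color, breed)
insert into TEMP (select (color, breed) from cats where (color, breed) is not in TEMP)
repeat until there is nothing more to insert...
select * from TEMP

But I don't know how to write this in Postgres (without many round trips). Should I use RECURSIVE? Or plpgsql?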
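The loop described above could be written in plpgsql roughly as follows. This is only a sketch under stated assumptions, not a tested solution from the thread: the function name `distinct_cat_pairs` and the output column names `out_color` and `out_breed` are hypothetical, an index on (color, breed) is assumed to exist, and the columns are assumed to contain no NULLs. Instead of filling a temporary table, it returns each pair as soon as it is found.

```sql
-- Hypothetical helper: walks the (color, breed) index one distinct pair at a time.
CREATE OR REPLACE FUNCTION distinct_cat_pairs()
RETURNS TABLE (out_color text, out_breed text)
LANGUAGE plpgsql AS $$
BEGIN
    -- Start with the smallest pair in index order.
    SELECT c.color, c.breed INTO out_color, out_breed
    FROM cats c
    ORDER BY c.color, c.breed
    LIMIT 1;

    -- FOUND is set by each SELECT ... INTO; the loop ends when no larger pair exists.
    WHILE FOUND LOOP
        RETURN NEXT;  -- emit the current (out_color, out_breed)

        -- Jump to the next pair strictly greater than the current one.
        SELECT c.color, c.breed INTO out_color, out_breed
        FROM cats c
        WHERE (c.color, c.breed) > (out_color, out_breed)
        ORDER BY c.color, c.breed
        LIMIT 1;
    END LOOP;
END;
$$;

-- Usage: a single call from the client, so no per-pair round trips.
SELECT * FROM distinct_cat_pairs();
```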

Thanks!

【Comments on the question】:

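The execution plan looks fine. You should probably optimize the sort operation: give it more memory so that it does not have to spill to disk. Sorting a mere 4M rows is taking 81 seconds. Alternatively, force a HashAggregate, which seems like a good idea since the number of groups is very small.
Running *which* query? The COPY? An index cannot help with that; quite the contrary. But the EXPLAIN output is for a different query than the one in your question... I suspect you want a solution like this one or this one.
You are absolutely right, that was a typo, and I have fixed it now. Thanks.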
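The two suggestions in the first comment can be tried roughly as follows. This is a minimal sketch, not a tuned configuration: the `256MB` value is an assumption chosen only for illustration, and whether the planner then picks an in-memory sort or a HashAggregate depends on the PostgreSQL version and the table statistics.

```sql
-- Assumption: 256MB is an illustrative value. An in-memory sort can need more
-- memory than the 189456kB the external merge wrote to disk, so this may
-- still not be enough to keep the whole sort in memory.
SET work_mem = '256MB';   -- affects the current session only

-- Plain columns instead of a ROW(...) expression. With only a handful of
-- distinct groups, the planner is free to choose a HashAggregate here.
EXPLAIN ANALYZE
SELECT color, breed
FROM cats
GROUP BY color, breed;

RESET work_mem;           -- go back to the configured default
```

Comparing the plan node at the top of the output (Sort plus Unique versus HashAggregate) before and after the change shows which strategy the planner actually chose.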

Tags: sql postgresql query-optimization

【Solution 1】:

So, after a lot of work, here is the solution.

First, create the index:

create index ON cats (color,breed);

Then, as a baseline, the simple query:

cats=# select distinct color,breed from cats;
       row       
-----------------
 (a,b)
 (c,d)
 (grey,persian)
 (grey,siamese)
 (grey,siese)
 (white,persian)
 (white,siamese)
(7 rows)

Time: 853.550 ms

Now the version you will want to use:

WITH RECURSIVE distinct_pairs AS (
    (
        SELECT c as cl FROM cats c where color IS NOT NULL AND breed IS NOT NULL order by c.color,c.breed LIMIT 1
    )
    UNION ALL
    SELECT (
        SELECT c
        FROM cats c
        WHERE
            (c.color,c.breed) > ((p.cl).color,(p.cl).breed)
        ORDER BY c.color,c.breed LIMIT 1
    )
    FROM distinct_pairs p
    WHERE (p.cl).id IS NOT NULL
) SELECT * FROM distinct_pairs p WHERE (p.cl).id IS NOT NULL;
         cl          
---------------------
 (4400007,a,b)
 (5,grey,persian)
 (6,grey,siamese)
 (400006,grey,siese)
 (2,white,persian)
 (4,white,siamese)
(6 rows)

Time: 0.646 ms

About 1300 times faster. Not bad.
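A variant of the recursive query that returns only the two columns, rather than whole rows, can be sketched with a LATERAL subquery. This is not the answer author's query, and it rests on several assumptions: it needs PostgreSQL 9.3 or later for LATERAL, it assumes the (color, breed) index above exists, it assumes neither column contains NULLs, and the names `pairs` and `nxt` are made up for illustration.

```sql
WITH RECURSIVE pairs AS (
    (
        -- Anchor: the smallest (color, breed) pair in index order.
        SELECT color, breed
        FROM cats
        WHERE color IS NOT NULL AND breed IS NOT NULL
        ORDER BY color, breed
        LIMIT 1
    )
    UNION ALL
    -- Step: for the pair found last, look up the next strictly greater pair.
    SELECT nxt.color, nxt.breed
    FROM pairs p
    CROSS JOIN LATERAL (
        SELECT c.color, c.breed
        FROM cats c
        WHERE (c.color, c.breed) > (p.color, p.breed)
        ORDER BY c.color, c.breed
        LIMIT 1
    ) nxt
)
SELECT color, breed FROM pairs;
```

Because the lateral subquery returns no row once the largest pair has been reached, the recursion stops by itself and no filter on a NULL sentinel row is needed in the outer query.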

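Thanks to:

【Comments】:

【Solution 2】:

Why the parentheses? Do you know what they do, and do you need them?

If I drop them, the query runs twenty times faster for me: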

select distinct color, breed from cats;

Wrapping the columns into a record, and then unpacking that record for every sort comparison, is a lot of work.
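The difference between the two forms can be checked side by side. This is a sketch for comparison only; the timings will depend on the machine, the PostgreSQL version, and settings such as work_mem.

```sql
-- Parenthesised list: a row constructor, so the result has a single column
-- of a composite (record) type and DISTINCT compares whole records.
EXPLAIN ANALYZE
SELECT DISTINCT (color, breed) FROM cats;

-- Bare column list: two ordinary text columns that are compared directly.
EXPLAIN ANALYZE
SELECT DISTINCT color, breed FROM cats;
```

In the first plan the key appears as ROW(color, breed), as in the plan shown in the question; in the second it should list the two columns individually.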

【Comments】:

Yes, you are absolutely right, it is much faster for me too. But it can still be improved a great deal (see my answer).