【Title】: MySQL GROUP BY optimization
【Posted】: 2010-11-26 04:27:13
【Question】:

This question is a more specific version of a previous question I asked.

The table


CREATE TABLE Test4_ClusterMatches 
(
    `match_index` INT UNSIGNED,
    `cluster_index` INT UNSIGNED, 
    `id` INT NOT NULL AUTO_INCREMENT,
    `tfidf` FLOAT,
    PRIMARY KEY (`cluster_index`,`match_index`,`id`)
);

The query I want to run


mysql> explain SELECT `match_index`, SUM(`tfidf`) AS total 
FROM Test4_ClusterMatches WHERE `cluster_index` IN (1,2,3 ... 3000) 
GROUP BY `match_index`;

The problem with the query


It uses a temporary table and filesort, so it is slow:

+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows  | Extra                                                     |
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
|  1 | SIMPLE      | Test4_ClusterMatches | range | PRIMARY       | PRIMARY | 4       | NULL | 51540 | Using where; Using index; Using temporary; Using filesort | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+-------+-----------------------------------------------------------+
With the current index, the query would first need to sort on cluster_index to eliminate the temporary table and filesort, but doing so gives the wrong results for SUM(tfidf). Changing the primary key to
PRIMARY KEY (`match_index`,`cluster_index`,`id`)

avoids the filesort and temporary table, but it examines 14,932,441 rows, so it is also slow:

+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
| id | select_type | table                | type  | possible_keys | key     | key_len | ref  | rows     | Extra                    |
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+
|  1 | SIMPLE      | Test5_ClusterMatches | index | NULL          | PRIMARY | 16      | NULL | 14932441 | Using where; Using index | 
+----+-------------+----------------------+-------+---------------+---------+---------+------+----------+--------------------------+

Tight index scan


Using a tight index scan, searching on just a single cluster_index,

mysql> explain SELECT match_index, SUM(tfidf) AS total
FROM Test4_ClusterMatches WHERE cluster_index =3000 
GROUP BY match_index;
eliminates the temporary table and filesort:
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
| id | select_type | table                | type | possible_keys | key     | key_len | ref   | rows | Extra                    |
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
|  1 | SIMPLE      | Test4_ClusterMatches | ref  | PRIMARY       | PRIMARY | 4       | const |   27 | Using where; Using index | 
+----+-------------+----------------------+------+---------------+---------+---------+-------+------+--------------------------+
I'm not sure whether this can be exploited by some magic sql-fu I haven't come across yet?

The question


How can I change my query so that it uses the 3,000 cluster_indexes and avoids the temporary table and filesort, without examining 14,932,441 rows?



Update


Using the table

CREATE TABLE Test6_ClusterMatches 
(
  match_index INT UNSIGNED,
  cluster_index INT UNSIGNED, 
  id INT NOT NULL AUTO_INCREMENT,
  tfidf FLOAT,
  PRIMARY KEY (id),
  UNIQUE KEY(cluster_index,match_index)
);

Then the following query returns the 10-row result set in (0.41 sec) :)

SELECT `match_index`, SUM(`tfidf`) AS total FROM Test6_ClusterMatches WHERE 
`cluster_index` IN (.....)
GROUP BY `match_index` ORDER BY total DESC LIMIT 0,10;

But it uses a temporary table and filesort:

+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
| id | select_type | table                | type  | possible_keys | key           | key_len | ref  | rows  | Extra                                        |
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+
|  1 | SIMPLE      | Test6_ClusterMatches | range | cluster_index | cluster_index | 5       | NULL | 78663 | Using where; Using temporary; Using filesort | 
+----+-------------+----------------------+-------+---------------+---------------+---------+------+-------+----------------------------------------------+

I'm wondering if there is a way to make it faster by eliminating Using temporary and Using filesort?
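One variation I could still try (a sketch on my part, untested - the table name Test7_ClusterMatches is made up here): since the query only touches cluster_index, match_index and tfidf, appending tfidf to the secondary index would make it covering, so the rows could be read from the index alone, even if the filesort for ORDER BY total remains:

```sql
-- Hypothetical covering-index variant of Test6_ClusterMatches.
-- tfidf is appended to the key so the SELECT can be answered from
-- the index alone (EXPLAIN should then also show "Using index").
CREATE TABLE Test7_ClusterMatches
(
  match_index   INT UNSIGNED,
  cluster_index INT UNSIGNED,
  id            INT NOT NULL AUTO_INCREMENT,
  tfidf         FLOAT,
  PRIMARY KEY (id),
  KEY covering_idx (cluster_index, match_index, tfidf)
);
```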

【Question comments】:

Tags: optimization mysql query-optimization


【Solution 1】:

I had a quick look and this is what I came up with - hopefully it helps...

SQL table

drop table if exists cluster_matches;
create table cluster_matches
(
 cluster_id int unsigned not null,
 match_id int unsigned not null,
 ...
 tfidf float not null default 0,
 primary key (cluster_id, match_id) -- if this isn't unique, add id to the end !!
)
engine=innodb;

Test data

select count(*) from cluster_matches

count(*)
========
17974591

select count(distinct(cluster_id)) from cluster_matches;

count(distinct(cluster_id))
===========================
1000000

select count(distinct(match_id)) from cluster_matches;

count(distinct(match_id))
=========================
6000

explain select
 cm.match_id,
 sum(tfidf) as sum_tfidf,
 count(*) as count_tfidf
from
 cluster_matches cm
where
 cm.cluster_id between 5000 and 10000
group by
 cm.match_id
order by
 sum_tfidf desc limit 10;

id  select_type table   type    possible_keys   key key_len ref rows    Extra
==  =========== =====   ====    =============   === ======= === ====    =====
1   SIMPLE  cm  range   PRIMARY PRIMARY 4       290016  Using where; Using temporary; Using filesort

runtime - 0.067 seconds.

A runtime of 0.067 seconds is pretty respectable, but I think we can do better.

Stored procedure

You'll have to forgive me for not wanting to type out/pass in a list of 5,000+ random cluster_ids!

call sum_cluster_matches(null,1); -- for testing
call sum_cluster_matches('1,2,3,4,....5000',1);

Most of the stored procedure isn't very elegant; all it does is split a CSV string into individual cluster_ids and populate a temporary table.

drop procedure if exists sum_cluster_matches;

delimiter #

create procedure sum_cluster_matches
(
in p_cluster_id_csv varchar(65535),
in p_show_explain tinyint unsigned
)
proc_main:begin

declare v_id varchar(10);
declare v_done tinyint unsigned default 0;
declare v_idx int unsigned default 1;

    create temporary table tmp(cluster_id int unsigned not null primary key);   

    -- not very elegant - split the string into tokens and put them into a temp table...

    if p_cluster_id_csv is not null then
        while not v_done do
            set v_id = trim(substring(p_cluster_id_csv, v_idx, 
                if(locate(',', p_cluster_id_csv, v_idx) > 0, 
                        locate(',', p_cluster_id_csv, v_idx) - v_idx, length(p_cluster_id_csv))));

            if length(v_id) > 0 then
                set v_idx = v_idx + length(v_id) + 1;
                insert ignore into tmp values(v_id);
            else
                set v_done = 1;
            end if;
        end while;
    else
        -- instead of passing in a huge comma separated list of cluster_ids im cheating here to save typing
        insert into tmp select cluster_id from clusters where cluster_id between 5000 and 10000;
        -- end cheat
    end if;

    if p_show_explain then

        select count(*) as count_of_tmp from tmp;

        explain
        select
         cm.match_id,
         sum(tfidf) as sum_tfidf,
         count(*) as count_tfidf
        from
         cluster_matches cm
        inner join tmp on tmp.cluster_id = cm.cluster_id
        group by
         cm.match_id
        order by
         sum_tfidf desc limit 10;
    end if;

    select
     cm.match_id,
     sum(tfidf) as sum_tfidf,
     count(*) as count_tfidf
    from
     cluster_matches cm
    inner join tmp on tmp.cluster_id = cm.cluster_id
    group by
     cm.match_id
    order by
     sum_tfidf desc limit 10;

    drop temporary table if exists tmp;

end proc_main #

delimiter ;
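As an aside, on later MySQL versions (8.0+) the manual token-splitting loop above could arguably be replaced with a single JSON_TABLE call - a sketch, assuming the CSV holds only comma-separated integers:

```sql
-- Sketch for MySQL 8.0+ only: split the CSV in one statement instead
-- of the while-loop. Assumes p_cluster_id_csv contains just integers
-- separated by commas, so wrapping it in [ ] yields a valid JSON array.
insert ignore into tmp
select jt.cluster_id
from json_table(
       concat('[', p_cluster_id_csv, ']'),
       '$[*]' columns (cluster_id int unsigned path '$')
     ) as jt;
```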

Results

call sum_cluster_matches(null,1);

count_of_tmp
============
5001

id  select_type table   type    possible_keys   key key_len ref rows    Extra
==  =========== =====   ====    =============   === ======= === ====    =====
1   SIMPLE  tmp index   PRIMARY PRIMARY 4       5001    Using index; Using temporary; Using filesort
1   SIMPLE  cm  ref PRIMARY PRIMARY 4   vldb_db.tmp.cluster_id  8   

match_id    sum_tfidf   count_tfidf
========    =========   ===========
1618        387         64
1473        387         64
3307        382         64
2495        373         64
1135        373         64
3832        372         57
3203        362         58
5464        358         67
2100        355         60
1634        354         52

runtime 0.028 seconds.

The explain plan and runtime are greatly improved.

【Comments】:

  • I don't think you can leave out the auto_increment field - it's what makes every record unique
  • Well, I don't really see the point of an auto_inc PK, but if it's required you can add it to the end of the existing primary key (cluster_id, match_id, some_other_id) without changing performance
  • Thanks for the detailed answer :) I moved the database to another machine, since the load during working hours was causing it to swap to disk. It now averages 0.4s - 0.2s on the first call and 0.006s on the second query. Guess I need to start looking at mysqladmin variables
【Solution 2】:

If the cluster_index values in the WHERE condition are consecutive, then use this instead of IN:

WHERE (cluster_index >= 1) and (cluster_index <= 3000)

If the values are not consecutive, you can create a temporary table holding the cluster_index values, with an index on it, and use an INNER JOIN against the temporary table.
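A minimal sketch of that temporary-table approach, reusing the table name from the question (the helper table wanted_clusters is hypothetical):

```sql
-- Hypothetical helper table holding the 3,000 wanted ids; the primary
-- key lets the join drive indexed lookups into the big table.
CREATE TEMPORARY TABLE wanted_clusters
(
  cluster_index INT UNSIGNED NOT NULL PRIMARY KEY
);

INSERT INTO wanted_clusters VALUES (1), (2), (3); -- ... the rest of the ids

SELECT cm.match_index, SUM(cm.tfidf) AS total
FROM Test4_ClusterMatches cm
INNER JOIN wanted_clusters w ON w.cluster_index = cm.cluster_index
GROUP BY cm.match_index;
```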

【Comments】:
