MySQL GROUP BY slows down query x1000 times

【Question title】: MySQL GROUP BY slows down query x1000 times
【Posted】: 2021-02-13 23:05:43
【Question description】:

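I am struggling to set up proper, efficient indexes for a Django application that uses a MySQL database. The problem is the article table, which currently has slightly more than 1 million rows, and queries against it are not as fast as we would like.

The article table structure looks roughly like this: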

Field   Type
id  int
date_published  datetime(6) 
date_retrieved  datetime(6) 
title   varchar(500)    
author  varchar(200)    
content longtext    
source_id   int
online  tinyint(1)
main_article_of_duplicate_group tinyint(1)

After many attempts, I found that the following index gives the best performance:

CREATE INDEX search_index ON newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online);

The problematic query is:

SELECT 
    `newsarticle`.`id`,
    `newsarticle`.`url`,
    `newsarticle`.`date_published`,
    `newsarticle`.`date_retrieved`,
    `newsarticle`.`title`,
    `newsarticle`.`summary_provided`,
    `newsarticle`.`summary_generated`,
    `newsarticle`.`source_id`,
    COUNT(CASE WHEN `newsarticlefeedback`.`is_relevant` THEN `newsarticlefeedback`.`id` ELSE NULL END) AS `count_relevent`,
    COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`,
    (
      SELECT U0.`is_relevant`
      FROM `newsarticlefeedback` U0
      WHERE (U0.`news_id_id` = `newsarticle`.`id` AND U0.`user_id_id` = 27)
      ORDER BY U0.`created_date` DESC
      LIMIT 1
    ) AS `is_relevant`,
    CASE
        WHEN `newsarticle`.`content` = '' THEN 0
        ELSE 1
    END AS `is_content`,
    `newsproviders_newsprovider`.`id`,
    `newsproviders_newsprovider`.`name_long`
FROM
    `newsarticle` USE INDEX (SEARCH_INDEX)
        INNER JOIN
    `newsarticle_topics` ON (`newsarticle`.`id` = `newsarticle_topics`.`newsarticle_id`)
        LEFT OUTER JOIN
    `newsarticlefeedback` ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)
        LEFT OUTER JOIN
    `newsproviders_newsprovider` ON (`newsarticle`.`source_id` = `newsproviders_newsprovider`.`id`)
WHERE
    ((1)
        AND `newsarticle`.`main_article_of_duplicate_group`
        AND `newsarticle`.`online`
        AND `newsarticle_topics`.`newstopic_id` = 42
        AND `newsarticle`.`date_published` >= '2020-08-08 08:39:03.199488')
GROUP BY `newsarticle`.`id`
ORDER BY `newsarticle`.`date_published` DESC
LIMIT 30

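Note: I have to name the index explicitly with USE INDEX, otherwise the query is much slower. This query takes about 1.4 seconds.

But when I remove only the GROUP BY clause, the query takes 1-10 ms. I tried adding the newsarticle id to the index at various positions, but with no luck.

Here is the EXPLAIN output (from Django):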

ID  SELECT_TYPE TABLE   PARTITIONS  TYPE    POSSIBLE_KEYS   KEY KEY_LEN REF ROWS    FILTERED    EXTRA
1   PRIMARY newsarticle_topics  None    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk_summarize   newsartic_newstopic_id_ddd996b6_fk_summarize    4   const   312628  100.0   Using temporary; Using filesort
1   PRIMARY newsarticle None    eq_ref  PRIMARY,newsartic_source_id_6ea2b978_fk_summarize,newsartic_topic_id_b67ae2c9_fk_summarize,kek,last_updated,last_update,search_index,fulltext_idx_content   PRIMARY 4   newstech.newsarticle_topics.newsarticle_id  1   22.69   Using where
1   PRIMARY newsarticlefeedback None    ref newsartic_news_id_id_5af7594b_fk_summarize  newsartic_news_id_id_5af7594b_fk_summarize  5   newstech.newsarticle_topics.newsarticle_id  1   100.0   None
1   PRIMARY newsproviders_newsprovider  None    eq_ref  PRIMARY,    PRIMARY 4   newstech.newsarticle.source_id  1   100.0   None
2   DEPENDENT SUBQUERY  U0  None    ref newsartic_news_id_id_5af7594b_fk_summarize,newsartic_user_id_id_fc217cfe_fk_auth_user   newsartic_user_id_id_fc217cfe_fk_auth_user  5   const   1   10.0    Using where; Using filesort

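Interestingly, the same query gives different EXPLAIN output in MySQL Workbench and in Django Debug Toolbar (I can paste the EXPLAIN from Workbench too if you like). Performance is more or less the same, though. Do you have any idea how to improve the index so that the search is fast?

Thanks

EDIT: Here is the EXPLAIN from MySQL Workbench. It is different but seems more realistic (I don't know why the Django Debug Toolbar EXPLAIN differs).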

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle NULL    range   PRIMARY,newsartic_source_id_6ea2b978_fk_,newsartic_topic_id_b67ae2c9_fk,kek,last_updated,last_update,search_index,fulltext_idx_content  search_index    8   NULL    227426  81.00   Using index condition; Using MRR; Using temporary; Using filesort
1   PRIMARY newsarticle_topics  NULL    eq_ref  newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq    8   newstech.newsarticle.id,const   1   100.00  Using index
1   PRIMARY newsarticlefeedback NULL    ref newsartic_news_id_id_5af7594b_fk    newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   100.00  NULL
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,newsartic_user_id_id_fc217cfe_fk_auth_user newsartic_user_id_id_fc217cfe_fk_auth_user  5   const   1   10.00   Using where; Using filesort

EDIT 2: Below is the EXPLAIN when I remove GROUP BY from the query (using MySQL Workbench):

id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,SIMPLE,newsarticle,NULL,range,search_index,search_index,8,NULL,227426,81.00,"Using index condition"
1,SIMPLE,newsarticle_topics,NULL,eq_ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk",newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,8,"newstech.newsarticle.id,const",1,100.00,"Using index"
1,SIMPLE,newsarticlefeedback,NULL,ref,newsartic_news_id_id_5af7594b_fk,newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,100.00,"Using index"
1,SIMPLE,newsproviders_newsprovider,NULL,eq_ref,"PRIMARY,",PRIMARY,4,newstech.newsarticle.source_id,1,100.00,NULL

EDIT 3:

After applying the changes Rick suggested (thanks!):

An index on newsarticle(id, online, main_article_of_duplicate_group, date_published), plus two indexes on newsarticle_topics: (newstopic_id, newsarticle_id) and (newsarticle_id, newstopic_id).

WITH USE INDEX (takes 1.2 seconds)

EXPLAIN:

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle_topics  NULL    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   opposite    4   const   346286  100.00  Using index; Using temporary; Using filesort
1   PRIMARY newsarticle NULL    ref search_index    search_index    4   newstech.newsarticle_topics.newsarticle_id  1   27.00   Using index condition
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   10.00   Using where
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   90.00   Using where

WITHOUT the USE INDEX clause (takes 2.6 seconds)

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle_topics  NULL    ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   opposite    4   const   346286  100.00  Using index; Using temporary; Using filesort
1   PRIMARY newsarticle NULL    eq_ref  PRIMARY,search_index    PRIMARY 4   newstech.newsarticle_topics.newsarticle_id  1   27.00   Using where
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   10.00   Using where
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index newsartic_news_id_id_5af7594b_fk    5   newstech.newsarticle.id 1   90.00   Using where

For comparison, the index newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online) with USE INDEX (takes only 1-3 ms!)

id  select_type table   partitions  type    possible_keys   key key_len ref rows    filtered    Extra
1   PRIMARY newsarticle NULL    range   search_index    search_index    8   NULL    238876  81.00   Using index condition
1   PRIMARY newsproviders_newsprovider  NULL    eq_ref  PRIMARY,filter_index    PRIMARY 4   newstech.newsarticle.source_id  1   100.00  NULL
1   PRIMARY newsarticle_topics  NULL    eq_ref  newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite   newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq    8   newstech.newsarticle.id,const   1   100.00  Using index
4   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   100.00  Using filesort
3   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  6   newstech.newsarticle.id,const   1   100.00  Using index
2   DEPENDENT SUBQUERY  U0  NULL    ref newsartic_news_id_id_5af7594b_fk,feedback_index feedback_index  5   newstech.newsarticle.id 1   90.00   Using where; Using index

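【Question discussion】:

Is the query without GROUP BY fast only because you fetch just the first 30 results? How fast is it to get all the results?
@AndrewSayer If I also remove the LIMIT clause, the "duration" is still a few milliseconds, but the "fetch" takes about 5 seconds (about 35k rows are returned).
The first line of the plan says it is searching the newsarticle_topics table and looking at more than 300k rows using a temporary table. That will be slow. How is that table defined and indexed?
A GROUP BY needs to sort the results in order to perform the aggregation. What is the EXPLAIN without the GROUP BY? date_published should be the last column in your search_index index. Avoid forcing indexes.
Your current query should fail, please edit and fix it: you select some columns from newsarticle that you neither group by nor aggregate. As for indexing, you need to follow the golden rule of indexes: as you add columns from left to right, a column can be used to access the index efficiently only when the columns to its left are filtered with equality conditions. That means the sensible choice is newstopic_id first, then date_published. If your NOT NULL filter columns really help reduce the amount of data read from the table, you can include them afterwards.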

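Tags: mysql sql django indexing query-performance

【Solution 1】:

In the end, I figured out what was wrong with this query. First of all, Django adds a GROUP BY clause automatically when Count is used in an annotation. So the simplest solution is to avoid it by using nested annotations.

This is explained very well in the answer at https://stackoverflow.com/a/43771738/4464554

Thanks to everyone for your time and help :)

【Discussion】: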

【Solution 2】:

Is main_article_of_duplicate_group a true/false flag?

If the Optimizer chooses to start with newsarticle_topics:

 newsarticle_topics:  INDEX(newstopic_id, newsarticle_id)
 newsarticle:  INDEX(id, online,
                     main_article_of_duplicate_group, date_published)

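If newsarticle_topics is a many-to-many mapping table, get rid of id and make the PRIMARY KEY that pair of columns, plus a secondary index in the opposite direction. More discussion: http://mysql.rjweb.org/doc.php/index_cookbook_mysql#many_to_many_mapping_table

If the Optimizer chooses to start with newsarticle (which seems more likely):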

 newsarticle_topics:  INDEX(newsarticle_id, newstopic_id)
 newsarticle:  INDEX(online, main_article_of_duplicate_group, date_published)

Meanwhile, newsarticlefeedback needs this, with the columns in the order given:

INDEX(news_id_id, user_id_id, created_date, is_relevant)
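
To illustrate why the column order matters, here is a minimal sketch of the "latest vote by one user" subquery from the question running against an index with that column order. It uses Python's built-in sqlite3 only so that it runs anywhere; the sample rows and the index name feedback_index are assumptions, and SQLite's planner is not MySQL's, so treat the plan output as indicative rather than as what MySQL will report.

```python
import sqlite3

# SQLite is only a portable stand-in here. Column names come from the question;
# the sample rows and the index name are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE newsarticlefeedback (
        id           INTEGER PRIMARY KEY,
        news_id_id   INTEGER,
        user_id_id   INTEGER,
        created_date TEXT,
        is_relevant  INTEGER
    );
    -- Equality columns first, then the ORDER BY column, then the selected column.
    CREATE INDEX feedback_index
        ON newsarticlefeedback (news_id_id, user_id_id, created_date, is_relevant);
""")
conn.executemany(
    "INSERT INTO newsarticlefeedback "
    "(news_id_id, user_id_id, created_date, is_relevant) VALUES (?, ?, ?, ?)",
    [
        (1, 27, "2021-01-01", 0),
        (1, 27, "2021-02-01", 1),  # newest vote by user 27 on article 1
        (1, 99, "2021-03-01", 0),  # newer, but from a different user
        (2, 27, "2021-01-15", 0),
    ],
)

LATEST_VOTE = """
    SELECT U0.is_relevant
    FROM newsarticlefeedback U0
    WHERE U0.news_id_id = ? AND U0.user_id_id = ?
    ORDER BY U0.created_date DESC
    LIMIT 1
"""

latest_vote = conn.execute(LATEST_VOTE, (1, 27)).fetchone()[0]

# The last column of each EXPLAIN QUERY PLAN row is a human-readable description.
plan_details = [
    row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + LATEST_VOTE, (1, 27))
]

print(latest_vote)
print(plan_details)
```

With the two equality columns first and created_date next, the engine can jump straight to the newest matching entry and read is_relevant from the index itself, which is the effect the suggested index aims for in MySQL as well.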

Instead of

    COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`,
    LEFT OUTER JOIN  `newsarticlefeedback`
          ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)

use

    ( SELECT COUNT(*) FROM newsarticlefeedback
          WHERE `newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`
    ) AS `count_nonrelevent`,
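
As a sanity check on this rewrite, the following sketch compares the JOIN plus GROUP BY form with the correlated-subquery form on a tiny data set. It again uses Python's sqlite3 as a stand-in for MySQL, with reduced tables and made-up rows, so it shows only that the two forms return the same counts, not how MySQL will execute them.

```python
import sqlite3

# Reduced versions of the question's tables; the rows are illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE newsarticle (id INTEGER PRIMARY KEY, date_published TEXT);
    CREATE TABLE newsarticlefeedback (
        id INTEGER PRIMARY KEY, news_id_id INTEGER, is_relevant INTEGER
    );
    INSERT INTO newsarticle VALUES
        (1, '2021-01-01'), (2, '2021-01-02'), (3, '2021-01-03');
    INSERT INTO newsarticlefeedback (news_id_id, is_relevant) VALUES
        (1, 1), (1, 0), (1, 1), (3, 0);
""")

# Original shape: the join multiplies article rows, so GROUP BY is needed
# to collapse them again before ORDER BY and LIMIT can be applied.
WITH_GROUP_BY = """
    SELECT a.id, COUNT(f.id) AS feedback_count
    FROM newsarticle a
    LEFT OUTER JOIN newsarticlefeedback f ON a.id = f.news_id_id
    GROUP BY a.id
    ORDER BY a.date_published DESC
"""

# Rewritten shape: one row per article from the start, so no GROUP BY.
WITH_SUBQUERY = """
    SELECT a.id,
           (SELECT COUNT(*) FROM newsarticlefeedback f
             WHERE f.news_id_id = a.id) AS feedback_count
    FROM newsarticle a
    ORDER BY a.date_published DESC
"""

grouped = conn.execute(WITH_GROUP_BY).fetchall()
correlated = conn.execute(WITH_SUBQUERY).fetchall()

print(grouped)
print(correlated)
```

Because the second form never produces more than one row per article, the server has no grouping step to finish before it can apply ORDER BY ... LIMIT 30, which matches the asker's observation that removing GROUP BY brings the query back to milliseconds.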

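【Discussion】:

Yes, main_article_of_duplicate_group is a boolean flag. I applied the indexes for newsarticle and newsarticle_topics, and after I removed the USE INDEX clause the Optimizer starts with newsarticle_topics. It works very well, except for topics that are associated with a large number of articles (those take about 2 seconds).
1 PRIMARY newsarticle_topics None ref newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite opposite 4 const 346286 100.0 Using index; Using temporary; Using filesort
1 PRIMARY newsarticle None eq_ref PRIMARY,search_index PRIMARY 4 newstech.newsarticle_topics.newsarticle_id 1 27.0 Using where
Unfortunately Django does not support composite primary keys. And yes, the biggest performance gain came from the change you suggested for the newsarticlefeedback table :)
@Bob - Thanks for the feedback. Which change helped with newsarticlefeedback? The composite index or the reformulation (or both)?
The main effect came from the reformulation, because it removed the GROUP BY, but the index helped too :) Changing only these two things made my old USE INDEX clause and old index work fast. Do you think your proposal (the indexes for newsarticle and newsarticle_topics) could be adjusted somehow to get rid of USE INDEX? Thanks
@Bob - Quite possibly. The test is "trivial": run EXPLAIN SELECT ... with and without USE INDEX(). If the output is the same, the USE should be removed. If they differ, show both; I may (or may not) have further suggestions.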

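【Solution 3】:

I happen to have a technique that works well for "news articles" that are categorized, filtered, and ordered by date. It can even handle "embargo", "expired", soft "delete", etc.

The big goal is to touch only 30 rows when performing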

ORDER BY `newsarticle`.`date_published` DESC
LIMIT 30

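But currently the WHERE clause must look at two tables to do the filtering. That leads to touching 35K or more rows.

The technique requires building a simple table on the side with 3 columns:

topic (or other filtering category),
date (to fetch only the latest 30),
article_id (to do only 30 JOINs to get the rest of the article info)

Suitable indexing on that table makes the search very efficient.

With suitable DELETEs in this table, simple flags such as online or main_article can be handled efficiently. Don't include the flags in this extra table; instead, leave out any rows that should not be shown.

(I have watched other "news" sites melt down because they did not use this technique.)

Note that the difference between 30 and 35K is about 1000x.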
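
A rough sketch of the side-table idea described above, under several assumptions: the table name topic_feed, its index name, and the sample data are all hypothetical, and Python's sqlite3 stands in for MySQL so the example is runnable as is. The answer does not give a schema, so this is one possible reading of it, not the author's exact design.

```python
import sqlite3
from datetime import date, timedelta

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE newsarticle (
        id INTEGER PRIMARY KEY, title TEXT, date_published TEXT, online INTEGER
    );
    -- Hypothetical side table: one row per (topic, article that may be shown).
    CREATE TABLE topic_feed (
        newstopic_id   INTEGER NOT NULL,
        date_published TEXT    NOT NULL,
        article_id     INTEGER NOT NULL
    );
    -- Filter column first, then the sort column, then the id to join on.
    CREATE INDEX topic_feed_idx
        ON topic_feed (newstopic_id, date_published, article_id);
""")

# Illustrative data: even ids belong to topic 42, every tenth article is offline.
for i in range(1, 101):
    published = (date(2020, 1, 1) + timedelta(days=i)).isoformat()
    online = i % 10 != 0
    topic = 42 if i % 2 == 0 else 7
    conn.execute(
        "INSERT INTO newsarticle VALUES (?, ?, ?, ?)",
        (i, f"Article {i}", published, int(online)),
    )
    if online:  # offline articles simply get no row in the side table
        conn.execute("INSERT INTO topic_feed VALUES (?, ?, ?)", (topic, published, i))


def latest_article_ids(topic_id, limit):
    """Pick the newest ids from the side table, then join only those rows."""
    rows = conn.execute(
        """
        SELECT a.id
        FROM (SELECT article_id, date_published
                FROM topic_feed
               WHERE newstopic_id = ?
               ORDER BY date_published DESC
               LIMIT ?) AS latest
        JOIN newsarticle a ON a.id = latest.article_id
        ORDER BY latest.date_published DESC
        """,
        (topic_id, limit),
    )
    return [row[0] for row in rows]


before = latest_article_ids(42, 3)
# Taking an article offline becomes a DELETE from the side table.
conn.execute("DELETE FROM topic_feed WHERE article_id = ?", (98,))
after = latest_article_ids(42, 3)

print(before, after)
```

The limit is 3 here only to keep the output short; the thread's query would use 30. The application has to keep the side table in sync whenever an article is published, re-dated, or hidden.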

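【Discussion】:

Hmm, this is interesting. I will try it. Thanks for taking the time to help me here :)
Hey, I just wanted to let you know that your technique is really great and runs very fast. Thank you very much!
@Bob - Thanks for the feedback. You implemented it quickly. I guess that also says my documentation is clear.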