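【Posted】:2021-02-13 23:05:43
【Question】:
I am struggling to set up correct, efficient indexes for a Django application that uses a MySQL database. The problem is the article table, which currently has slightly more than 1 million rows; queries against it are not as fast as we would like.
The article table structure is roughly as follows: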
Field                            Type
id                               int
date_published                   datetime(6)
date_retrieved                   datetime(6)
title                            varchar(500)
author                           varchar(200)
content                          longtext
source_id                        int
online                           tinyint(1)
main_article_of_duplicate_group  tinyint(1)
After many attempts, I found that the following index gives the best performance:
CREATE INDEX search_index ON newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online);
The problematic query is:
SELECT
    `newsarticle`.`id`,
    `newsarticle`.`url`,
    `newsarticle`.`date_published`,
    `newsarticle`.`date_retrieved`,
    `newsarticle`.`title`,
    `newsarticle`.`summary_provided`,
    `newsarticle`.`summary_generated`,
    `newsarticle`.`source_id`,
    COUNT(CASE WHEN `newsarticlefeedback`.`is_relevant` THEN `newsarticlefeedback`.`id` ELSE NULL END) AS `count_relevent`,
    COUNT(`newsarticlefeedback`.`id`) AS `count_nonrelevent`,
    (
        SELECT U0.`is_relevant`
        FROM `newsarticlefeedback` U0
        WHERE (U0.`news_id_id` = `newsarticle`.`id` AND U0.`user_id_id` = 27)
        ORDER BY U0.`created_date` DESC
        LIMIT 1
    ) AS `is_relevant`,
    CASE
        WHEN `newsarticle`.`content` = '' THEN 0
        ELSE 1
    END AS `is_content`,
    `newsproviders_newsprovider`.`id`,
    `newsproviders_newsprovider`.`name_long`
FROM
    `newsarticle` USE INDEX (SEARCH_INDEX)
    INNER JOIN
    `newsarticle_topics` ON (`newsarticle`.`id` = `newsarticle_topics`.`newsarticle_id`)
    LEFT OUTER JOIN
    `newsarticlefeedback` ON (`newsarticle`.`id` = `newsarticlefeedback`.`news_id_id`)
    LEFT OUTER JOIN
    `newsproviders_newsprovider` ON (`newsarticle`.`source_id` = `newsproviders_newsprovider`.`id`)
WHERE
    ((1)
    AND `newsarticle`.`main_article_of_duplicate_group`
    AND `newsarticle`.`online`
    AND `newsarticle_topics`.`newstopic_id` = 42
    AND `newsarticle`.`date_published` >= '2020-08-08 08:39:03.199488')
GROUP BY `newsarticle`.`id`
ORDER BY `newsarticle`.`date_published` DESC
LIMIT 30
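Note: I have to specify the index explicitly (USE INDEX); otherwise the query is much slower. This query takes about 1.4 seconds.
However, when I remove only the GROUP BY clause, the query takes 1-10 milliseconds. I tried adding the newsarticle id to the index at various positions, but with no luck.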
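Since the query is fast without GROUP BY, one direction that may be worth testing is to compute the feedback counts in correlated subqueries, so the outer query no longer joins `newsarticlefeedback` and no longer needs GROUP BY. The sketch below only checks that the two query shapes return the same rows. It uses a trimmed, hypothetical schema and sample data, the alias `count_feedback` is made up for illustration, and it runs on SQLite through Python's `sqlite3` module as a stand-in, so it says nothing about how MySQL will plan or time either form.

```python
import sqlite3

# In-memory SQLite database with a trimmed, hypothetical version of the schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE newsarticle (id INTEGER PRIMARY KEY, date_published TEXT, title TEXT);
CREATE TABLE newsarticlefeedback (
    id INTEGER PRIMARY KEY, news_id_id INTEGER, is_relevant INTEGER);
INSERT INTO newsarticle VALUES
    (1, '2021-01-03', 'a'), (2, '2021-01-02', 'b'), (3, '2021-01-01', 'c');
INSERT INTO newsarticlefeedback VALUES
    (1, 1, 1), (2, 1, 0), (3, 1, 1), (4, 2, 0);
""")

# Shape of the original query: the LEFT JOIN yields one row per feedback entry,
# so GROUP BY is needed to collapse them back to one row per article.
grouped = conn.execute("""
    SELECT a.id,
           COUNT(CASE WHEN f.is_relevant THEN f.id END) AS count_relevant,
           COUNT(f.id) AS count_feedback
    FROM newsarticle a
    LEFT JOIN newsarticlefeedback f ON f.news_id_id = a.id
    GROUP BY a.id
    ORDER BY a.date_published DESC
    LIMIT 30
""").fetchall()

# Rewritten shape: each count is a correlated subquery, so the outer query
# stays at one row per article and needs no GROUP BY.
correlated = conn.execute("""
    SELECT a.id,
           (SELECT COUNT(*) FROM newsarticlefeedback f
             WHERE f.news_id_id = a.id AND f.is_relevant) AS count_relevant,
           (SELECT COUNT(*) FROM newsarticlefeedback f
             WHERE f.news_id_id = a.id) AS count_feedback
    FROM newsarticle a
    ORDER BY a.date_published DESC
    LIMIT 30
""").fetchall()

print(grouped == correlated)
```

Whether this helps on the real tables depends on the MySQL optimizer and on an index covering `newsarticlefeedback(news_id_id, ...)`, so it would need to be measured with EXPLAIN on the actual data.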
Here is the EXPLAIN output (from Django):
ID,SELECT_TYPE,TABLE,PARTITIONS,TYPE,POSSIBLE_KEYS,KEY,KEY_LEN,REF,ROWS,FILTERED,EXTRA
1,PRIMARY,newsarticle_topics,None,ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk_summarize",newsartic_newstopic_id_ddd996b6_fk_summarize,4,const,312628,100.0,"Using temporary; Using filesort"
1,PRIMARY,newsarticle,None,eq_ref,"PRIMARY,newsartic_source_id_6ea2b978_fk_summarize,newsartic_topic_id_b67ae2c9_fk_summarize,kek,last_updated,last_update,search_index,fulltext_idx_content",PRIMARY,4,newstech.newsarticle_topics.newsarticle_id,1,22.69,"Using where"
1,PRIMARY,newsarticlefeedback,None,ref,newsartic_news_id_id_5af7594b_fk_summarize,newsartic_news_id_id_5af7594b_fk_summarize,5,newstech.newsarticle_topics.newsarticle_id,1,100.0,None
1,PRIMARY,newsproviders_newsprovider,None,eq_ref,"PRIMARY,",PRIMARY,4,newstech.newsarticle.source_id,1,100.0,None
2,DEPENDENT SUBQUERY,U0,None,ref,"newsartic_news_id_id_5af7594b_fk_summarize,newsartic_user_id_id_fc217cfe_fk_auth_user",newsartic_user_id_id_fc217cfe_fk_auth_user,5,const,1,10.0,"Using where; Using filesort"
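Interestingly, the same query gives different EXPLAIN output in MySQL Workbench and in Django Debug Toolbar (I can also paste the EXPLAIN from Workbench if you like). Performance is more or less the same, though. Do you have any idea how to improve the indexes so that the search is fast?
Thanks
Edit: Here is the EXPLAIN from MySQL Workbench. It is different but seems closer to reality (I don't know why the Django Debug Toolbar EXPLAIN differs).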
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,newsarticle,NULL,range,"PRIMARY,newsartic_source_id_6ea2b978_fk_,newsartic_topic_id_b67ae2c9_fk,kek,last_updated,last_update,search_index,fulltext_idx_content",search_index,8,NULL,227426,81.00,"Using index condition; Using MRR; Using temporary; Using filesort"
1,PRIMARY,newsarticle_topics,NULL,eq_ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk",newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,8,"newstech.newsarticle.id,const",1,100.00,"Using index"
1,PRIMARY,newsarticlefeedback,NULL,ref,newsartic_news_id_id_5af7594b_fk,newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,100.00,NULL
1,PRIMARY,newsproviders_newsprovider,NULL,eq_ref,PRIMARY,PRIMARY,4,newstech.newsarticle.source_id,1,100.00,NULL
2,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,newsartic_user_id_id_fc217cfe_fk_auth_user",newsartic_user_id_id_fc217cfe_fk_auth_user,5,const,1,10.00,"Using where; Using filesort"
Edit 2: Below is the EXPLAIN when I remove GROUP BY from the query (using MySQL Workbench):
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,SIMPLE,newsarticle,NULL,range,search_index,search_index,8,NULL,227426,81.00,"Using index condition"
1,SIMPLE,newsarticle_topics,NULL,eq_ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,newsartic_newstopic_id_ddd996b6_fk",newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,8,"newstech.newsarticle.id,const",1,100.00,"Using index"
1,SIMPLE,newsarticlefeedback,NULL,ref,newsartic_news_id_id_5af7594b_fk,newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,100.00,"Using index"
1,SIMPLE,newsproviders_newsprovider,NULL,eq_ref,"PRIMARY,",PRIMARY,4,newstech.newsarticle.source_id,1,100.00,NULL
Edit 3:
After applying the changes Rick suggested (thank you!):
an index on newsarticle(id, online, main_article_of_duplicate_group, date_published), and two indexes on newsarticle_topics: (newstopic_id, newsarticle_id) and (newsarticle_id, newstopic_id)
With USE INDEX (takes 1.2 seconds)
EXPLAIN:
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,newsarticle_topics,NULL,ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite",opposite,4,const,346286,100.00,"Using index; Using temporary; Using filesort"
1,PRIMARY,newsarticle,NULL,ref,search_index,search_index,4,newstech.newsarticle_topics.newsarticle_id,1,27.00,"Using index condition"
1,PRIMARY,newsproviders_newsprovider,NULL,eq_ref,"PRIMARY,filter_index",PRIMARY,4,newstech.newsarticle.source_id,1,100.00,NULL
4,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,feedback_index",feedback_index,5,newstech.newsarticle.id,1,100.00,"Using filesort"
3,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,feedback_index",newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,10.00,"Using where"
2,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,feedback_index",newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,90.00,"Using where"
Without the USE INDEX clause (takes 2.6 seconds)
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,newsarticle_topics,NULL,ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite",opposite,4,const,346286,100.00,"Using index; Using temporary; Using filesort"
1,PRIMARY,newsarticle,NULL,eq_ref,"PRIMARY,search_index",PRIMARY,4,newstech.newsarticle_topics.newsarticle_id,1,27.00,"Using where"
1,PRIMARY,newsproviders_newsprovider,NULL,eq_ref,"PRIMARY,filter_index",PRIMARY,4,newstech.newsarticle.source_id,1,100.00,NULL
4,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,feedback_index",feedback_index,5,newstech.newsarticle.id,1,100.00,"Using filesort"
3,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,feedback_index",newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,10.00,"Using where"
2,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,feedback_index",newsartic_news_id_id_5af7594b_fk,5,newstech.newsarticle.id,1,90.00,"Using where"
For comparison, the index newsarticle(date_published DESC, main_article_of_duplicate_group, source_id, online) with USE INDEX (takes only 1-3 milliseconds!)
id,select_type,table,partitions,type,possible_keys,key,key_len,ref,rows,filtered,Extra
1,PRIMARY,newsarticle,NULL,range,search_index,search_index,8,NULL,238876,81.00,"Using index condition"
1,PRIMARY,newsproviders_newsprovider,NULL,eq_ref,"PRIMARY,filter_index",PRIMARY,4,newstech.newsarticle.source_id,1,100.00,NULL
1,PRIMARY,newsarticle_topics,NULL,eq_ref,"newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,opposite",newsarticle_t_newsarticle_id_newstopic_6b1123b3_uniq,8,"newstech.newsarticle.id,const",1,100.00,"Using index"
4,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,feedback_index",feedback_index,5,newstech.newsarticle.id,1,100.00,"Using filesort"
3,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,feedback_index",feedback_index,6,"newstech.newsarticle.id,const",1,100.00,"Using index"
2,DEPENDENT SUBQUERY,U0,NULL,ref,"newsartic_news_id_id_5af7594b_fk,feedback_index",feedback_index,5,newstech.newsarticle.id,1,90.00,"Using where; Using index"
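【Comments】:
-
Is the query without GROUP BY fast only because you fetch just the first 30 results? How fast is it to get all the results?
-
@AndrewSayer If I also remove the LIMIT clause, the "duration" is still a few milliseconds, but the "fetch" takes about 5 seconds (returning ~35k rows).
-
The first row of the plan shows it searching the newsarticle_topics table and examining more than 300k rows with a temporary table. That will be slow. How is that table defined and indexed?
-
A GROUP BY needs to sort the results in order to perform the aggregation. What does the EXPLAIN look like without GROUP BY? date_published should come last in your search_index index. Avoid forcing an index.
-
Your current query should fail; please edit and fix it: you are not grouping by some of the newsarticle columns that you are not aggregating. Index-wise, you need to follow the golden rule of indexing: as you add columns from left to right, a column can only be used for efficient index access if the columns to its left are filtered with equality conditions. That means the sensible choice would be to put newstopic_id first, then date_published. If your not null filter columns are really useful for reducing the amount of data read from the table, you can include them afterwards.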
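The "equality columns first, range column last" rule from the last comment can be illustrated with a toy model in which a composite index is just a sorted list of tuples. This is a sketch under stated assumptions, not a description of MySQL internals: the rows are made-up sample data, and it pretends `newstopic_id` and `date_published` sit in the same table, whereas in the posted schema `newstopic_id` belongs to `newsarticle_topics` and `date_published` to `newsarticle`.

```python
from bisect import bisect_left

# Hypothetical (topic_id, date_published) pairs standing in for table rows.
rows = [(42, "2020-07-01"), (42, "2020-09-01"), (42, "2020-10-01"),
        (7, "2020-09-15"), (7, "2020-11-01"), (13, "2020-08-20")]

# Predicate being modelled: topic_id = 42 AND date_published >= '2020-08-08'.

# Index A: equality column first, range column second.
# All matching entries form one contiguous slice of the sorted list.
index_a = sorted(rows)
lo = bisect_left(index_a, (42, "2020-08-08"))
hi = bisect_left(index_a, (43, ""))
scanned_a = index_a[lo:hi]

# Index B: range column first. The date range selects a slice that still
# mixes all topics, so every entry in it must be checked against topic_id.
index_b = sorted((date, topic) for topic, date in rows)
start = bisect_left(index_b, ("2020-08-08", 0))
scanned_b = index_b[start:]
kept_b = [entry for entry in scanned_b if entry[1] == 42]

print(len(scanned_a), "entries read with (topic_id, date_published)")
print(len(scanned_b), "entries read with (date_published, topic_id),",
      len(kept_b), "kept")
```

Both orderings return the same matching rows; the difference is how many index entries have to be read and discarded along the way.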
Tags: mysql sql django indexing query-performance