[Posted]: 2015-05-18 13:24:31
[Question]:
We have a table in MySQL with about 30 million records. Here is the table structure:
CREATE TABLE `campaign_logs` (
`domain` varchar(50) DEFAULT NULL,
`campaign_id` varchar(50) DEFAULT NULL,
`subscriber_id` varchar(50) DEFAULT NULL,
`message` varchar(21000) DEFAULT NULL,
`log_time` datetime DEFAULT NULL,
`log_type` varchar(50) DEFAULT NULL,
`level` varchar(50) DEFAULT NULL,
`campaign_name` varchar(500) DEFAULT NULL,
KEY `subscriber_id_index` (`subscriber_id`),
KEY `log_type_index` (`log_type`),
KEY `log_time_index` (`log_time`),
KEY `campid_domain_logtype_logtime_subid_index` (`campaign_id`,`domain`,`log_type`,`log_time`,`subscriber_id`),
KEY `domain_logtype_logtime_index` (`domain`,`log_type`,`log_time`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
Here is my query.
I am doing a UNION ALL instead of using the IN operator:
SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
count(DISTINCT subscriber_id) AS COUNT,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_OPENED'
AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date
UNION ALL
SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
COUNT(DISTINCT subscriber_id) AS COUNT,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_SENT'
AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date
UNION ALL
SELECT log_type,
DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
COUNT(DISTINCT subscriber_id) AS COUNT,
COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX(campid_domain_logtype_logtime_subid_index)
WHERE DOMAIN='xxx'
AND campaign_id='123'
AND log_type = 'EMAIL_CLICKED'
AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30') AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_date;
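For reference, the three UNION ALL branches differ only in `log_type`, so the same report can be sketched as a single pass with an IN list and `log_type` added to the GROUP BY. This is an untested sketch; whether it actually beats the UNION ALL form depends on how MySQL walks the composite index for the range:

```sql
-- Sketch: one index range scan instead of three, assuming the same
-- composite index; GROUP BY log_type keeps the three types separate.
SELECT log_type,
       DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
       COUNT(DISTINCT subscriber_id) AS `COUNT`,
       COUNT(subscriber_id) AS total
FROM stats.campaign_logs USE INDEX (campid_domain_logtype_logtime_subid_index)
WHERE domain = 'xxx'
  AND campaign_id = '123'
  AND log_type IN ('EMAIL_OPENED', 'EMAIL_SENT', 'EMAIL_CLICKED')
  AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30')
                   AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
GROUP BY log_type, log_date;
```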
Here is my EXPLAIN output:
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
| 1 | PRIMARY | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 55074 | Using where; Using index; Using filesort |
| 2 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 330578 | Using where; Using index; Using filesort |
| 3 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 1589 | Using where; Using index; Using filesort |
| NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+------------------------------------------+
1. I changed COUNT(subscriber_id) to COUNT(*) and observed no performance gain.
2. I removed COUNT(DISTINCT subscriber_id) from the query and got a huge performance gain: I get results in about 1.5 seconds, where before it took 50 seconds to 1 minute. But I need the distinct subscriber_id count from the query.
Here is the EXPLAIN when I remove COUNT(DISTINCT subscriber_id) from the query:
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
| 1 | PRIMARY | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 55074 | Using where; Using index; Using temporary; Using filesort |
| 2 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 330578 | Using where; Using index; Using temporary; Using filesort |
| 3 | UNION | campaign_logs | range | campid_domain_logtype_logtime_subid_index | campid_domain_logtype_logtime_subid_index | 468 | NULL | 1589 | Using where; Using index; Using temporary; Using filesort |
| NULL | UNION RESULT | <union1,2,3> | ALL | NULL | NULL | NULL | NULL | NULL | |
+----+--------------+---------------+-------+-------------------------------------------+-------------------------------------------+---------+------+--------+-----------------------------------------------------------+
3. I ran the three queries separately by removing the UNION ALL. One query took 32 seconds while the others took 1.5 seconds each; but the first query processes about 350K records, while the others handle only about 2K rows.
I could fix my performance problem by leaving out COUNT(DISTINCT ...), but I need those values. Is there a way to restructure my query, add an index, or anything else, that still gets me the COUNT(DISTINCT ...) values, but much faster?
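One restructuring sometimes suggested for this pattern is to deduplicate subscribers in a derived table first, so the outer query only counts groups instead of sorting distinct values inside each group. A sketch, not tested against this schema (`per_subscriber` and `cnt` are illustrative names):

```sql
SELECT log_type,
       log_date,
       COUNT(*)  AS `COUNT`,   -- inner GROUP BY left one row per distinct subscriber
       SUM(cnt)  AS total      -- total event rows, carried up from the inner counts
FROM (
    SELECT log_type,
           DATE_FORMAT(CONVERT_TZ(log_time,'+00:00','+05:30'),'%l %p') AS log_date,
           subscriber_id,
           COUNT(*) AS cnt
    FROM stats.campaign_logs
    WHERE domain = 'xxx'
      AND campaign_id = '123'
      AND log_type IN ('EMAIL_OPENED', 'EMAIL_SENT', 'EMAIL_CLICKED')
      AND log_time BETWEEN CONVERT_TZ('2015-02-01 00:00:00','+00:00','+05:30')
                       AND CONVERT_TZ('2015-03-01 23:59:58','+00:00','+05:30')
    GROUP BY log_type, log_date, subscriber_id
) AS per_subscriber
GROUP BY log_type, log_date;
```

The idea is to trade the per-group DISTINCT sort for one larger but simpler GROUP BY that the composite index may help with.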
UPDATE: the following describes the data distribution of the table above.
For 1 domain: 1 campaign, 20 log types, 1K-200K subscribers.
The query above runs against a domain with 180K+ subscribers.
[Comments]:
- Why not AND (log_type = 'EMAIL_OPENED' OR log_type = 'EMAIL_SENT' OR log_type = 'EMAIL_CLICKED')?
- Drop all the indexes and create just one composite index on (domain, campaign_id, log_type, log_time).
- Try adding ORDER BY NULL after each GROUP BY; that may eliminate the filesort.
- Your EXPLAIN clearly shows that your composite index is being used as you intended. A few things to try: 1) change COUNT(subscriber_id) to COUNT(*) and see whether performance improves; 2) try getting rid of COUNT(DISTINCT subscriber_id) and see whether performance improves; 3) run each of the three subqueries combined by UNION ALL on its own and see whether one of them performs worse than the other two. Please update your question with the results of these tests.
- This is just my understanding of what happens inside the engine; it may spark some ideas. Your index helps find those 350K rows among the 30M quickly. The engine must then read all 350K rows to group and count them. Without DISTINCT, to GROUP them the engine sorts the 350K rows by the result of the DATE_FORMAT function, then steps through the sorted result and counts rows in whatever order they appear. When you add DISTINCT, the engine has to sort again within each group: a nested sort. Apparently this is not handled efficiently.
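The nested-sort explanation above suggests pre-grouping by subscriber before counting. A small runnable check (using sqlite3 as a stand-in for MySQL, with made-up rows mirroring the question's columns) that a two-level GROUP BY reproduces COUNT(DISTINCT ...):

```python
import sqlite3

# Tiny in-memory stand-in for the campaign_logs table.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE campaign_logs (
    domain TEXT, campaign_id TEXT, subscriber_id TEXT,
    log_time TEXT, log_type TEXT)""")
rows = [
    ("xxx", "123", "s1", "2015-02-01 01:00:00", "EMAIL_OPENED"),
    ("xxx", "123", "s1", "2015-02-01 01:30:00", "EMAIL_OPENED"),  # duplicate open by s1
    ("xxx", "123", "s2", "2015-02-01 02:00:00", "EMAIL_OPENED"),
    ("xxx", "123", "s1", "2015-02-01 01:00:00", "EMAIL_SENT"),
]
conn.executemany("INSERT INTO campaign_logs VALUES (?,?,?,?,?)", rows)

# Direct form: COUNT(DISTINCT ...) inside each group.
direct = conn.execute("""
    SELECT log_type, COUNT(DISTINCT subscriber_id), COUNT(subscriber_id)
    FROM campaign_logs
    WHERE domain = 'xxx' AND campaign_id = '123'
    GROUP BY log_type ORDER BY log_type""").fetchall()

# Rewritten form: inner GROUP BY collapses duplicates per subscriber,
# outer query counts the surviving rows and sums the per-subscriber totals.
rewritten = conn.execute("""
    SELECT log_type, COUNT(*), SUM(cnt)
    FROM (SELECT log_type, subscriber_id, COUNT(*) AS cnt
          FROM campaign_logs
          WHERE domain = 'xxx' AND campaign_id = '123'
          GROUP BY log_type, subscriber_id)
    GROUP BY log_type ORDER BY log_type""").fetchall()

print(direct)     # [('EMAIL_OPENED', 2, 3), ('EMAIL_SENT', 1, 1)]
print(rewritten)  # same as direct
```

This only shows the two forms are equivalent in result, not that the rewrite is faster on MySQL; that has to be measured with EXPLAIN on the real table.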
Tags: mysql sql aggregate-functions query-performance mysql-variables