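【Posted】:2020-12-15 03:51:30
【Question】:
We have a situation where duplicate entries have crept into our table, which holds more than 60 million entries (here, "duplicate" means that every field except the AUTO_INCREMENT index field has the same value). We suspect the table contains about 2 million duplicate entries. We want to delete these duplicates so that the earliest instance of each duplicated entry is retained.
Let me explain with an illustrative table: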
CREATE TABLE people
(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
name VARCHAR(40) NOT NULL DEFAULT '',
age INT NOT NULL DEFAULT 0,
phrase VARCHAR(40) NOT NULL DEFAULT '',
PRIMARY KEY (id)
);
INSERT INTO people(name, age, phrase) VALUES ('John Doe', 25, 'qwert'), ('William Smith', 19, 'yuiop'),
('Peter Jones', 19, 'yuiop'), ('Ronnie Arbuckle', 32, 'asdfg'), ('Ronnie Arbuckle', 32, 'asdfg'),
('Mary Evans', 18, 'hjklp'), ('Mary Evans', 18, 'hjklpd'), ('John Doe', 25, 'qwert');
SELECT * FROM people;
+----+-----------------+-----+--------+
| id | name            | age | phrase |
+----+-----------------+-----+--------+
|  1 | John Doe        |  25 | qwert  |
|  2 | William Smith   |  19 | yuiop  |
|  3 | Peter Jones     |  19 | yuiop  |
|  4 | Ronnie Arbuckle |  32 | asdfg  |
|  5 | Ronnie Arbuckle |  32 | asdfg  |
|  6 | Mary Evans      |  18 | hjklp  |
|  7 | Mary Evans      |  18 | hjklpd |
|  8 | John Doe        |  25 | qwert  |
+----+-----------------+-----+--------+
We want to delete the duplicate entries so that we get the following output:
SELECT * FROM people;
+----+-----------------+-----+--------+
| id | name            | age | phrase |
+----+-----------------+-----+--------+
|  1 | John Doe        |  25 | qwert  |
|  2 | William Smith   |  19 | yuiop  |
|  3 | Peter Jones     |  19 | yuiop  |
|  4 | Ronnie Arbuckle |  32 | asdfg  |
|  6 | Mary Evans      |  18 | hjklp  |
|  7 | Mary Evans      |  18 | hjklpd |
+----+-----------------+-----+--------+
On smaller tables, the following approach works:
CREATE TABLE people_uniq LIKE people;
INSERT INTO people_uniq SELECT * FROM people GROUP BY name, age, phrase;
DROP TABLE people;
RENAME TABLE people_uniq TO people;
SELECT * FROM people;
+----+-----------------+-----+--------+
| id | name            | age | phrase |
+----+-----------------+-----+--------+
|  1 | John Doe        |  25 | qwert  |
|  2 | William Smith   |  19 | yuiop  |
|  3 | Peter Jones     |  19 | yuiop  |
|  4 | Ronnie Arbuckle |  32 | asdfg  |
|  6 | Mary Evans      |  18 | hjklp  |
|  7 | Mary Evans      |  18 | hjklpd |
+----+-----------------+-----+--------+
Please suggest a solution that scales to tables with tens of millions of entries and many more columns. We are using MySQL version 5.6.49.
【Comments】:
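- If you first create an index on name, age, phrase, wouldn't that speed up SELECT * FROM people GROUP BY name, age, phrase? Also, you write "We want to delete these duplicates so that the earliest instance of each duplicated entry is retained", but the example for the smaller table does not necessarily keep the earliest instance of a duplicate. Is that really a necessary constraint?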
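For what it's worth, here is a rough sketch of how both points in this comment could be combined. It is an untested suggestion for a table of this size, not a benchmarked recipe. It assumes the illustrative people table from the question; the index name idx_people_dedup and the alias keep_id are made up for the example. The idea is to index the columns that define a duplicate, compute MIN(id) per group so that "earliest" is explicit rather than left to whichever row GROUP BY happens to return, and delete every row in a duplicated group whose id is larger.

```sql
-- Hypothetical index name; covers the columns that define a duplicate.
ALTER TABLE people ADD INDEX idx_people_dedup (name, age, phrase);

-- For each group that occurs more than once, keep_id is the smallest
-- (earliest) id. Rows of that group with a larger id are deleted.
DELETE p
FROM people AS p
JOIN (
    SELECT name, age, phrase, MIN(id) AS keep_id
    FROM people
    GROUP BY name, age, phrase
    HAVING COUNT(*) > 1
) AS d
  ON  p.name   = d.name
  AND p.age    = d.age
  AND p.phrase = d.phrase
WHERE p.id > d.keep_id;

-- Optional: drop the helper index once the cleanup is done.
ALTER TABLE people DROP INDEX idx_people_dedup;
```

On the sample data this removes the rows with id 5 and 8 and leaves ids 1, 2, 3, 4, 6 and 7. The equality join relies on all compared columns being NOT NULL, as they are in the example; nullable columns would need the NULL-safe <=> operator instead. On 60 million rows a single DELETE like this may run for a long time and hold locks, so running it in batches over id ranges is probably worth considering.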