Delete 1000 duplicates from a table containing 10 million records in Teradata: answers

【Question title】: Delete 1000 duplicates from a table containing 10 million records in Teradata
【Posted】: 2017-11-13 03:42:28
【Question description】:

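I came across the following question during an interview.

The table "TableA" contains 10 million records, of which nearly 1000 are duplicates. How can we delete these duplicates in the most efficient way?

Can someone provide the most efficient solution?

The solution I came up with is:

Create a temporary table: create table tmp as (select distinct * from TableA) with data

Delete the original table

Re-insert the data from tmp into TableA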
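The three steps above can be sketched end to end as follows. This is a minimal illustration, not Teradata code: it uses Python's built-in `sqlite3` module as a stand-in because it is easy to run locally, the two-column layout of `TableA` and its sample rows are made up, and "delete the original table" is read here as emptying it rather than dropping it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical stand-in for TableA, with two rows duplicated exactly.
cur.execute("CREATE TABLE TableA (id INTEGER, name TEXT)")
cur.executemany(
    "INSERT INTO TableA VALUES (?, ?)",
    [(1, "a"), (2, "b"), (2, "b"), (3, "c"), (3, "c")],
)

# Step 1: copy the distinct rows into a temporary table.
cur.execute("CREATE TEMP TABLE tmp AS SELECT DISTINCT * FROM TableA")
# Step 2: empty the original table.
cur.execute("DELETE FROM TableA")
# Step 3: reload the deduplicated rows and drop the temporary copy.
cur.execute("INSERT INTO TableA SELECT * FROM tmp")
cur.execute("DROP TABLE tmp")
conn.commit()

rows = cur.execute("SELECT id, name FROM TableA ORDER BY id").fetchall()
print(rows)
```

Note that this approach rewrites every row of the table to remove a comparatively small number of duplicates, which is the cost the question is asking how to avoid.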

【Comments】:

See stackoverflow.com/a/19549032/2527905

Tags: sql duplicates teradata query-performance

【Solution 1】:

I don't have Teradata available to try this on, but you could use something like this:

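/* untested sketch: keep the row with the highest rowid in each group of
 * identical column values and delete the rest; assumes a rowid
 * pseudo-column is available */
delete
from table
where table.rowid not in
(
select max(table.rowid)
from table
group by col1,col2,col3.....
)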
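To see what this statement does, here is a small runnable sketch. It is an assumption-laden illustration rather than Teradata code: it runs on SQLite through Python's `sqlite3` module, where every ordinary table has an implicit `rowid`, and the table `t`, its columns, and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table; SQLite gives each row an implicit rowid.
cur.execute("CREATE TABLE t (col1 INTEGER, col2 TEXT)")
cur.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [(1, "a"), (1, "a"), (2, "b"), (1, "a"), (3, "c")],
)

# Keep the highest rowid per group of identical values, delete the others.
cur.execute(
    """
    DELETE FROM t
    WHERE rowid NOT IN (
        SELECT MAX(rowid) FROM t GROUP BY col1, col2
    )
    """
)
deleted = cur.rowcount  # number of rows removed by the DELETE
conn.commit()

remaining = cur.execute("SELECT col1, col2 FROM t ORDER BY col1").fetchall()
print(deleted, remaining)
```

Unlike the copy-and-reload approach, this deletes only the surplus rows, but it depends on the database exposing some per-row identifier.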

【Comments】:

【Solution 2】:

Use the table's primary columns in the select statement, for example EmailID, Mobile, or any other unique primary key value, and try this query.

select t1.column1, t1.column2
from table1 t1
where (t1.column1, t1.column2) in (
    select t2.column1, t2.column2
    from table1 t2
    group by t2.column1, t2.column2
    having count(*) > 1
);
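Note that the query above only lists the rows whose key occurs more than once; it does not delete anything. A minimal sketch of what it returns, again using Python's `sqlite3` module as a stand-in for Teradata, with a hypothetical `table1` and made-up sample values. The second query is an added variation (not part of the answer) that reports each duplicated key once together with its number of copies.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Hypothetical table1 where (column1, column2) should have been unique.
cur.execute("CREATE TABLE table1 (column1 TEXT, column2 TEXT)")
cur.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("a", "111"), ("b", "222"), ("a", "111"),
     ("c", "333"), ("b", "222"), ("a", "111")],
)

# The query from the answer: every row that belongs to a duplicated key.
dup_rows = cur.execute(
    """
    SELECT t1.column1, t1.column2
    FROM table1 t1
    WHERE (t1.column1, t1.column2) IN (
        SELECT t2.column1, t2.column2
        FROM table1 t2
        GROUP BY t2.column1, t2.column2
        HAVING COUNT(*) > 1
    )
    ORDER BY t1.column1
    """
).fetchall()

# Variation: one line per duplicated key, with the number of copies.
dup_keys = cur.execute(
    """
    SELECT column1, column2, COUNT(*) AS copies
    FROM table1
    GROUP BY column1, column2
    HAVING COUNT(*) > 1
    ORDER BY column1
    """
).fetchall()

# Rows that would have to go to leave exactly one copy per key.
surplus = sum(copies - 1 for _, _, copies in dup_keys)
print(dup_rows, dup_keys, surplus)
```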

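【Comments】:

【Solution 3】:

Since the table is assumed to be this large, using a volatile table to hold the deduplicated records is probably the most sensible approach. It may also be the most efficient, although I haven't tested that.

Something like the following would make sense: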

/*A hypothetical very large table*/
CREATE MULTISET VOLATILE TABLE testtable
(
    f1 integer,
    f2 integer,
    f3 DATE
) 
    PRIMARY INDEX (f1, f2) 
    ON COMMIT PRESERVE ROWS;

INSERT INTO testtable VALUES (1,1, DATE '2017-01-01');
INSERT INTO testtable VALUES (1,2, DATE '2017-01-01');
INSERT INTO testtable VALUES (1,1, DATE '2017-02-01');
INSERT INTO testtable VALUES (1,3, DATE '2017-01-01');
INSERT INTO testtable VALUES (1,3, DATE '2017-01-03');

/*assuming a key of f1 and f2 to identify a duplicate
 *and assuming that if we encounter a duplicate we want
 *to keep the newest one by the f3 date, then:
 *generate a volatile table to hold deduped records
 *using a QUALIFY clause to perform duplicate identification
 */
CREATE MULTISET VOLATILE TABLE testtable_dedup AS
(
    SELECT * FROM testtable
    QUALIFY ROW_NUMBER() OVER (PARTITION BY f1, f2 /*key*/ ORDER BY f3 desc /*date for each key sorted descending*/) = 1 /*keep the newest record*/
) WITH DATA 
 PRIMARY INDEX (f1, f2)
 ON COMMIT PRESERVE ROWS;

/*show what records are being dropped*/
SELECT * FROM testtable
MINUS
SELECT * FROM testtable_dedup;

/*Delete everything*/
DELETE FROM testtable ALL;

/*And reload from the dedup volatile table*/
INSERT INTO testtable SELECT * FROM testtable_dedup;

SELECT * FROM testtable;

/*Clean up*/
DROP TABLE testtable_dedup;
DROP TABLE testtable;

【Comments】: