Optimising SQL query to find duplicates in a table with large amounts of data (answers)

【Question title】: Optimising SQL query to find duplicates in a table with large amounts of data
【Posted】: 2017-05-23 07:52:58
【Question description】:

I have the following table on SQL Server 2014:

id          field1         field2
----------- ---------------------------------
1           1                 a
2           2                 a
3           3                 a
4           3                 b
5           4                 a
6           5                 a
7           6                 b
8           1                 a
9           2                 a
10          3                 c
11          4                 b
12          4                 c
13          5                 b

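Now I want to list the records where a repeated value in field1 appears with different values in field2. I am currently using the following query to do this: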

;with tmp_cte as (
select field1,field2 from mytable (nolock)
group by field1,field2)

select * from tmp_cte cte1
where (select count(field2) from tmp_cte cte2 where cte1.field1=cte2.field1)>1

This is the result:

field1       field2
-------------------------------
3               a
3               b
3               c
4               a
4               b
4               c
5               a
5               b

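While this works, it is very slow on a table with a large amount of data (160 million records), so I would like to optimise the query; it currently struggles even with just 1 month of data (+-100,000 records). Any help would be appreciated. Thanks in advance.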

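【Question comments】:

Have you heard of indexes in SQL Server? codeproject.com/Articles/190263/Indexes-in-MS-SQL-Server
Of course I have, but indexes don't help me here. The above is only an example; my actual table is more complex, and the text fields are not indexed (not my design, but my problem). I'm sure there is a more efficient way to get the same result, which is why I'm here, full of hope :)

Tags: sql-server performance query-optimization common-table-expression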

【Solution 1】:

Try this:

-- Sample data from the question, loaded into a table variable
declare @t table (id int, field1 int, field2 varchar(10))
insert into @t values
 (1  ,1,  'a'),(2  ,2,  'a'),(3  ,3,  'a')
,(4  ,3,  'b'),(5  ,4,  'a'),(6  ,5,  'a')
,(7  ,6,  'b'),(8  ,1,  'a'),(9  ,2,  'a')
,(10 ,3,  'c'),(11 ,4,  'b'),(12 ,4,  'c')
,(13 ,5,  'b')

-- rn numbers the distinct field2 values within each field1
;with CTE as
(
    select *
        ,dense_rank() over (partition by field1 order by field2) rn
    from @t
)
-- keep every row whose field1 has at least a second distinct field2 (rn > 1)
select * from CTE c
where exists (select id from CTE c1
              where c.field1 = c1.field1 and c1.rn > 1)

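【Comments】:

This one works, and it is much faster than my original query. The only thing I needed to add was grouping on the fields, which I did. Thanks.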
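
The comment above mentions adding grouping on the fields but does not show the resulting query. A minimal sketch of what that could look like, assuming the grouping is on (field1, field2) so that each pair is listed once; the CTE name `ranked` and the aliases are hypothetical, and the table variable only reproduces the sample data from the question:

```sql
-- Sample data copied from the question (the table variable stands in for the real table).
DECLARE @t TABLE (id int, field1 int, field2 varchar(10));
INSERT INTO @t VALUES
 (1,1,'a'),(2,2,'a'),(3,3,'a'),(4,3,'b'),(5,4,'a'),(6,5,'a'),(7,6,'b'),
 (8,1,'a'),(9,2,'a'),(10,3,'c'),(11,4,'b'),(12,4,'c'),(13,5,'b');

WITH ranked AS (
    -- GROUP BY collapses repeated (field1, field2) pairs to a single row;
    -- the window function runs after grouping, so rn numbers the
    -- distinct field2 values within each field1.
    SELECT field1, field2,
           DENSE_RANK() OVER (PARTITION BY field1 ORDER BY field2) AS rn
    FROM @t
    GROUP BY field1, field2
)
SELECT r.field1, r.field2
FROM ranked AS r
-- Keep a field1 only if it has at least a second distinct field2.
WHERE EXISTS (SELECT 1
              FROM ranked AS r2
              WHERE r2.field1 = r.field1 AND r2.rn > 1)
ORDER BY r.field1, r.field2;
```

On the sample data this should return the same eight (field1, field2) pairs as the result listed in the question, without the repeated rows that `select *` would carry along.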

【Solution 2】:

Try this code:

Select field1,field2 from mytable (nolock) where field1 in
    (Select field1 from mytable (nolock) group by field1 having count(field1)>1)

【Comments】:

This query also returns records where field1 is repeated but field2 does not have more than one value.
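
The objection above comes from `count(field1)` counting rows rather than distinct field2 values. One possible adjustment, offered as a sketch and not as the answer author's query, is to count distinct field2 values in the subquery. The table variable `@t` and the alias `t` are assumptions standing in for `mytable`:

```sql
-- Sample data copied from the question (stand-in for mytable).
DECLARE @t TABLE (id int, field1 int, field2 varchar(10));
INSERT INTO @t VALUES
 (1,1,'a'),(2,2,'a'),(3,3,'a'),(4,3,'b'),(5,4,'a'),(6,5,'a'),(7,6,'b'),
 (8,1,'a'),(9,2,'a'),(10,3,'c'),(11,4,'b'),(12,4,'c'),(13,5,'b');

SELECT DISTINCT t.field1, t.field2
FROM @t AS t
WHERE t.field1 IN (
    -- COUNT(DISTINCT ...) ignores repeats of the same field2,
    -- so field1 = 1 and field1 = 2 (only 'a') are filtered out.
    SELECT field1
    FROM @t
    GROUP BY field1
    HAVING COUNT(DISTINCT field2) > 1
)
ORDER BY t.field1, t.field2;
```

With the sample rows, only field1 values 3, 4 and 5 pass the `HAVING` filter. Whether this is faster than the original query on 160 million rows is not something the thread establishes; it would need to be measured.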

【Solution 3】:

Try this:

select distinct
    t.field1,
    t.field2 
from
    mytable t
    join
    (select 
        field1
    from 
        mytable
    group by 
        field1
    having
        count(field2) > 1) sub
    on t.field1 = sub.field1
order by
    field1

【Comments】:

This query also returns records where field1 has the same value but field2 does not have more than one value.
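
As the comment notes, `count(field2) > 1` is true for any field1 that occurs twice, even with the same field2. A sketch of one way to keep the join shape but test for differing values is to compare `MIN(field2)` with `MAX(field2)` per group: they differ only when at least two distinct non-NULL values exist. This is a swapped-in technique, not the answer author's, and `@t` again stands in for `mytable`:

```sql
-- Sample data copied from the question (stand-in for mytable).
DECLARE @t TABLE (id int, field1 int, field2 varchar(10));
INSERT INTO @t VALUES
 (1,1,'a'),(2,2,'a'),(3,3,'a'),(4,3,'b'),(5,4,'a'),(6,5,'a'),(7,6,'b'),
 (8,1,'a'),(9,2,'a'),(10,3,'c'),(11,4,'b'),(12,4,'c'),(13,5,'b');

SELECT DISTINCT t.field1, t.field2
FROM @t AS t
JOIN (
    -- MIN <> MAX holds only when a field1 has two or more different field2 values.
    SELECT field1
    FROM @t
    GROUP BY field1
    HAVING MIN(field2) <> MAX(field2)
) AS sub
    ON sub.field1 = t.field1
ORDER BY t.field1, t.field2;
```

Plain `MIN`/`MAX` aggregates can sometimes be cheaper than a distinct count, but that depends on the data and the plan, so it is worth comparing both variants against the real table.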

【Solution 4】:


 ;with cte
    as
    (select field1,field2,
    count(  field1) over(partition by field1 order by field2) as cnt,
    dense_rank() over(partition by field1 order by field2) as drnk
     from #temp
    )
    select * from cte where cnt-drnk=0

With the following index on the table, this query can run quickly:

create index nci on mytable(field1,field2)

【Comments】: