How do I delete duplicates, and update the records that refer to those duplicates in SQL (answers)

【Question title】: How do I delete duplicates, and update the records that refer to those duplicates in SQL
【Posted】: 2017-06-12 12:52:26
【Question description】:

I have two tables:

User:(int id, varchar unique username)

Items: (int id, varchar name, int user_id)

Currently the User table contains duplicates that differ only in case, for example:

1,John
2,john
3,sally
4,saLlY

The Items table then contains

1,myitem,1
2,mynewitem,2
3,my-item,3
4,mynew-item,4

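I have already updated the code that inserts into the User table so that it always inserts lower-case usernames.

However, I need to migrate the database so that the duplicates are removed from the User table and the references in the Items table are updated, so that users don't lose access to their items.

I.e. the data after the migration would be:

User: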

1,john
3,sally

Items:

1,myitem,1
2,mynewitem,1
3,my-item,3
4,mynew-item,3

Because the User table has a unique constraint on username, I can't simply lower-case everything with something like

update public.user set username =lower(username)
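
The failure described above can be reproduced with a minimal sketch. It uses Python's built-in sqlite3 module with an in-memory database rather than H2 (an assumption made only so the example is self-contained), and the table name `users` is a stand-in for `public.user`.

```python
import sqlite3

# Assumption: SQLite stands in for H2; `users` stands in for public.user.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE)")
conn.executemany(
    "INSERT INTO users (id, username) VALUES (?, ?)",
    [(1, "John"), (2, "john"), (3, "sally"), (4, "saLlY")],
)
conn.commit()  # persist the sample rows before attempting the update


def try_naive_lowercase(connection):
    """Run the blanket UPDATE; return True on success, False on a constraint error."""
    try:
        with connection:  # rolls back automatically if the statement fails
            connection.execute("UPDATE users SET username = lower(username)")
        return True
    except sqlite3.IntegrityError:
        return False


naive_update_succeeded = try_naive_lowercase(conn)
usernames_after = [
    row[0] for row in conn.execute("SELECT username FROM users ORDER BY id")
]
print(naive_update_succeeded, usernames_after)
```

Lower-casing 'John' collides with the existing 'john', so the statement is rejected and the table is left unchanged. The duplicates have to be removed before the names can be normalised.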

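【Question comments】:

I'm using the H2 database.
First update the items so that they all point to the correct version of the user, then delete the users you don't need.
I can do this in Java or another programming language; I'd like to know whether it can be done purely in SQL.
Add a row number partitioned by username, then delete the rows where the row number > 1. msdn.microsoft.com/en-us/library/ms186734.aspx and then stackoverflow.com/questions/2334712/…

Tags: sql h2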

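【Solution 1】:

The following code was tested in the web console with "H2 1.3.176 (2014-04-05) / embedded mode". As you said, two queries should solve the problem. There is also an additional preparation statement covering a case that does not appear in your data but should be considered as well. The preparation statement is explained later; let's start with the two main queries:

First, every items.userid is rewritten to point to the user entry whose name is lower case, as follows: call the lower-case entries main and the non-lower-case entries dup. Every items.userid that refers to a dup.id is then set to the corresponding main.id. A main entry corresponds to a dup entry if the names match in a case-insensitive comparison, i.e. main.name = lower(dup.name).

Second, all dup entries are deleted from the user table. A dup entry is one where name <> lower(name).

So much for the basic requirement. In addition, we should consider that for some users there may be only entries containing upper-case characters and no "lower-case entry". To handle this case, a preparation statement is run first: for each group of names that are equal ignoring case and have no lower-case entry, it sets the name of one entry in the group (the one with the lowest id) to lower case.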

drop table if exists usr;

CREATE TABLE usr
    (`id` int primary key, `name` varchar(5))
;

INSERT INTO usr
    (`id`, `name`)
VALUES
    (1, 'John'),
    (2, 'john'),
    (3, 'sally'),
    (4, 'saLlY'),
    (5, 'Mary'),
    (6, 'mAry')

;

drop table if exists items;

CREATE TABLE items
    (`id` int, `name` varchar(10), `userid` int references usr (`id`))
;

INSERT INTO items
    (`id`, `name`, `userid`)
VALUES
    (1, 'myitem', 1),
    (2, 'mynewitem', 2),
    (3, 'my-item', 3),
    (4, 'mynew-item', 4)
;

update usr set name = lower(name) where id in (select min(ui.id) as minid from usr ui where lower(ui.name) not in (select ui2.name from usr ui2)
group by lower(name));

update items set userid =
(select umain.id as mainid from usr udupl, usr umain
 where umain.name = lower(umain.name)
     and lower(udupl.name) = lower(umain.name)
     and udupl.id = userid
);

delete from usr where name <> lower(name);

select * from usr;

select * from items;

Executing the statements above produces the following results:

select * from usr;
ID  | NAME
----|-----
2   | john
3   | sally
5   | mary

select * from items;
ID | NAME     |USERID  
---|----------|------
1  |myitem    | 2
2  |mynewitem | 2
3  |my-item   | 3
4  |mynew-item| 3

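【Discussion】:

【Solution 2】:

You can delete the duplicate users once you have correctly updated the item references. In the example below I keep the user with the lowest id as the correct one, if that is acceptable to you.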

--Prepare data
create TABLE #users  
(id int primary key, username varchar(15));

INSERT INTO #users
(id, username)
select
1, 'John'
union all select
2, 'john'
union all select
3, 'sally'
union all select
4, 'saLlY'
union all select
5, 'Mary'
union all select
6, 'mAry'


create TABLE #items  
(itemid int, name varchar(10), userid int references #users (id));

INSERT INTO #items
(itemid, name, userid)
select
1, 'myitem', 1
union all select
2, 'mynewitem', 2
union all select
3, 'my-item', 3
union all select
4, 'mynew-item', 4
;

--Update items
update #items 
set userid =minid 
from
 (
select minid,id from 
(
select min(id) as minid,lower(username) as newusername
from #users group by lower(username)) t inner join #users 
on t.newusername = lower(username)) t2 inner join #items on t2.id = userid


--delete duplicates users, according to minimum id
delete from #users where id not in (
select min(id) from #users group by lower(username))

--set the remaining users names to lower
update #users
set username = lower(username)

--Clean temp data
drop table #users
drop table #items

This was tested on SQL Server; since you asked for plain SQL, I think it should work for you as well.
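
Since `#temp` tables and `UPDATE ... FROM` are SQL Server syntax, the same keep-the-lowest-id idea can also be sketched using only correlated subqueries, which most engines accept. The following is a minimal sketch under stated assumptions: it runs against SQLite through Python's sqlite3 module rather than H2, and the table and column names (`users`, `items`, `user_id`) are chosen to mirror the question, not taken from this answer.

```python
import sqlite3

# Assumption: SQLite stands in for H2; names mirror the question's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE);
CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT,
                    user_id INTEGER REFERENCES users (id));
INSERT INTO users VALUES (1, 'John'), (2, 'john'), (3, 'sally'), (4, 'saLlY');
INSERT INTO items VALUES (1, 'myitem', 1), (2, 'mynewitem', 2),
                         (3, 'my-item', 3), (4, 'mynew-item', 4);
""")

MIGRATION = """
-- 1. Point every item at the lowest id among users sharing its owner's lower-cased name.
UPDATE items
SET user_id = (
    SELECT MIN(keep.id)
    FROM users AS keep
    JOIN users AS cur ON lower(keep.username) = lower(cur.username)
    WHERE cur.id = items.user_id
);
-- 2. Delete every user that is not the lowest id of its group.
DELETE FROM users
WHERE id NOT IN (SELECT MIN(id) FROM users GROUP BY lower(username));
-- 3. No duplicates remain, so lower-casing can no longer violate the unique constraint.
UPDATE users SET username = lower(username);
"""
conn.executescript(MIGRATION)

users = conn.execute("SELECT id, username FROM users ORDER BY id").fetchall()
items = conn.execute("SELECT id, name, user_id FROM items ORDER BY id").fetchall()
print(users)
print(items)
```

With the question's sample data this leaves users 1 and 3 and repoints items 2 and 4 to them. Note that step 1 would set `user_id` to NULL for an item whose owner does not exist in `users`, so a guard may be needed on real data.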

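【Discussion】:

【Solution 3】:

I'm not familiar with H2. You can try the following, which was written for SQL Server with a case-sensitive, accent-sensitive database collation.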

create table t_user(id int not null identity(1,1), username varchar(25) unique);
alter table t_user add constraint pk_id_user primary key(id);

create table t_items(id int not null identity(1,1), name varchar(25), user_id int);
alter table t_items add constraint pk_id_items primary key(id);
alter table t_items add constraint fk_user_id foreign key(user_id) references t_user(id);

insert into t_user (username) values ('John'), ('john'), ('sally'), ('saLlY');
insert into t_items (name, user_id) values ('myitem', 1), ('mynewitem', 2), ('my-item', 3), ('mynew-item',4);

select * from t_user
select * from t_items

create table t_user_mig(id int not null identity(1,1), username varchar(25) unique);
alter table t_user_mig add constraint pk_id_user_mig primary key(id);

create table t_items_mig(id int not null identity(1,1), name varchar(25), user_id int);
alter table t_items_mig add constraint pk_id_items_mig primary key(id);
alter table t_items_mig add constraint fk_user_id_mig foreign key(user_id) references t_user_mig(id);

insert into t_user_mig select distinct lower(username) from t_user
insert into t_items_mig
select ti.name, (select id from t_user_mig where username = lower(tu.username)) 
from t_items ti, t_user tu 
where ti.user_id = tu.id

select * from t_user_mig
select * from t_items_mig

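I replaced your tables user and items with t_user and t_items. These tables are migrated into t_user_mig and t_items_mig.

You can try it in H2. I would appreciate your feedback.

Hope this helps.

【Discussion】:

【Solution 4】: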

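First update the items:

-- u is the user an item currently points to; u2 is the row to keep (rn = 1) in the same case-insensitive name group
update i
set userid = u2.userid
from items i
   inner join users u on i.userid=u.userid
   inner join (select userid, username, row_number() over (partition by lower(username) order by userid) as rn from users) u2 on lower(u2.username)=lower(u.username) and u2.rn=1

Then create the new users table from the original data:

select userid, lower(username) username 
into NewUserTable
from (select userid, username, row_number() over (partition by lower(username) order by userid) as rn from users) u 
where rn=1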

【Discussion】:

Not sure whether H2 supports window functions. I'll leave this as a SQL Server solution.
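
Whether a given H2 release supports window functions should be checked against its documentation. Purely as an illustration of the ROW_NUMBER idea, here is a hedged sketch run against SQLite (which supports window functions from version 3.25) through Python's sqlite3 module. The `users` table and its columns are assumptions mirroring the question.

```python
import sqlite3

# Assumption: SQLite >= 3.25 stands in for the target database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT UNIQUE);
INSERT INTO users VALUES (1, 'John'), (2, 'john'), (3, 'sally'), (4, 'saLlY');
""")

# Number the rows inside each case-insensitive name group; rn = 1 marks the row to keep.
ranked = conn.execute("""
    SELECT id, username,
           ROW_NUMBER() OVER (PARTITION BY lower(username) ORDER BY id) AS rn
    FROM users
    ORDER BY id
""").fetchall()

# Map every user id to the id that survives in its group (the lowest one).
id_to_kept = dict(conn.execute("""
    SELECT id,
           FIRST_VALUE(id) OVER (PARTITION BY lower(username) ORDER BY id) AS kept_id
    FROM users
""").fetchall())

print(ranked)
print(id_to_kept)
```

Rows with `rn > 1` are the duplicates to delete, and `id_to_kept` is the mapping needed to repoint the items before deleting them.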

【Solution 5】:

This code works perfectly on SQL Server.

Try it, it should help you (you may need to make small changes to suit your database engine):

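-- Map every duplicate id (id2) to the lowest id with the same lower-cased username.
-- USER is a reserved word in SQL Server, so the table name is bracketed.
SELECT MIN(U1.id) id,U2.id id2
INTO #User_Tmp
FROM [User] U1 JOIN [User] U2 
ON LOWER(U2.username) = LOWER(U1.username) 
AND U1.id < U2.id
GROUP BY U2.id

-- Repoint the items that reference a duplicate user
UPDATE It
SET It.user_id = U.id
FROM Items It
JOIN #User_Tmp U
ON U.id2 = It.user_id

DELETE FROM [User]
WHERE id IN 
(
    SELECT id2 FROM #User_Tmp
)

SELECT *
FROM [User]

SELECT *
FROM Items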

DROP TABLE #User_Tmp;

Hope this answers the question.

【Discussion】:

【Solution 6】:

BEGIN TRAN
CREATE TABLe #User (UserID Int, UserName Nvarchar(255))

INSERT INTO #USER
SELECT 1,'John' UNION ALL
SELECT 2,'John'  UNION ALL
SELECT 3,'sally' UNION ALL
SELECT 4,'saLlY'

CREATE TABLE #items  
(itemid int, name varchar(10), userid int );

INSERT INTO #items
(itemid, name, userid)
select
1, 'myitem', 1
union all select
2, 'mynewitem', 2
union all select
3, 'my-item', 3
union all select
4, 'mynew-item', 4

GO
WITH CTE (UserName, DuplicateCount)
AS
(
    -- Number the rows within each case-insensitive name group, lowest UserID first
    SELECT UserName,
    ROW_NUMBER() OVER(PARTITION BY LOWER(UserName)
    ORDER BY UserID) AS DuplicateCount
    FROM #User

)
Delete from CTE Where DuplicateCount > 1

Select * from #User

Select * from #items

ROLLBACK TRAN

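【Discussion】:

After deleting the duplicate records you can simply update the table.

【Solution 7】:

Try the MERGE statement: it lets you find the duplicates and also update the values of the duplicates.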

MERGE [INTO] <target table>

USING <source table or table expression>

ON <join/merge predicate> (semantics similar to outer join)

WHEN MATCHED THEN <statement to run when match found in target>

WHEN NOT MATCHED [BY TARGET] THEN <statement to run when no match found in target>

【Discussion】: