Merge duplicate table rows that are used in a relation with another table

【Question title】: Merge duplicate table rows that are used in a relation with another table
【Posted】: 2021-03-06 09:25:50
【Question】:

My table structure is as follows:

table_a
id | customer_id | product_id
---+-------------+-----------
 1 | c1          | p1
 2 | c1          | p1
 3 | c2          | p1

table_b
id | table_a_id | attribute
---+------------+----------
99 | 1          | a1
98 | 2          | a2
97 | 3          | a3

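As you can see, table_a contains duplicate rows, and I want to merge them. Unfortunately, the table_a primary key is also referenced by table_b.

The end result should be: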

table_a
id | customer_id | product_id
---+-------------+-----------
 1 | c1          | p1
 3 | c2          | p1

table_b
id | table_a_id | attribute
---+------------+----------
99 | 1          | a1
98 | 1          | a2
97 | 3          | a3

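I have to repoint the table_b rows to the surviving table_a rows, and then delete the table_a rows that are no longer referenced.

Unfortunately, the only query I could come up with is very heavy, and the database times out before it can finish. table_a has 200k+ records, and table_b has at least twice as many.

My idea was:

1. Join table_a and table_b to get (table_b_id, table_a_customer_id, table_a_product_id).
2. Get a grouped version of table_a (to pick the right table_a id, I simply used min("id")).
3. Inner join the two results above and update table_b with the result.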
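
The three steps above can be sketched as a single PostgreSQL update ... from statement. This is only an illustration of the idea as described, not the asker's actual query. It assumes the table and column names from the question and that customer_id and product_id are never NULL. The aliases g and m are made up for this sketch. It covers the table_b update only, not the cleanup of table_a.

```sql
-- Sketch only: assumes the schema shown in the question (PostgreSQL).
update table_b tb
set table_a_id = m.min_id
from (
    -- pair every duplicate table_a row with the smallest id of its group
    select a.id, g.min_id
    from table_a a
    join (
        -- grouped version of table_a: one row per (customer_id, product_id)
        select customer_id, product_id, min(id) as min_id
        from table_a
        group by customer_id, product_id
    ) g
      on g.customer_id = a.customer_id
     and g.product_id = a.product_id
    where a.id <> g.min_id          -- skip rows that are already the survivor
) m
where tb.table_a_id = m.id;
```

With the sample data from the question, only the table_b row with id 98 would be touched, moving its table_a_id from 2 to 1.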

Tags: sql postgresql duplicates sql-update sql-delete

【Solution 1】:

Here is one option using common table expressions:

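with
    -- for every table_a row, find the smallest id in its (customer_id, product_id) group
    ta as (
        select ta.*, min(id) over(partition by customer_id, product_id) as min_id
        from table_a ta
    ),
    -- repoint table_b rows that reference a duplicate to the surviving row
    upd as (
        update table_b tb
        set table_a_id = ta.min_id
        from ta
        where tb.table_a_id = ta.id and ta.id <> ta.min_id
    )
-- delete every table_a row that has a same-group row with a smaller id
delete from table_a ta1
using ta
where
    ta1.customer_id = ta.customer_id
    and ta1.product_id = ta.product_id
    and ta1.id > ta.id;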

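The first CTE associates each row of table_a with its target id. We then use that information to update table_b. Finally, we delete the duplicate rows in table_a, keeping only the row with the smallest id in each group.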
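
After the statement has run, two quick checks can confirm the outcome described above. This is a sketch that assumes only the table and column names from the question. Both queries should return no rows if the merge worked as intended.

```sql
-- Check 1: no (customer_id, product_id) pair should appear more than once.
select customer_id, product_id, count(*) as copies
from table_a
group by customer_id, product_id
having count(*) > 1;

-- Check 2: every table_b row should still point to an existing table_a row.
select tb.id, tb.table_a_id
from table_b tb
left join table_a ta on ta.id = tb.table_a_id
where ta.id is null;
```

Running the merge and these checks inside one transaction leaves the option to roll back if either query returns rows.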
