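【Question title】: How can I efficiently do a massive database update?
【Posted】: 2009-04-09 16:12:32
【Question】:

I have a table with some duplicate entries. I have to discard all but one of them and then update the one that remains. I tried doing this with a temporary table and a WHILE statement, like this: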

CREATE TABLE #tmp_ImportedData_GenericData
(
    Id int identity(1,1),
    tmpCode varchar(255)  NULL,
    tmpAlpha3Code varchar(50)  NULL,
    tmpRelatedYear int NOT NULL,
    tmpPreviousValue varchar(255)  NULL,
    tmpGrowthRate varchar(255)  NULL
)

INSERT INTO #tmp_ImportedData_GenericData
SELECT
    MCS_ImportedData_GenericData.Code,
    MCS_ImportedData_GenericData.Alpha3Code,
    MCS_ImportedData_GenericData.RelatedYear,
    MCS_ImportedData_GenericData.PreviousValue,
    MCS_ImportedData_GenericData.GrowthRate
FROM MCS_ImportedData_GenericData
INNER JOIN
(
    SELECT CODE, ALPHA3CODE, RELATEDYEAR, COUNT(*) AS NUMROWS
    FROM MCS_ImportedData_GenericData AS M
    GROUP BY M.CODE, M.ALPHA3CODE, M.RELATEDYEAR
    HAVING count(*) > 1
) AS M2 ON MCS_ImportedData_GenericData.CODE = M2.CODE
    AND MCS_ImportedData_GenericData.ALPHA3CODE = M2.ALPHA3CODE
    AND MCS_ImportedData_GenericData.RELATEDYEAR = M2.RELATEDYEAR
WHERE
(MCS_ImportedData_GenericData.PreviousValue <> 'INDEFINITO')

 -- SELECT * from #tmp_ImportedData_GenericData
 -- DROP TABLE #tmp_ImportedData_GenericData

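DECLARE @counter int
DECLARE @rowsCount int
-- Loop variables; types match the temp table columns above
DECLARE @Code varchar(255)
DECLARE @Alpha3Code varchar(50)
DECLARE @RelatedYear int
DECLARE @OldValue varchar(255)
DECLARE @GrowthRate varchar(255)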

SET @counter = 1

SELECT @rowsCount =  count(*) from #tmp_ImportedData_GenericData
-- PRINT @rowsCount

WHILE @counter <= @rowsCount  -- Id starts at 1, so <= is needed to reach the last row
BEGIN
    SELECT 
        @Code = tmpCode, 
        @Alpha3Code = tmpAlpha3Code, 
        @RelatedYear = tmpRelatedYear, 
        @OldValue = tmpPreviousValue, 
        @GrowthRate = tmpGrowthRate 
    FROM 
        #tmp_ImportedData_GenericData
    WHERE 
        Id = @counter

    DELETE FROM MCS_ImportedData_GenericData 
    WHERE 
        Code = @Code 
        AND Alpha3Code = @Alpha3Code  
        AND RelatedYear = @RelatedYear  
        -- Parentheses keep the OR from matching NULL rows across the whole table
        AND (PreviousValue <> 'INDEFINITO' OR PreviousValue IS NULL)

    UPDATE 
        MCS_ImportedData_GenericData 
        SET 
          PreviousValue = @OldValue, GrowthRate = @GrowthRate 
    WHERE 
        Code = @Code 
        AND Alpha3Code = @Alpha3Code  
        AND RelatedYear = @RelatedYear  
        AND MCS_ImportedData_GenericData.PreviousValue ='INDEFINITO'

    SET @counter = @counter + 1
END

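But this takes a very long time, even though there are only 20,000 to 30,000 rows to process.

Does anyone have suggestions for improving the performance?

Thanks in advance!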

【Comments】:

  • Agreed. T-SQL has loops, but it is not optimized for them.
  • If this is specific to Microsoft SQL Server, please tag it sqlserver. Well, either that or I give up and ignore the sql tag.

Tags: sql sql-update temp-tables


【Answer 1】:
WITH q AS (
        SELECT  m.*,
                -- Alias added so the outer DELETE can filter on rn
                ROW_NUMBER() OVER (
                    PARTITION BY CODE, ALPHA3CODE, RELATEDYEAR
                    ORDER BY CASE WHEN PreviousValue = 'INDEFINITO' THEN 1 ELSE 0 END
                ) AS rn
        FROM    MCS_ImportedData_GenericData m
        WHERE   PreviousValue <> 'INDEFINITO'
        )
DELETE
FROM    q
WHERE   rn > 1
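
The CTE above covers only the delete. For the update half of the question, the WHILE loop could be replaced by two set-based statements. This is only a sketch: it assumes the #tmp_ImportedData_GenericData table from the question has already been filled, and that the DELETE was meant to hit only the non-'INDEFINITO' (or NULL) rows of the duplicated keys. If several temp rows share a key, UPDATE ... FROM picks one of them arbitrarily.

```sql
-- Remove the non-'INDEFINITO' (or NULL) rows of every duplicated key at once.
DELETE m
FROM   MCS_ImportedData_GenericData AS m
WHERE  (m.PreviousValue <> 'INDEFINITO' OR m.PreviousValue IS NULL)
  AND  EXISTS (
           SELECT 1
           FROM   #tmp_ImportedData_GenericData AS t
           WHERE  t.tmpCode        = m.Code
             AND  t.tmpAlpha3Code  = m.Alpha3Code
             AND  t.tmpRelatedYear = m.RelatedYear
       );

-- Copy the saved values onto the remaining 'INDEFINITO' rows.
UPDATE m
SET    m.PreviousValue = t.tmpPreviousValue,
       m.GrowthRate    = t.tmpGrowthRate
FROM   MCS_ImportedData_GenericData AS m
INNER JOIN #tmp_ImportedData_GenericData AS t
        ON  t.tmpCode        = m.Code
        AND t.tmpAlpha3Code  = m.Alpha3Code
        AND t.tmpRelatedYear = m.RelatedYear
WHERE  m.PreviousValue = 'INDEFINITO';
```

Each statement touches the table once instead of once per temp row, which is where most of the loop's time is likely to go.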

【Comments】:

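    【Answer 2】:

    Quassnoi's answer uses SQL Server 2005+ syntax, so I thought I'd put in my tuppence worth using something more generic...

    First, to delete all the duplicates but not the "original", you need a way to tell the duplicate records apart. (That is the ROW_NUMBER() part of Quassnoi's answer.)

    In your case the source data doesn't appear to have an identity column (you create one in the temp table). If that is so, two options come to mind:
    1. Add an identity column to the data, then delete the duplicates
    2. Create a "de-duplicated" data set, delete everything from the original, then insert the de-duplicated data back into the original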
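
Adding the identity column for the first option might look like the following on SQL Server. This is a sketch: the column name id is chosen to match the queries in this answer, and IDENTITY is SQL Server syntax, so other databases need their own auto-numbering equivalent.

```sql
-- SQL Server numbers the existing rows when the column is added.
-- The order in which existing rows receive their values is not guaranteed,
-- so "MIN(id)" only means "some one row of each group", not "the oldest".
ALTER TABLE MCS_ImportedData_GenericData
    ADD id int IDENTITY(1,1) NOT NULL;
```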

    Option 1 could be something like... (using the newly created ID field)

    DELETE
       [data]
    FROM
       MCS_ImportedData_GenericData AS [data]
    WHERE
       id > (
             SELECT
                MIN(id)
             FROM
                MCS_ImportedData_GenericData
             WHERE
                CODE = [data].CODE
                AND ALPHA3CODE = [data].ALPHA3CODE
                AND RELATEDYEAR = [data].RELATEDYEAR
            )
    

    Or...

    DELETE
       [data]
    FROM
       MCS_ImportedData_GenericData AS [data]
    INNER JOIN
    (
       SELECT
          MIN(id) AS [id],
          CODE,
          ALPHA3CODE,
          RELATEDYEAR
       FROM
          MCS_ImportedData_GenericData
       GROUP BY
          CODE,
          ALPHA3CODE,
          RELATEDYEAR
    )
    AS [original]
       ON [original].CODE = [data].CODE
       AND [original].ALPHA3CODE = [data].ALPHA3CODE
       AND [original].RELATEDYEAR = [data].RELATEDYEAR
       AND [original].id <> [data].id
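
Option 2 is not shown above. A minimal sketch of it, under two loudly stated assumptions: the duplicate rows are identical in every column (so DISTINCT collapses them), and the table has no identity column. The temp table name #dedup_GenericData is made up for the example, and SELECT ... INTO is SQL Server syntax.

```sql
-- Build the de-duplicated copy first, outside the transaction.
SELECT DISTINCT *
INTO   #dedup_GenericData            -- hypothetical temp table name
FROM   MCS_ImportedData_GenericData;

BEGIN TRANSACTION;

-- Empty the original, then put back one copy of each row.
DELETE FROM MCS_ImportedData_GenericData;

INSERT INTO MCS_ImportedData_GenericData
SELECT * FROM #dedup_GenericData;

COMMIT TRANSACTION;

DROP TABLE #dedup_GenericData;
```

If the duplicates differ in some columns (for example one row holds 'INDEFINITO' and another holds a real value), DISTINCT will keep both, and the SELECT would need a rule for choosing which row survives.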
    

    【Comments】:

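      【Answer 3】:

      My grasp of the syntax being used isn't good enough to post an exact answer, but here is an approach.

      Identify the rows you want to keep (e.g. select value, ... from ... where ...)

      Apply the update logic while identifying them (e.g. select value + 1 ... from ... where ...)

      Insert the selection into a new table.

      Drop the original, rename the new table to the original name, and recreate all grants/synonyms/triggers/indexes/FKs/... (or truncate the original and insert-select from the new one)

      Obviously this has considerable overhead, but if you want to update or purge millions of rows, it will be the fastest way.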
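
In T-SQL, the steps above could be sketched roughly as follows. The new table name, the column list (only the five columns visible in the question) and the WHERE filter are placeholders, not the asker's real keep/update rules; sp_rename is the SQL Server system procedure for renaming objects.

```sql
-- Steps 1-3: select the rows to keep, apply any update logic in the
-- SELECT list, and write the result to a new table in one pass.
SELECT  Code,
        Alpha3Code,
        RelatedYear,
        PreviousValue,
        GrowthRate
INTO    MCS_ImportedData_GenericData_New       -- hypothetical name
FROM    MCS_ImportedData_GenericData
WHERE   PreviousValue <> 'INDEFINITO';         -- placeholder "rows to keep" filter

-- Step 4: swap the tables.
BEGIN TRANSACTION;

DROP TABLE MCS_ImportedData_GenericData;

EXEC sp_rename 'MCS_ImportedData_GenericData_New',
               'MCS_ImportedData_GenericData';

COMMIT TRANSACTION;

-- Recreate indexes, constraints, triggers and grants here;
-- none of them carry over to the new table.
```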

      【Comments】:
