【Question title】: Delete Duplicate Records with Same Values
【Posted】: 2013-04-23 21:29:22
【Question】:

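I have a T-SQL statement that takes several hours to run. I know I need to look at the import process to stop duplicates from being inserted, but for now I just want to delete all but one record for each set of duplicate values. ParameterValueId is the primary key on the table, but I have many duplicate entries that need to be deleted. I only need one record for each combination of ParameterId, SiteId, MeasurementDateTime, and ParameterValue. Below is my current method for deleting the duplicates. It finds all value combinations with a count > 1, then finds the first Id with those values, and then deletes every record with those values whose Id does not match that first Id. Apart from removing the print statements, is there a more efficient way to do this? Can I get rid of the cursor somehow to improve performance?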

BEGIN TRANSACTION

SET NOCOUNT ON

DECLARE @BeginningRecordCount INT
SET @BeginningRecordCount =
(
    SELECT COUNT(*) 
    FROM ParameterValues
)

DECLARE @ParameterId UNIQUEIDENTIFIER
DECLARE @SiteId UNIQUEIDENTIFIER
DECLARE @MeasurementDateTime DATETIME
DECLARE @ParameterValue FLOAT


DECLARE CDuplicateValues CURSOR FOR
SELECT 
     [ParameterId]
    ,[SiteId]
    ,[MeasurementDateTime]
    ,[ParameterValue]
FROM [ParameterValues]
GROUP BY
     [ParameterId]
    ,[SiteId]
    ,[MeasurementDateTime]
    ,[ParameterValue]
HAVING COUNT(*) > 1

OPEN CDuplicateValues
FETCH NEXT FROM CDuplicateValues INTO
     @ParameterId
    ,@SiteId
    ,@MeasurementDateTime
    ,@ParameterValue

DECLARE @FirstParameterValueId UNIQUEIDENTIFIER
DECLARE @DuplicateRecordsDeleting INT
WHILE @@FETCH_STATUS <> -1
BEGIN
    SET @FirstParameterValueId =
    (
        SELECT TOP 1 ParameterValueId
        FROM ParameterValues
        WHERE
                ParameterId = @ParameterId
            AND SiteId = @SiteId
            AND MeasurementDateTime = @MeasurementDateTime
            AND ParameterValue = @ParameterValue
    )

    SET @DuplicateRecordsDeleting =
    (
        SELECT COUNT(*)
        FROM ParameterValues
        WHERE
                ParameterId = @ParameterId
            AND SiteId = @SiteId
            AND MeasurementDateTime = @MeasurementDateTime
            AND ParameterValue = @ParameterValue
            AND ParameterValueId <> @FirstParameterValueId
    )

    PRINT 'DELETING ' + CAST(@DuplicateRecordsDeleting AS NVARCHAR(50))
        + ' records with values ParameterId : ' + CAST(@ParameterId AS NVARCHAR(50))
        + ', SiteId : ' + CAST (@SiteId AS NVARCHAR(50))
        + ', MeasurementDateTime : ' + CAST(@MeasurementDateTime AS NVARCHAR(50))
        + ', ParameterValue : ' + CAST(@ParameterValue AS NVARCHAR(50))

    DELETE FROM ParameterValues
        WHERE
                ParameterId = @ParameterId
            AND SiteId = @SiteId
            AND MeasurementDateTime = @MeasurementDateTime
            AND ParameterValue = @ParameterValue
            AND ParameterValueId <> @FirstParameterValueId

    FETCH NEXT FROM CDuplicateValues INTO
         @ParameterId
        ,@SiteId
        ,@MeasurementDateTime
        ,@ParameterValue
END
CLOSE CDuplicateValues
DEALLOCATE CDuplicateValues

DECLARE @EndingRecordCount INT
SET @EndingRecordCount =
(
    SELECT COUNT(*) 
    FROM ParameterValues
)

PRINT 'Beginning Record Count   :   ' + CAST(@BeginningRecordCount AS NVARCHAR(50))
PRINT 'Ending Record Count      :   ' + CAST(@EndingRecordCount AS NVARCHAR(50))
PRINT 'Total Records Deleted    :   ' + CAST((@BeginningRecordCount - @EndingRecordCount) AS NVARCHAR(50))

SET NOCOUNT OFF

PRINT 'RUN THE COMMIT OR ROLLBACK STATEMENT AFTER VERIFYING DATA.'
--COMMIT
--ROLLBACK

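【Comments】:

    Tags: sql tsql sql-server-2005 performance


    【Solution 1】:

    Use the option with a CTE and the OVER clause. The OUTPUT ... INTO clause saves information from the rows affected by the DELETE statement into the @delParameterValues table variable. Later, in the body of the procedure, you can use this table to print the affected rows.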

    DECLARE @delParameterValues TABLE
     (
      ParameterId UNIQUEIDENTIFIER, 
      SiteId UNIQUEIDENTIFIER,
      MeasurementDateTime DATETIME,
      ParameterValue FLOAT,
      DeletedRecordCount int
      )
    
    ;WITH cte AS
    (
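     -- rn numbers the rows inside each duplicate group; ORDER BY 1/0 only serves as a dummy ordering, so which row gets rn = 1 is arbitrary.
     -- cnt is the size of the whole group, including the row that is kept.
     SELECT *, ROW_NUMBER() OVER (PARTITION BY [ParameterId],[SiteId],[MeasurementDateTime],[ParameterValue] ORDER BY 1/0) AS rn,
            COUNT(*) OVER (PARTITION BY [ParameterId],[SiteId],[MeasurementDateTime],[ParameterValue]) AS cnt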
     FROM [ParameterValues]
     )
     DELETE cte
     OUTPUT DELETED.[ParameterId], 
            DELETED.[SiteId], 
            DELETED.[MeasurementDateTime],
            DELETED.[ParameterValue],
            DELETED.cnt INTO @delParameterValues
     WHERE rn != 1
    
     SELECT DISTINCT *
     FROM @delParameterValues
    

    Demo on SQLFiddle
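
Since the demo link is not reproduced here, the following is a minimal self-contained sketch of the same CTE approach. The scratch table #ParameterValues, its sample rows, and the variables @P and @S are hypothetical; only the column names come from the question. Unlike the answer's query, this sketch orders by ParameterValueId so that the surviving row in each group is deterministic. That is an assumption about which row should be kept, not something the question requires.

```sql
-- Hypothetical scratch table mirroring the columns named in the question.
CREATE TABLE #ParameterValues
(
    ParameterValueId    UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID() PRIMARY KEY,
    ParameterId         UNIQUEIDENTIFIER NOT NULL,
    SiteId              UNIQUEIDENTIFIER NOT NULL,
    MeasurementDateTime DATETIME         NOT NULL,
    ParameterValue      FLOAT            NOT NULL
);

DECLARE @P UNIQUEIDENTIFIER, @S UNIQUEIDENTIFIER;
SET @P = NEWID();
SET @S = NEWID();

-- Three copies of one measurement plus one unique measurement.
INSERT INTO #ParameterValues (ParameterId, SiteId, MeasurementDateTime, ParameterValue)
SELECT @P, @S, '20130401 00:00', 1.5 UNION ALL
SELECT @P, @S, '20130401 00:00', 1.5 UNION ALL
SELECT @P, @S, '20130401 00:00', 1.5 UNION ALL
SELECT @P, @S, '20130402 00:00', 2.5;

-- Number the rows in each duplicate group and delete everything after the first.
;WITH cte AS
(
    SELECT ParameterValueId,
           ROW_NUMBER() OVER (PARTITION BY ParameterId, SiteId, MeasurementDateTime, ParameterValue
                              ORDER BY ParameterValueId) AS rn
    FROM #ParameterValues
)
DELETE FROM cte
WHERE rn > 1;

-- One row per distinct measurement is left: 2 rows.
SELECT COUNT(*) AS RemainingRows FROM #ParameterValues;

DROP TABLE #ParameterValues;
```

The whole cleanup is one set-based statement, so the per-group lookups that the cursor version repeats for every duplicate combination are no longer needed.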

    【Comments】:

      【Solution 2】:

      You can do it in a single SQL statement:

      DELETE p FROM ParameterValues p
      LEFT JOIN
      (SELECT ParameterId, SiteId, MeasurementDateTime, ParameterValue, MAX(ParameterValueId) AS MAX_PARAM
       FROM ParameterValues
       GROUP BY ParameterId, SiteId, MeasurementDateTime, ParameterValue
      ) m
      ON m.ParameterId = p.ParameterId
        AND m.SiteId = p.SiteId
        AND m.MeasurementDateTime = p.MeasurementDateTime
        AND m.ParameterValue = p.ParameterValue
        AND m.MAX_PARAM = p.ParameterValueId
      WHERE m.ParameterId IS NULL
      

      Of course it will not print the per-group output, but you can still print the row counts before and after.
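
As a rough sketch of how the deleted-row total could still be reported with a single statement, @@ROWCOUNT can be read immediately after the DELETE instead of counting the table twice. The scratch table #Readings, its columns, and the sample rows are hypothetical. It uses an INT key so that MAX works on any SQL Server version.

```sql
-- Hypothetical scratch table with an integer key.
CREATE TABLE #Readings
(
    ReadingId INT IDENTITY(1,1) PRIMARY KEY,
    SiteCode  VARCHAR(10) NOT NULL,
    Reading   FLOAT       NOT NULL
);

-- 'A' appears twice, 'B' three times, 'C' once: three redundant rows in total.
INSERT INTO #Readings (SiteCode, Reading)
SELECT 'A', 1.0 UNION ALL
SELECT 'A', 1.0 UNION ALL
SELECT 'B', 2.0 UNION ALL
SELECT 'B', 2.0 UNION ALL
SELECT 'B', 2.0 UNION ALL
SELECT 'C', 3.0;

DECLARE @Deleted INT;

-- Same shape as the answer: keep the row with the highest key in each group.
DELETE r
FROM #Readings AS r
LEFT JOIN
(
    SELECT SiteCode, Reading, MAX(ReadingId) AS KeepId
    FROM #Readings
    GROUP BY SiteCode, Reading
) AS k
    ON  k.SiteCode = r.SiteCode
    AND k.Reading  = r.Reading
    AND k.KeepId   = r.ReadingId
WHERE k.KeepId IS NULL;

-- @@ROWCOUNT must be captured right away; the next statement resets it.
SET @Deleted = @@ROWCOUNT;

PRINT 'Total Records Deleted    :   ' + CAST(@Deleted AS NVARCHAR(50));  -- prints 3 for the sample rows

DROP TABLE #Readings;
```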

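      【Comments】:

      • Wouldn't you want a HAVING in the m subquery to make sure you only pick up records that are actually duplicated? This would delete one record from each group whether or not it is duplicated.
      • WHERE m.ParameterId IS NULL takes care of that.
      • In any case, I suggest the OP convert this DELETE statement into a SELECT statement and run it to see what gets selected. Whatever is selected is what will be deleted.
      • I get an error at the clause m.MAX_PARAM = p.ParameterValueId: Msg 8117, Level 16, State 1, Line 1 Operand data type uniqueidentifier is invalid for max operator. This is a SQL Server 2005 server; is that a SQL Server 2008-only expression?
      • Actually, I think the error occurs here: "MAX(ParameterValueId)". Is there a FIRST or TOP 1 equivalent?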
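
One possible way around the Msg 8117 error raised in the comments: uniqueidentifier values can still be compared with < and > even where MAX refuses to aggregate them, so a correlated EXISTS can stand in for MAX(ParameterValueId). The sketch below assumes the question's ParameterValues table and that none of the four grouping columns contains NULL, since rows with NULLs never match under =. It has not been run against the asker's data. The SELECT comes first, following the comment's advice to preview what will be deleted.

```sql
-- Preview: every row that has a twin with a higher ParameterValueId.
-- These are exactly the rows the DELETE below would remove.
SELECT p.*
FROM ParameterValues AS p
WHERE EXISTS
(
    SELECT 1
    FROM ParameterValues AS k
    WHERE k.ParameterId         = p.ParameterId
      AND k.SiteId              = p.SiteId
      AND k.MeasurementDateTime = p.MeasurementDateTime
      AND k.ParameterValue      = p.ParameterValue
      AND k.ParameterValueId    > p.ParameterValueId
);

-- Delete the same rows, keeping the row with the highest key in each group.
DELETE p
FROM ParameterValues AS p
WHERE EXISTS
(
    SELECT 1
    FROM ParameterValues AS k
    WHERE k.ParameterId         = p.ParameterId
      AND k.SiteId              = p.SiteId
      AND k.MeasurementDateTime = p.MeasurementDateTime
      AND k.ParameterValue      = p.ParameterValue
      AND k.ParameterValueId    > p.ParameterValueId
);
```

On a large table a self-join like this would probably benefit from an index covering the four grouping columns, though that is an untested assumption here.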