【Question Title】: Remove Duplicates with Caveats
【Posted】: 2010-09-14 03:12:32
【Question Description】:

I have a table with rowID, longitude, latitude, businessName, url, and caption columns. It might look like:

rowID | long | lat | businessName | url     | caption
------|------|-----|--------------|---------|--------
1     | 20   | -20 | Pizza Hut    | yum.com | null

How do I delete all the duplicates, keeping only the copy that has a URL (first priority) or, if neither copy has a URL, the one that has a caption (second priority), and remove the rest?

【Question Comments】:

  • Are duplicates based on the business name?
  • Guessing a duplicate is long + lat + businessName?
  • Duplicates are keyed on long + lat + businessName; ideally, only one best-fit row per long + lat + businessName remains at the end.
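The rule sketched in these comments (group on long + lat + businessName; prefer a row with a url, then one with a caption, then the lowest rowID) can be smoke-tested in a few lines of Python; the sample rows here are hypothetical:

```python
# Hypothetical sample rows: (rowID, long, lat, businessName, url, caption)
rows = [
    (1, 20, -20, "Pizza Hut", "yum.com", None),
    (2, 20, -20, "Pizza Hut", None, "tasty"),
    (3, 20, -20, "Pizza Hut", None, None),
    (4, 30, 10, "Taco Bell", None, "ad"),
]

def keepers(table):
    """One winner per (long, lat, businessName) group:
    a url beats a caption beats neither; ties go to the lowest rowID."""
    groups = {}
    for row in table:
        groups.setdefault(row[1:4], []).append(row)
    return sorted(
        min(group, key=lambda r: (r[4] is None, r[5] is None, r[0]))[0]
        for group in groups.values()
    )

result = keepers(rows)
print(result)  # -> [1, 4]: the url row wins for Pizza Hut; Taco Bell keeps its only row
```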

标签: sql sql-server duplicate-data


【Solution 1】:

This solution is brought to you by last week's episode of "things I learned on Stack Overflow":

DELETE FROM restaurant
WHERE rowID IN
    (SELECT rowID
     FROM restaurant
     EXCEPT
     SELECT rowID
     FROM (
         -- a non-null url wins first, then a non-null caption:
         -- NULLs sort last under DESC in SQL Server
         SELECT rowID,
                RANK() OVER (PARTITION BY businessName, lat, long
                             ORDER BY url DESC, caption DESC) AS rnk
         FROM restaurant
     ) rs
     WHERE rnk = 1)

Caveat: I haven't tested this against a real database.
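One quick way to sanity-check the ranking logic is to run an adapted version in SQLite, which also sorts NULLs last under DESC (window functions need SQLite 3.25+). The data is a hypothetical sample:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE restaurant (rowID INTEGER PRIMARY KEY,"
            " long INT, lat INT, businessName TEXT, url TEXT, caption TEXT)")
con.executemany("INSERT INTO restaurant VALUES (?,?,?,?,?,?)", [
    (1, 20, -20, "Pizza Hut", "yum.com", None),   # has a url -> should win
    (2, 20, -20, "Pizza Hut", None, "tasty"),
    (3, 20, -20, "Pizza Hut", None, None),
    (4, 30, 10, "Taco Bell", None, "ad"),          # only row in its group
])
# delete every row whose rank within its group is not 1
con.execute("""
    DELETE FROM restaurant
    WHERE rowID IN (
        SELECT rowID FROM restaurant
        EXCEPT
        SELECT rowID FROM (
            SELECT rowID,
                   RANK() OVER (PARTITION BY businessName, lat, long
                                ORDER BY url DESC, caption DESC) AS rnk
            FROM restaurant
        ) WHERE rnk = 1
    )""")
survivors = [r[0] for r in con.execute("SELECT rowID FROM restaurant ORDER BY rowID")]
print(survivors)
```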

【Comments】:

    【Solution 2】:

    Here's my looping technique. It may get downvoted for not being mainstream; I'm fine with that.

    DECLARE @LoopVar int
    
    DECLARE
      @long int,
      @lat int,
      @businessname varchar(30),
      @winner int
    
    SET @LoopVar = (SELECT MIN(rowID) FROM Locations)
    
    WHILE @LoopVar is not null
    BEGIN
      --initialize the variables.
      SELECT 
        @long = null,
        @lat = null,
        @businessname = null,
        @winner = null
    
      -- load data from the known good row.  
      SELECT
        @long = long,
        @lat = lat,
        @businessname = businessname
      FROM Locations
      WHERE rowID = @LoopVar
    
      --find the winning row with that data
      SELECT top 1 @Winner = rowID
      FROM Locations
      WHERE @long = long
        AND @lat = lat
        AND @businessname = businessname
      ORDER BY
        CASE WHEN URL is not null THEN 1 ELSE 2 END,
        CASE WHEN Caption is not null THEN 1 ELSE 2 END,
        RowId
    
      --delete any losers.
      DELETE FROM Locations
      WHERE @long = long
        AND @lat = lat
        AND @businessname = businessname
        AND @winner != rowID
    
      -- prep the next loop value.
      SET @LoopVar = (SELECT MIN(rowID) FROM Locations WHERE @LoopVar < rowID)
    END
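The same control flow can be mimicked in plain Python to see what the loop does per group; hypothetical in-memory rows stand in for the Locations table:

```python
# rowID -> (long, lat, businessname, url, caption); hypothetical sample data
locations = {
    1: (20, -20, "Pizza Hut", "yum.com", None),
    2: (20, -20, "Pizza Hut", None, "tasty"),
    3: (20, -20, "Pizza Hut", None, None),
    4: (30, 10, "Taco Bell", None, "ad"),
}

loop_var = min(locations)            # SET @LoopVar = (SELECT MIN(rowID) ...)
while loop_var is not None:
    lon, lat, name = locations[loop_var][:3]   # load the known good row
    group = [rid for rid, r in locations.items()
             if r[:3] == (lon, lat, name)]
    # find the winning row: url beats caption beats neither, then lowest rowID
    winner = min(group, key=lambda rid: (locations[rid][3] is None,
                                         locations[rid][4] is None, rid))
    for rid in group:                           # delete any losers
        if rid != winner:
            del locations[rid]
    remaining = [rid for rid in locations if rid > loop_var]
    loop_var = min(remaining) if remaining else None  # prep the next loop value

print(sorted(locations))
```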
    

    【Comments】:

    • I've used a very similar approach. This type of loop is also faster than a CURSOR, and it has the added benefit of not tying up the server's CPU. I put similar code in the other post you linked to in your question.
    • What if rowID is a char(11) column? It's the primary key, but can you take MIN(foo) of a string?
    • For a type to serve as a primary key at all, it has to establish an ordering over the table; ordering by a char(11) is no problem.
    【Solution 3】:

    A set-based solution:

    delete t1
    from T as t1
    where /* delete t1 if a "better" row
             with the same long, lat and businessName exists */
      exists(
        select * from T as t2 where
          t1.rowID <> t2.rowID
          and t1.long = t2.long
          and t1.lat = t2.lat
          and t1.businessName = t2.businessName
          and
            case when t1.url is null then 0 else 4 end
              /* 4 points for a non-null url */
            + case when t1.caption is null then 0 else 2 end
              /* 2 points for a non-null caption */
            + case when t1.rowID > t2.rowID then 0 else 1 end
              /* 1 point for having the smaller rowID, so
                 exact ties still resolve */
            <
            case when t2.url is null then 0 else 4 end
            + case when t2.caption is null then 0 else 2 end
            + case when t2.rowID > t1.rowID then 0 else 1 end
            )
    
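The scoring idea can be smoke-tested in SQLite. Note that this sketch awards the 2 points for a non-null caption (businessName is already part of the grouping key) and gives both sides the rowID tiebreaker, so rows that tie on url and caption still resolve; the data is a hypothetical sample:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE T (rowID INTEGER PRIMARY KEY,"
            " long INT, lat INT, businessName TEXT, url TEXT, caption TEXT)")
con.executemany("INSERT INTO T VALUES (?,?,?,?,?,?)", [
    (1, 20, -20, "Pizza Hut", "yum.com", None),
    (2, 20, -20, "Pizza Hut", None, "tasty"),
    (3, 20, -20, "Pizza Hut", None, None),
    (4, 30, 10, "Taco Bell", None, "ad"),
])
# delete a row if a higher-scoring row with the same key exists:
# url = 4 points, caption = 2 points, smaller rowID = 1 point (tiebreaker)
con.execute("""
    DELETE FROM T
    WHERE EXISTS (
        SELECT 1 FROM T AS t2
        WHERE T.rowID <> t2.rowID
          AND T.long = t2.long AND T.lat = t2.lat
          AND T.businessName = t2.businessName
          AND CASE WHEN T.url IS NULL THEN 0 ELSE 4 END
            + CASE WHEN T.caption IS NULL THEN 0 ELSE 2 END
            + CASE WHEN T.rowID > t2.rowID THEN 0 ELSE 1 END
            < CASE WHEN t2.url IS NULL THEN 0 ELSE 4 END
            + CASE WHEN t2.caption IS NULL THEN 0 ELSE 2 END
            + CASE WHEN t2.rowID > T.rowID THEN 0 ELSE 1 END
    )""")
survivors = [r[0] for r in con.execute("SELECT rowID FROM T ORDER BY rowID")]
print(survivors)
```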

    【Comments】:

      【Solution 4】:
      delete MyTable
      from MyTable
      left outer join (
              select min(rowID) as rowID, long, lat, businessName
              from MyTable
              where url is not null
              group by long, lat, businessName
          ) as HasUrl
          on MyTable.long = HasUrl.long
          and MyTable.lat = HasUrl.lat
          and MyTable.businessName = HasUrl.businessName
      left outer join (
              select min(rowID) as rowID, long, lat, businessName
              from MyTable
              where caption is not null
              group by long, lat, businessName
          ) HasCaption
          on MyTable.long = HasCaption.long
          and MyTable.lat = HasCaption.lat
          and MyTable.businessName = HasCaption.businessName
      left outer join (
              select min(rowID) as rowID, long, lat, businessName
              from MyTable
              where url is null
                  and caption is null
              group by long, lat, businessName
          ) HasNone 
          on MyTable.long = HasNone.long
          and MyTable.lat = HasNone.lat
          and MyTable.businessName = HasNone.businessName
      where MyTable.rowID <> 
              coalesce(HasUrl.rowID, HasCaption.rowID, HasNone.rowID)
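SQLite has no DELETE ... FROM ... JOIN, but the same keep-one-row-per-group rule (lowest rowID with a url, else with a caption, else with neither) can be checked there with correlated subqueries and the same COALESCE; the data is a hypothetical sample:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE MyTable (rowID INTEGER PRIMARY KEY,"
            " long INT, lat INT, businessName TEXT, url TEXT, caption TEXT)")
con.executemany("INSERT INTO MyTable VALUES (?,?,?,?,?,?)", [
    (1, 20, -20, "Pizza Hut", "yum.com", None),
    (2, 20, -20, "Pizza Hut", None, "tasty"),
    (3, 20, -20, "Pizza Hut", None, None),
    (4, 30, 10, "Taco Bell", None, "ad"),
])
# keeper per group = lowest rowID with a url, else with a caption, else with neither
con.execute("""
    DELETE FROM MyTable
    WHERE rowID <> COALESCE(
        (SELECT MIN(m.rowID) FROM MyTable m
         WHERE m.long = MyTable.long AND m.lat = MyTable.lat
           AND m.businessName = MyTable.businessName AND m.url IS NOT NULL),
        (SELECT MIN(m.rowID) FROM MyTable m
         WHERE m.long = MyTable.long AND m.lat = MyTable.lat
           AND m.businessName = MyTable.businessName AND m.caption IS NOT NULL),
        (SELECT MIN(m.rowID) FROM MyTable m
         WHERE m.long = MyTable.long AND m.lat = MyTable.lat
           AND m.businessName = MyTable.businessName
           AND m.url IS NULL AND m.caption IS NULL))""")
survivors = [r[0] for r in con.execute("SELECT rowID FROM MyTable ORDER BY rowID")]
print(survivors)
```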
      

      【Comments】:

        【Solution 5】:

        Similar to the other answer, but you want to delete based on the row number rather than the rank (ROW_NUMBER never ties, so exactly one row per group survives). It also combines nicely with a common table expression:

        
        ;WITH GroupedRows AS
        (   SELECT rowID, Row_Number() OVER (Partition BY BusinessName, lat, long ORDER BY url DESC, caption DESC) rowNum 
            FROM restaurant
        )
        DELETE r
        FROM restaurant r
        JOIN GroupedRows gr ON r.rowID = gr.rowID
        WHERE gr.rowNum > 1
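This one ports to SQLite (3.25+) almost verbatim, since a CTE may precede a DELETE there as well; the data is a hypothetical sample:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE restaurant (rowID INTEGER PRIMARY KEY,"
            " long INT, lat INT, businessName TEXT, url TEXT, caption TEXT)")
con.executemany("INSERT INTO restaurant VALUES (?,?,?,?,?,?)", [
    (1, 20, -20, "Pizza Hut", "yum.com", None),
    (2, 20, -20, "Pizza Hut", None, "tasty"),
    (3, 20, -20, "Pizza Hut", None, None),
    (4, 30, 10, "Taco Bell", None, "ad"),
])
# ROW_NUMBER never ties, so rowNum > 1 removes all but one row per group
con.execute("""
    WITH GroupedRows AS (
        SELECT rowID,
               ROW_NUMBER() OVER (PARTITION BY businessName, lat, long
                                  ORDER BY url DESC, caption DESC) AS rowNum
        FROM restaurant
    )
    DELETE FROM restaurant
    WHERE rowID IN (SELECT rowID FROM GroupedRows WHERE rowNum > 1)""")
survivors = [r[0] for r in con.execute("SELECT rowID FROM restaurant ORDER BY rowID")]
print(survivors)
```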
        

        【Comments】:

          【Solution 6】:

          If it's possible, could you homogenize the rows first and then remove the duplicates?

          Step 1:

          UPDATE BusinessLocations
          SET BusinessLocations.url = LocationsWithUrl.url
          FROM BusinessLocations
          INNER JOIN (
            SELECT long, lat, businessName, url, caption
            FROM BusinessLocations 
            WHERE url IS NOT NULL) LocationsWithUrl 
              ON BusinessLocations.long = LocationsWithUrl.long
              AND BusinessLocations.lat = LocationsWithUrl.lat
              AND BusinessLocations.businessName = LocationsWithUrl.businessName
          
          UPDATE BusinessLocations
          SET BusinessLocations.caption = LocationsWithCaption.caption
          FROM BusinessLocations
          INNER JOIN (
            SELECT long, lat, businessName, url, caption
            FROM BusinessLocations 
            WHERE caption IS NOT NULL) LocationsWithCaption 
              ON BusinessLocations.long = LocationsWithCaption.long
              AND BusinessLocations.lat = LocationsWithCaption.lat
              AND BusinessLocations.businessName = LocationsWithCaption.businessName
          

          Step 2: remove the duplicates.
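Both steps can be smoke-tested in SQLite using correlated UPDATEs for the homogenize pass; MAX() here simply picks one arbitrary non-null value per group, much as the joins above would. The sample rows are hypothetical:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE BusinessLocations (rowID INTEGER PRIMARY KEY,"
            " long INT, lat INT, businessName TEXT, url TEXT, caption TEXT)")
con.executemany("INSERT INTO BusinessLocations VALUES (?,?,?,?,?,?)", [
    (1, 20, -20, "Pizza Hut", "yum.com", None),
    (2, 20, -20, "Pizza Hut", None, "tasty"),
    (3, 20, -20, "Pizza Hut", None, None),
])
# Step 1: copy any known url/caption onto every row of the group
for col in ("url", "caption"):
    con.execute(f"""
        UPDATE BusinessLocations
        SET {col} = (SELECT MAX(b.{col}) FROM BusinessLocations b
                     WHERE b.long = BusinessLocations.long
                       AND b.lat = BusinessLocations.lat
                       AND b.businessName = BusinessLocations.businessName)
        WHERE {col} IS NULL""")
# Step 2: rows in a group are now identical, so keep the lowest rowID
con.execute("""
    DELETE FROM BusinessLocations
    WHERE rowID > (SELECT MIN(b.rowID) FROM BusinessLocations b
                   WHERE b.long = BusinessLocations.long
                     AND b.lat = BusinessLocations.lat
                     AND b.businessName = BusinessLocations.businessName)""")
rows = list(con.execute("SELECT rowID, url, caption FROM BusinessLocations"))
print(rows)
```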

          【Comments】:
