【Question title】: What is the equivalent of CUME_DIST() in SQL Server 2008?
【Posted】: 2012-05-16 00:39:20
【Question description】:

SQL Server 2012 appears to have introduced CUME_DIST() and PERCENT_RANK(), which compute the cumulative distribution of a column. Is there an equivalent in SQL Server 2008 that achieves this?

【Question discussion】:

    Tags: sql sql-server sql-server-2008 tsql analytics


    【Solution 1】:

    Never say never in SQL.

    The statement:

    select percent_rank() over (partition by <x> order by <y>)
    

    is essentially equivalent to:

    select 1.0 * row_number() over (partition by <x> order by <y>) / count(*) over (partition by <x>)
    

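    "Essentially" means that it works when the data has no duplicates. Even with duplicates, it should be close enough.

    The "real" answer is that it is equivalent to: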

    select 1.0 * row_number() over (partition by <x> order by <y>) / count(distinct <y>) over (partition by <x>)
    

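    However, count(distinct) is not available as a window function, and expressing it in SQL Server 2008 is a pain, so skip it unless you really need it.

    The cume_dist() function is harder, because it needs a cumulative sum, which on 2008 requires a self-join. Here is an approximation that assumes no duplicates. Note that it accumulates the values of <y>, so it returns each row's running share of the partition total, whereas CUME_DIST() itself is defined on row counts: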

    with t as (select <x>, <y>,
                      row_number() over (partition by <x> order by <y>) as seqnum
               from <table>
              )
    -- running sum of <y> divided by the partition total of <y>
    select s.*, s.sumy * 1.0 / sum(s.<y>) over (partition by s.<x>)
    from (select t.<x>, t.<y>, t.seqnum, sum(tprev.<y>) as sumy
          from t left outer join
               t tprev
               on t.<x> = tprev.<x> and t.seqnum >= tprev.seqnum
          group by t.<x>, t.<y>, t.seqnum
         ) s
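
    For reference, the documented definitions are PERCENT_RANK = (rank - 1) / (rows in partition - 1) and CUME_DIST = (rows with a value less than or equal to the current one) / (rows in partition). Both can be written with RANK() and COUNT(*) OVER (PARTITION BY ...), which SQL Server 2008 supports. A minimal sketch, assuming a hypothetical temp table #Scores with columns grp and val (these names are illustrative and do not come from the answer above):

    ```sql
    -- Hypothetical sample table; names and values are illustrative only.
    CREATE TABLE #Scores (grp INT, val INT);
    INSERT INTO #Scores VALUES (1, 10), (1, 20), (1, 20), (1, 40);

    SELECT grp, val,
           -- (rank - 1) / (rows in partition - 1); NULLIF avoids dividing by
           -- zero for single-row partitions, where PERCENT_RANK() returns 0
           pr = COALESCE(1.0 * (RANK() OVER (PARTITION BY grp ORDER BY val) - 1)
                         / NULLIF(COUNT(*) OVER (PARTITION BY grp) - 1, 0), 0),
           -- rows with val <= current val = (rank - 1) + number of tied rows
           cd = 1.0 * (RANK() OVER (PARTITION BY grp ORDER BY val)
                       + COUNT(*) OVER (PARTITION BY grp, val) - 1)
                / COUNT(*) OVER (PARTITION BY grp)
    FROM #Scores
    ORDER BY grp, val;

    DROP TABLE #Scores;
    ```

    Because ties share a rank and are counted together by the second COUNT(*), the two rows with val = 20 both get cd = 0.75, so this form does not rely on the no-duplicates assumption.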
    

    【Discussion】:

      【Solution 2】:

      No equivalent function exists before SQL Server 2012, but one possible workaround involves a recursive CTE, at least for

      SET NOCOUNT ON;
      
      DECLARE @t TABLE(i INT);
      DECLARE @i INT=0;
      
      WHILE @i<30 BEGIN
          -- each row simulates the sum of two six-sided dice (2 to 12)
          INSERT INTO @t VALUES (CAST(RAND()*6 AS INT)+1 + CAST(RAND()*6 AS INT)+1);
          SET @i+=1;
      END
      
      DECLARE @tc INT; SELECT @tc=COUNT(*) FROM @t;
      
      WITH a AS (
          SELECT *
          , d=CAST(COUNT(1)OVER(PARTITION BY i) AS DECIMAL(5,2)) / @tc -- no ORDER BY: SQL Server 2008 rejects it in aggregate window functions
          , r=ROW_NUMBER()OVER(ORDER BY i)
          , pr=CAST((RANK()OVER(ORDER BY i)-1)AS DECIMAL(5,2)) / (@tc - 1)
          FROM @t
      )
      , rcte (i, d, r, cd, pr) AS (
          SELECT i, d, r, d, pr
          FROM a
          WHERE r=1
      
          UNION ALL
      
          SELECT a.i, a.d, a.r
          , CASE WHEN rcte.i<>a.i THEN CAST(rcte.cd+a.d AS DECIMAL(5,2)) ELSE rcte.cd END
          , a.pr
          FROM a
          INNER JOIN rcte ON rcte.r + 1 = a.r
      )
      SELECT i,cd,pr FROM rcte
      OPTION (MAXRECURSION 32767)
      

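      Results (the data are random, so every run differs; cd is rounded to two decimal places at each recursive step by the DECIMAL(5,2) cast):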

      i           cd                                      pr
      ----------- --------------------------------------- ---------------------------------------
      2           0.0333333333333                         0.0000000000000
      3           0.0700000000000                         0.0344827586206
      4           0.2400000000000                         0.0689655172413
      4           0.2400000000000                         0.0689655172413
      4           0.2400000000000                         0.0689655172413
      4           0.2400000000000                         0.0689655172413
      4           0.2400000000000                         0.0689655172413
      5           0.3100000000000                         0.2413793103448
      5           0.3100000000000                         0.2413793103448
      6           0.3800000000000                         0.3103448275862
      6           0.3800000000000                         0.3103448275862
      7           0.5100000000000                         0.3793103448275
      7           0.5100000000000                         0.3793103448275
      7           0.5100000000000                         0.3793103448275
      7           0.5100000000000                         0.3793103448275
      8           0.6100000000000                         0.5172413793103
      8           0.6100000000000                         0.5172413793103
      8           0.6100000000000                         0.5172413793103
      9           0.8400000000000                         0.6206896551724
      9           0.8400000000000                         0.6206896551724
      9           0.8400000000000                         0.6206896551724
      9           0.8400000000000                         0.6206896551724
      9           0.8400000000000                         0.6206896551724
      9           0.8400000000000                         0.6206896551724
      9           0.8400000000000                         0.6206896551724
      10          0.8700000000000                         0.8620689655172
      11          0.9700000000000                         0.8965517241379
      11          0.9700000000000                         0.8965517241379
      11          0.9700000000000                         0.8965517241379
      12          1.0000000000000                         1.0000000000000
      

      Here is the SQL Server 2012 equivalent of the CTE above:

      SELECT *
      , cd=CUME_DIST()OVER(ORDER BY i)
      , pr=PERCENT_RANK()OVER(ORDER BY i)
      FROM @t;
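
      For comparison, the same two values can also be computed without recursion by counting rows directly, which avoids the per-step rounding. This is only a sketch under assumptions: the fixed sample values below are hypothetical (chosen so the output is reproducible), and the correlated subqueries read the table once per row, so it suits small inputs.

      ```sql
      -- Hypothetical fixed sample so that the output is reproducible.
      DECLARE @t TABLE (i INT);
      INSERT INTO @t VALUES (2), (3), (4), (4), (5);

      DECLARE @tc INT;
      SELECT @tc = COUNT(*) FROM @t;

      SELECT t.i,
             -- rows with a value <= the current one, divided by all rows
             cd = (SELECT COUNT(*) FROM @t AS t2 WHERE t2.i <= t.i) * 1e0 / @tc,
             -- rows with a strictly smaller value, divided by (rows - 1);
             -- yields NULL for a single-row table, where PERCENT_RANK() gives 0
             pr = (SELECT COUNT(*) FROM @t AS t2 WHERE t2.i < t.i) * 1e0
                  / NULLIF(@tc - 1, 0)
      FROM @t AS t
      ORDER BY t.i;
      ```

      For these five rows the tied value 4 gets cd = 0.8 and pr = 0.5 on both rows, which follows from the definitions: four of five rows are less than or equal to 4, and two of the other four rows are strictly smaller.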
      

      【Discussion】:

        【Solution 3】:

        This gets very close. First, some sample data:

        USE tempdb;
        GO
        
        CREATE TABLE dbo.DartScores
        (
            TournamentID INT,
            PlayerID INT,
            Score INT
        );
        
        INSERT dbo.DartScores VALUES
        (1, 1, 320),
        (1, 2, 340),
        (1, 3, 310),
        (1, 4, 370),
        (2, 1, 310),
        (2, 2, 280),
        (2, 3, 370),
        (2, 4, 370);    
        

        Now, the 2012 version of the query:

        SELECT TournamentID, PlayerID, Score, 
          pr = PERCENT_RANK() OVER (PARTITION BY TournamentID ORDER BY Score),
          cd = CUME_DIST()    OVER (PARTITION BY TournamentID ORDER BY Score)
        FROM dbo.DartScores
        ORDER BY TournamentID, pr;
        

        Which produces this result:

        TournamentID PlayerID Score pr                  cd
        1            3        310   0                   0.25
        1            1        320   0.333333333333333   0.5
        1            2        340   0.666666666666667   0.75
        1            4        370   1                   1
        2            2        280   0                   0.25
        2            1        310   0.333333333333333   0.5
        2            3        370   0.666666666666667   1
        2            4        370   0.666666666666667   1
        

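        An equivalent for 2005 comes very close, but it doesn't handle ties well. Sorry, I'm out of gas tonight, otherwise I would help figure out why. Nearly everything I know about this I learned from Itzik's new High Performance window function book.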

        ;WITH cte AS
        (
            SELECT TournamentID, PlayerID, Score,
             rk = RANK()   OVER (PARTITION BY TournamentID ORDER BY Score),
             rn = COUNT(*) OVER (PARTITION BY TournamentID)
            FROM dbo.DartScores
        )
        SELECT TournamentID, PlayerID, Score,
          pr = 1e0*(rk-1)/(rn-1),
          cd = 1e0*(SELECT COALESCE(MIN(cte2.rk)-1, cte.rn)
            FROM cte AS cte2 WHERE cte2.rk > cte.rk) / rn
        FROM cte;
        

        Which produces this result (note how the cume_dist values differ slightly for the ties):

        TournamentID PlayerID Score pr                  cd
        1            3        310   0                   0.25
        1            1        320   0.333333333333333   0.5
        1            2        340   0.666666666666667   0.75
        1            4        370   1                   1
        2            2        280   0                   0.25
        2            1        310   0.333333333333333   0.5
        2            3        370   0.666666666666667   0.75
        2            4        370   0.666666666666667   0.75
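
        The tie discrepancy appears to come from the correlated subquery: it compares ranks across all tournaments, so the tied 370 rows in tournament 2 (rank 3) pick up rank 4 from tournament 1 and end up at 3/4. A sketch of one possible fix restricts the subquery to the current TournamentID. The src CTE is an assumption added here so the query runs on its own; it inlines the same eight sample rows.

        ```sql
        ;WITH src (TournamentID, PlayerID, Score) AS
        (
            -- same rows as the sample data, inlined so the sketch is self-contained
            SELECT * FROM (VALUES
                (1, 1, 320), (1, 2, 340), (1, 3, 310), (1, 4, 370),
                (2, 1, 310), (2, 2, 280), (2, 3, 370), (2, 4, 370)
            ) AS v (TournamentID, PlayerID, Score)
        ),
        cte AS
        (
            SELECT TournamentID, PlayerID, Score,
             rk = RANK()   OVER (PARTITION BY TournamentID ORDER BY Score),
             rn = COUNT(*) OVER (PARTITION BY TournamentID)
            FROM src
        )
        SELECT TournamentID, PlayerID, Score,
          pr = 1e0*(rk-1)/(rn-1),
          -- next higher rank minus 1 = rows at or below the current score;
          -- when no higher rank exists, every row in the partition qualifies
          cd = 1e0*(SELECT COALESCE(MIN(cte2.rk)-1, cte.rn)
            FROM cte AS cte2
            WHERE cte2.TournamentID = cte.TournamentID  -- added: stay inside the partition
              AND cte2.rk > cte.rk) / rn
        FROM cte
        ORDER BY TournamentID, pr;
        ```

        Traced by hand on the sample rows, this gives cd = 1 for both 370 rows in tournament 2, in line with the 2012 output. Like the original, pr still divides by zero if a tournament has only one row.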
        

        Don't forget to clean up:

        DROP TABLE dbo.DartScores;
        

        【Discussion】:

          【Solution 4】:

          Yes, there is a simple solution, at least for the percent_rank() part. You can use

          1.0*(rank() over (partition by <x> order by <y>)-1)/(count(*) over (partition by <x>)-1)
          

          This will give you exactly the same result as

          percent_rank() over (partition by <x> order by <y>)
          

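          The rank() function is one of the few analytic functions that already exist in SQL Server 2008.

          【Discussion】:

          • The OP asked about CUME_DIST.
          • ...the OP also asked about percent rank.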