【Question title】: Calculate a weighted (Bayesian) average score/index in stored procedure?
【Posted】: 2012-04-15 21:04:17
【Question description】:

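I have an MS SQL Server 2008 database that stores places serving food (cafés, restaurants, diners, etc.). On a website connected to this database, people can rate these places on a scale from 1 to 3.

The website has a page where people can view a top list of the 25 most popular (highest-rated) places in a given city. The database structure looks like this (the tables store more information, but this is the relevant part):

A place is located in a city, and votes are placed on a place.

So far I have simply calculated the average vote score for each place by dividing the sum of all votes for the place by the number of votes for the place, like this (pseudocode):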

vote_count = total number of votes for the place
vote_sum = total sum of all the votes for the place

vote_score = vote_sum/vote_count

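I also have to handle division by zero when a place has no votes. All of this is done in a stored procedure, which also fetches the other data I want to show in the top list. Here is the current stored procedure, which fetches the 25 places with the highest vote score: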

ALTER PROCEDURE [dbo].[GetTopListByCity]
    (
    @city_id Int
    )
AS
    SELECT TOP 25 dbo.Places.place_id, 
           dbo.Places.city_id,
           dbo.Places.place_name,
           dbo.Places.place_alias,
           dbo.Places.place_street_address,
           dbo.Places.place_street_number,
           dbo.Places.place_zip_code,
           dbo.Cities.city_name,
           dbo.Cities.city_alias,
           dbo.Places.place_phone,
           dbo.Places.place_lat,
           dbo.Places.place_lng,
           ISNULL(SUM(dbo.Votes.vote_score),0) AS vote_sum,
           (SELECT COUNT(*) FROM dbo.Votes WHERE dbo.Votes.place_id = dbo.Places.place_id) AS vote_count,
           COALESCE((CONVERT(FLOAT,SUM(dbo.Votes.vote_score))/(CONVERT(FLOAT,(SELECT COUNT(*) FROM dbo.Votes WHERE dbo.Votes.place_id = dbo.Places.place_id)))),0) AS vote_score

    FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
    LEFT OUTER JOIN dbo.Votes ON dbo.Places.place_id = dbo.Votes.place_id
    WHERE dbo.Places.city_id = @city_id
    AND dbo.Places.hidden = 0
    GROUP BY dbo.Places.place_id,
             dbo.Places.city_id,
             dbo.Places.place_name,
             dbo.Places.place_alias,
             dbo.Places.place_street_address,
             dbo.Places.place_street_number,
             dbo.Places.place_zip_code,
             dbo.Cities.city_name,
             dbo.Cities.city_alias,
             dbo.Places.place_phone,
             dbo.Places.place_lat,
             dbo.Places.place_lng
    ORDER BY vote_score DESC, vote_count DESC, place_name ASC

    RETURN

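As you can see, it fetches more than just the vote score: I also need data about the place, the city it is in, and so on. This works fine, but there is one big problem: the vote score is too simplistic, because it doesn't take the number of votes into account. With the simple calculation, a place with one vote of 3 ends up higher in the list than a place with fourteen votes of 3 and one vote of 2: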

3/1 = 3
(14*3 + 1*2)/15 = 44/15 = 2.933333333333

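To solve this, I have been looking into some form of weighted average/weighted index. I found an example of a true Bayesian estimate that looks promising. It looks like this: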

weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C

where:

R = average for the place (mean) = (Rating)
v = number of votes for the place = (votes)
m = minimum number of votes required to be listed in the Top 25 (unsure how many, but somewhere between 2-5 seems realistic)
C = the mean vote across the whole database
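
To see how the formula changes the ranking, the two places from the example above can be run through it. The following T-SQL is a minimal sketch, not part of the original procedure: the values m = 3 and C = 2.5 are assumptions chosen only for illustration, and the inline VALUES list stands in for the real Votes table.

```sql
-- Assumed illustration values: m (minimum votes) and C (global mean) are made up here.
DECLARE @m float = 3;
DECLARE @C float = 2.5;

SELECT
    p.place_label,
    p.vote_sum / p.vote_count AS plain_average,                    -- R
    (p.vote_count / (p.vote_count + @m)) * (p.vote_sum / p.vote_count)
        + (@m / (p.vote_count + @m)) * @C AS weighted_rating       -- WR
FROM (
    VALUES
        ('one vote of 3',              CONVERT(float, 3),  CONVERT(float, 1)),
        ('fourteen 3s and a single 2', CONVERT(float, 44), CONVERT(float, 15))
) AS p (place_label, vote_sum, vote_count)
ORDER BY weighted_rating DESC;
```

With these assumed values the single-vote place drops to 2.625, while the fifteen-vote place scores about 2.861 and ranks first.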

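The trouble starts when I try to implement this weighted rating in the stored procedure: it quickly gets complicated, I get lost in parentheses, and I lose track of what the stored procedure is doing.

Now I need help with two questions:

Is this a suitable way of calculating a weighted index for my site?

What would this (or another suitable calculation method) look like when implemented in a stored procedure?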

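【Question comments】:

    Tags: sql-server-2008 tsql stored-procedures bayesian weighted-average


    【Solution 1】:

    I can't see anything wrong with your calculation. But I can see that you do the same thing several times. My suggestion lets you do the aggregation in one place, which makes the select much easier.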

    ;WITH CTE AS
    (
        SELECT
            SUM(dbo.Votes.vote_score) AS SumOfVoteScore,
            COUNT(*) AS CountOfVotes,
            Votes.place_id
        FROM
            Votes
        GROUP BY
            Votes.place_id
    )
     SELECT TOP 25 
        dbo.Places.place_id, 
        dbo.Places.city_id,
        dbo.Places.place_name,
        dbo.Places.place_alias,
        dbo.Places.place_street_address,
        dbo.Places.place_street_number,
        dbo.Places.place_zip_code,
        dbo.Cities.city_name,
        dbo.Cities.city_alias,
        dbo.Places.place_phone,
        dbo.Places.place_lat,
        dbo.Places.place_lng,
        ISNULL(CTE.SumOfVoteScore,0) AS vote_sum,
        CTE.CountOfVotes AS vote_count,
        COALESCE((CONVERT(FLOAT,CTE.SumOfVoteScore)/
        (CONVERT(FLOAT,CTE.CountOfVotes))),0) AS vote_score
    
    FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
    LEFT JOIN CTE ON dbo.Places.place_id=CTE.place_id
    WHERE dbo.Places.city_id = @city_id
    AND dbo.Places.hidden = 0
    GROUP BY dbo.Places.place_id,
             dbo.Places.city_id,
             dbo.Places.place_name,
             dbo.Places.place_alias,
             dbo.Places.place_street_address,
             dbo.Places.place_street_number,
             dbo.Places.place_zip_code,
             dbo.Cities.city_name,
             dbo.Cities.city_alias,
             dbo.Places.place_phone,
             dbo.Places.place_lat,
             dbo.Places.place_lng,
             CTE.SumOfVoteScore,
             CTE.CountOfVotes
    ORDER BY vote_score DESC, vote_count DESC, place_name ASC
    

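    The CTE lets us reuse the aggregates, so we don't have to repeat SUM(vote_score) and SELECT COUNT(*) FROM Votes WHERE... several times. That makes the calculation in the SELECT easy to follow.

    Hope this helps.

    Edit

    You don't have to define the column list for the CTE. CTE (SumOfVoteScore, CountOfVotes, place_id) AS works just as well as CTE AS. If you use a recursive CTE, you need to define the columns, because you UNION it with the other half.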
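
    A minimal sketch of the two equivalent forms, assuming a throwaway table variable @Votes (hypothetical, standing in for dbo.Votes):

    ```sql
    -- Hypothetical sample data standing in for dbo.Votes.
    DECLARE @Votes TABLE (place_id int, vote_score int);
    INSERT INTO @Votes (place_id, vote_score) VALUES (1, 3), (1, 2), (2, 3);

    -- Form 1: column names come from the aliases inside the CTE.
    WITH CTE AS
    (
        SELECT SUM(vote_score) AS SumOfVoteScore, COUNT(*) AS CountOfVotes, place_id
        FROM @Votes
        GROUP BY place_id
    )
    SELECT place_id, SumOfVoteScore, CountOfVotes FROM CTE;

    -- Form 2: column names are listed after the CTE name, so inner aliases are optional.
    WITH CTE (SumOfVoteScore, CountOfVotes, place_id) AS
    (
        SELECT SUM(vote_score), COUNT(*), place_id
        FROM @Votes
        GROUP BY place_id
    )
    SELECT place_id, SumOfVoteScore, CountOfVotes FROM CTE;
    ```

    Both queries return the same two rows (a sum of 5 over 2 votes for place 1, and a sum of 3 over 1 vote for place 2). Note that the statement before each WITH has to end with a semicolon.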

    For reference, here and here you will find some information about CTEs.

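    【Comments】:

      【Solution 2】:

      Thanks, Arion!

      I had been looking for something like a CTE, I just didn't know that this was what I was looking for! It's always good to learn something new, and I know I will use CTEs in other projects. When I implemented your CTE in my stored procedure, I ended up with this code: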

      ALTER PROCEDURE dbo.GetTopListByCityCTE
          (
          @city_id Int
          )
      AS
      
      ;WITH CTE (SumOfVoteScore, CountOfVotes, place_id) AS
      (
          SELECT
              SUM(dbo.Votes.vote_score) AS SumOfVoteScore,
              COUNT(*) AS CountOfVotes,
              Votes.place_id
          FROM
              Votes
          GROUP BY
              Votes.place_id
      
      )
      
       SELECT TOP 25 
          dbo.Places.place_id, 
          dbo.Places.city_id,
          dbo.Places.place_name,
          dbo.Places.place_alias,
          dbo.Places.place_street_address,
          dbo.Places.place_street_number,
          dbo.Places.place_zip_code,
          dbo.Cities.city_name,
          dbo.Cities.city_alias,
          dbo.Places.place_phone,
          dbo.Places.place_lat,
          dbo.Places.place_lng,
          ISNULL(CTE.SumOfVoteScore,0) AS vote_sum,
          CTE.CountOfVotes AS vote_count,
          COALESCE((CONVERT(FLOAT,CTE.SumOfVoteScore)/
          (CONVERT(FLOAT,CTE.CountOfVotes))),0) AS vote_score
      
      FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
      LEFT JOIN CTE ON dbo.Places.place_id = CTE.place_id
      WHERE dbo.Places.city_id = @city_id
      AND dbo.Places.hidden = 0
      GROUP BY dbo.Places.place_id,
               dbo.Places.city_id,
               dbo.Places.place_name,
               dbo.Places.place_alias,
               dbo.Places.place_street_address,
               dbo.Places.place_street_number,
               dbo.Places.place_zip_code,
               dbo.Cities.city_name,
               dbo.Cities.city_alias,
               dbo.Places.place_phone,
               dbo.Places.place_lat,
               dbo.Places.place_lng,
               CTE.SumOfVoteScore,
               CTE.CountOfVotes
      ORDER BY vote_score DESC, vote_count DESC, place_name ASC
      

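      A quick check shows that it returns the same results as the previous code, but it is much easier to read and follow, and hopefully more efficient.

      Now I'll have to experiment a bit to replace the old (simple) score calculation with a new method that takes the number of votes into account.

      【Comments】:

      • Do that.. glad to help. If you are happy with my answer, could you consider accepting it?
      • I just want to make sure that the CTE helps me solve the original problem (implementing a more sophisticated score index) before I mark your answer as the solution. I'm working on the new stored procedure right now...
      • OK. I was just reminding you, because people have a habit of forgetting it :D
      • Yes, I know! Too many users just copy the answer, solve their problem and forget to mark the solution. :-(
      【Solution 3】:

      OK, this is the stored procedure I came up with: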

      ALTER PROCEDURE dbo.GetTopListByCityCTE
          (
          @city_id Int
          )
      AS
      
      DECLARE @MinimumNumber float;
      DECLARE @TotalNumberOfVotes int;
      DECLARE @AverageRating float;
      DECLARE @AverageNumberOfVotes float;
      
      /* MINIMUM NUMBER */
      SET @MinimumNumber = 1;
      
      /* TOTAL NUMBER OF VOTES -- ALL PLACES */
      SET @TotalNumberOfVotes = (
          SELECT COUNT(*) FROM Votes
      );
      
      /* AVERAGE RATING -- ALL PLACES */
      SET @AverageRating = (
          SELECT
              CONVERT(FLOAT,(SUM(dbo.Votes.vote_score))) / CONVERT(FLOAT,COUNT(*)) AS AverageRating
          FROM 
              Votes);
      
      /* AVERAGE NUMBER OF VOTES -- ALL PLACES */
      /* CURRENTLY NOT USED IN INDEX - KEPT FOR REFERENCE */
      SET @AverageNumberOfVotes = (
          SELECT AVG(CONVERT(FLOAT,NumberOfVotes)) FROM (SELECT COUNT(*) AS NumberOfVotes FROM Votes GROUP BY place_id) AS AverageNumberOfVotes
      
      );
      /* SUM OF ALL VOTE SCORES AND COUNT OF ALL VOTES -- INDIVIDUAL PLACES */
      WITH CTE AS (
          SELECT
              CONVERT(FLOAT, SUM(dbo.Votes.vote_score)) AS SumVotesForPlace,
              CONVERT(FLOAT, COUNT(*)) AS CountVotesForPlace,
              Votes.place_id
          FROM
              Votes
          GROUP BY
              Votes.place_id
      )
      
       SELECT 
          dbo.Places.place_id, 
          dbo.Places.city_id,
          dbo.Places.place_name,
          dbo.Places.place_alias,
          dbo.Places.place_street_address,
          dbo.Places.place_street_number,
          dbo.Places.place_zip_code,
          dbo.Cities.city_name,
          dbo.Cities.city_alias,
          dbo.Places.place_phone,
          dbo.Places.place_lat,
          dbo.Places.place_lng,
          ISNULL(CTE.SumVotesForPlace,0) AS vote_sum,
          ISNULL(CTE.CountVotesForPlace,0) AS vote_count,
          COALESCE((CTE.SumVotesForPlace/
          CTE.CountVotesForPlace),0) AS vote_score,
          /* WR = (v / (v + m)) * R + (m / (v + m)) * C */
          ISNULL(
              (CTE.CountVotesForPlace / (CTE.CountVotesForPlace + @MinimumNumber))
                  * COALESCE(CTE.SumVotesForPlace / CTE.CountVotesForPlace, 0)
              + (@MinimumNumber / (CTE.CountVotesForPlace + @MinimumNumber))
                  * @AverageRating,
              0) AS WeightedIndex
      
      FROM dbo.Places INNER JOIN dbo.Cities ON dbo.Places.city_id = dbo.Cities.city_id
      LEFT JOIN CTE ON dbo.Places.place_id = CTE.place_id
      WHERE dbo.Places.city_id = @city_id
      AND dbo.Places.hidden = 0
      GROUP BY dbo.Places.place_id,
               dbo.Places.city_id,
               dbo.Places.place_name,
               dbo.Places.place_alias,
               dbo.Places.place_street_address,
               dbo.Places.place_street_number,
               dbo.Places.place_zip_code,
               dbo.Cities.city_name,
               dbo.Cities.city_alias,
               dbo.Places.place_phone,
               dbo.Places.place_lat,
               dbo.Places.place_lng,
               CTE.SumVotesForPlace,
               CTE.CountVotesForPlace
      ORDER BY WeightedIndex DESC, vote_count DESC, place_name ASC
      

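      There is a variable called @AverageNumberOfVotes that is not used in the calculation, but I have kept it there in case I need it later.

      Running this against my data gives results that differ slightly from what I got before, but it is no revolution, and it is not what I need. Here are the first 10 rows returned when I execute the SP above: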

      vote_sum        vote_count  vote_score          WeightedIndex
      1110            409         2,71393643031785    2,7140960047496
      807             310         2,60322580645161    2,60449697749787
      38              15          2,53333333333333    2,56708633093525
      25              10          2,5                 2,55442722744881
      2               1           2                   2,55188848920863
      2               1           2                   2,55188848920863
      2               1           2                   2,55188848920863
      2               1           2                   2,55188848920863
      2               1           2                   2,55188848920863
      2               1           2                   2,55188848920863
      

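      The problem here seems to be that a place with only one vote, with a score of 2, gets a weighted index of 2,55188848920863?

      The formula for the index is taken from IMDB (http://www.imdb.com/chart/top), and I think that either I'm doing something wrong, or the data in my database is not comparable (in number of votes or in voting scale) to what IMDB has?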
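
      One way to check the output is to recompute it outside the procedure. The sketch below is a reconstruction under stated assumptions, not the author's code: the values m = 3 and C = 2.73585131894484 are inferred by solving the formula against the posted rows (they are not stated anywhere in the post, and the listing itself sets @MinimumNumber = 1), and the VALUES list simply copies vote_sum and vote_count from the table above.

      ```sql
      -- Inferred, not stated by the author: these two values reproduce the posted rows.
      DECLARE @m float = 3;
      DECLARE @C float = 2.73585131894484;

      WITH Posted (vote_sum, vote_count) AS
      (
          SELECT CONVERT(float, s), CONVERT(float, n)
          FROM (VALUES (1110, 409), (807, 310), (38, 15), (25, 10), (2, 1)) AS v (s, n)
      )
      SELECT
          vote_sum,
          vote_count,
          vote_sum / vote_count AS vote_score,
          (vote_count / (vote_count + @m)) * (vote_sum / vote_count)
              + (@m / (vote_count + @m)) * @C AS WeightedIndex
      FROM Posted
      ORDER BY WeightedIndex DESC;
      ```

      Under those inferred values, a place with a single vote of 2 comes out at (1/4) × 2 + (3/4) × 2.7359 ≈ 2.5519, which agrees with the posted row. The formula pulls a place with few votes strongly toward the global mean C, so a value above the place's own average of 2 is what this kind of estimate produces by design; a smaller m weakens that pull.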

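      Edit

      Is there a way I can tune this function so that it works better for me? Is there a different function/method that would work better? I still need to do the calculation in a stored procedure.

      【Comments】:

      • I'm not sure that this formula (which IMDB calls a "true Bayesian estimate") is what I need. And there is criticism of it: en.wikipedia.org/wiki/…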