【Question title】: Filter rows in PostgreSQL based on values of consecutive rows in one column
【Posted】: 2017-04-04 02:20:48
【Question】:

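I am working with the following PostgreSQL table:

(Screenshot: 10 rows from the PostgreSQL table)

For each business_id, I want to filter out the businesses whose review_count does not meet a specific review_count threshold for 2 consecutive months (or rows). The threshold varies with the city the business_id is in. For example, in the screenshot above we can assume that rows with city = Charlotte have a review_count threshold of >= 2, while rows with city = Las Vegas have a review_count threshold of >= 3. If a business_id does not have at least one instance of consecutive months with review_counts at or above the specified threshold, I want to filter it out.

I want this query to return only the business_ids that satisfy this condition (along with all the other columns that appear with that business_id in the table). The composite primary key of this table is (business_id, year, month).

As you may notice, some months are missing from the data (month 9 for the second business_id). When that is the case, I do not want to count the 2 rows as "consecutive months". For example, for the Las Vegas business, I do not want months 8 and 10 treated as "consecutive months", even though they appear in consecutive rows.

I have tried something like this, but I have hit a wall and don't think it gets me very far: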

SELECT *
FROM us_business_monthly_review_growth
WHERE business_id IN (SELECT DISTINCT(business_id)
                      FROM us_business_monthly_review_growth
                      GROUP BY business_id, year, month
                      HAVING (city = 'Las Vegas' 
                             AND (CASE WHEN COUNT(review_count >= 2 * 2.21) >= 2))
                             OR (city = 'Charlotte' AND (CASE WHEN COUNT(review_count >= 2 * 1.95) >= 2))

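I am new to Postgres and StackOverflow, so if you have any feedback on the way I asked this question, please let me know! =)

Update

Thanks to help from @Gordon Linoff, I found the following solution: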

SELECT *
FROM us_businesses_monthly_growth_and_avg
WHERE business_id IN (SELECT DISTINCT business_id
                      FROM (SELECT *,
                                   lag(year) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_year,
                                   lag(month) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_month,
                                   lag(review_count) OVER (PARTITION BY business_id ORDER BY year, month) AS prev_review_count
                            FROM us_businesses_monthly_growth_and_avg
                           ) AS usga
                      -- this month and the previous one must both meet the city's threshold
                      -- and be exactly one calendar month apart
                      WHERE (city = 'Charlotte' AND review_count >= 4 * 1.95 AND prev_review_count >= 4 * 1.95 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1)
                         OR (city = 'Las Vegas' AND review_count >= 4 * 3.31 AND prev_review_count >= 4 * 3.31 AND (year * 12 + month) = (prev_year * 12 + prev_month) + 1));
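
The `(year * 12 + month)` comparison is what keeps gaps and year boundaries from being miscounted. The arithmetic can be checked in isolation with a small sketch; the helper names `month_index` and `is_next_month` are made up for illustration and do not appear in the query.

```python
def month_index(year: int, month: int) -> int:
    """Collapse (year, month) into one running month number."""
    return year * 12 + month


def is_next_month(prev_year: int, prev_month: int, year: int, month: int) -> bool:
    """True when (year, month) is exactly one calendar month after the previous pair."""
    return month_index(year, month) == month_index(prev_year, prev_month) + 1


print(is_next_month(2016, 8, 2016, 9))    # True: adjacent months
print(is_next_month(2016, 8, 2016, 10))   # False: month 9 is missing
print(is_next_month(2016, 12, 2017, 1))   # True: December rolls over to January
```

Comparing `month` values alone would reject the December to January case, which is why the year is folded into the index.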

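【Question comments】:

  • Please do not use tags that do not apply to your question.
  • Do you want it to return all rows for each qualifying business_id, or only the rows that are consecutive and have a sufficient review count?
  • Yes, correct @toonice, I want it to return all the rows corresponding to the business_id.
  • As a suggestion, when posting a question it would be great to include the scripts that create and populate the sample table and sample data (if they are available or easy to create). While this is not always needed or missed, it really helps those developing answers to test their code.

Tags: sql postgresql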


【Solution 1】:

You can use lag():

select distinct business_id
from (select t.*,
             lag(year) over (partition by business_id order by year, month) as prev_year,
             lag(month) over (partition by business_id order by year, month) as prev_month,
             lag(rating) over (partition by business_id order by year, month) as prev_rating
      from us_business_monthly_review_growth t
     ) t
where rating >= $threshhold and prev_rating >= $threshhold and
      (year * 12 + month) = (prev_year * 12 + prev_month) + 1;

The only trick is setting the threshold. I don't know how you intend to do that.
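
One possible way to supply the threshold, sketched under assumptions: keep a small city-to-threshold mapping and join it in, so each row is compared against its own city's value. The sketch below runs against an in-memory SQLite database (3.25 or newer, for window functions) as a stand-in for PostgreSQL. The `city_thresholds` CTE, the sample rows, and the thresholds 2 and 3 (borrowed from the question's example) are illustrative, and the comparison uses `review_count` as in the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE us_business_monthly_review_growth (
    business_id  TEXT,
    city         TEXT,
    year         INTEGER,
    month        INTEGER,
    review_count INTEGER,
    PRIMARY KEY (business_id, year, month)
);
-- Made-up sample rows, not the data from the question's screenshot.
INSERT INTO us_business_monthly_review_growth VALUES
    ('biz_a', 'Charlotte', 2016,  7, 2),
    ('biz_a', 'Charlotte', 2016,  8, 3),
    ('biz_b', 'Las Vegas', 2016,  8, 4),
    ('biz_b', 'Las Vegas', 2016, 10, 5),
    ('biz_c', 'Las Vegas', 2016, 12, 3),
    ('biz_c', 'Las Vegas', 2017,  1, 1);
""")

QUERY = """
WITH city_thresholds(city, threshold) AS (
    -- hypothetical per-city thresholds; a real table could replace this CTE
    VALUES ('Charlotte', 2), ('Las Vegas', 3)
)
SELECT DISTINCT t.business_id
FROM (SELECT g.*,
             lag(year)         OVER w AS prev_year,
             lag(month)        OVER w AS prev_month,
             lag(review_count) OVER w AS prev_review_count
      FROM us_business_monthly_review_growth g
      WINDOW w AS (PARTITION BY business_id ORDER BY year, month)
     ) t
JOIN city_thresholds c ON c.city = t.city
WHERE t.review_count >= c.threshold
  AND t.prev_review_count >= c.threshold
  AND (t.year * 12 + t.month) = (t.prev_year * 12 + t.prev_month) + 1
ORDER BY t.business_id
"""

qualifying = [row[0] for row in conn.execute(QUERY)]
print(qualifying)  # ['biz_a']
```

Here biz_b is rejected because months 8 and 10 are not adjacent, and biz_c because its January count is below the Las Vegas threshold. An inner join drops rows whose city has no threshold entry, which may or may not be the desired behavior.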

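【Comments】:

  • @Gordan Awesome! To clarify, are you saying that you are not sure how to set different thresholds depending on the "city" value? I just want to make sure I am interpreting this correctly; it is a great start though!
  • Also, won't this return only the business_ids that satisfy the condition? I would actually like to return "the other columns that appear with that business_id in the table". Each row is uniquely identified by the combination (business_id, year, month), which is the primary key.
  • @dsc03 . . . I would suggest using join or exists to get the other rows.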
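
A minimal sketch of the `exists` variant mentioned in the last comment, assuming a single threshold bound as a query parameter: the outer query returns every row of a business as soon as one qualifying pair of consecutive months exists for it. It runs on in-memory SQLite as a stand-in for PostgreSQL; the sample rows are made up, and the `:threshold` placeholder is sqlite3's syntax (PostgreSQL drivers use their own placeholder style).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE us_business_monthly_review_growth (
    business_id  TEXT,
    city         TEXT,
    year         INTEGER,
    month        INTEGER,
    review_count INTEGER,
    PRIMARY KEY (business_id, year, month)
);
-- Made-up sample rows for illustration only.
INSERT INTO us_business_monthly_review_growth VALUES
    ('biz_a', 'Charlotte', 2016,  7, 2),
    ('biz_a', 'Charlotte', 2016,  8, 3),
    ('biz_a', 'Charlotte', 2016,  9, 0),
    ('biz_b', 'Las Vegas', 2016,  8, 4),
    ('biz_b', 'Las Vegas', 2016, 10, 5);
""")

QUERY = """
SELECT g.*
FROM us_business_monthly_review_growth g
WHERE EXISTS (
    SELECT 1
    FROM (SELECT business_id, year, month, review_count,
                 lag(year)         OVER w AS prev_year,
                 lag(month)        OVER w AS prev_month,
                 lag(review_count) OVER w AS prev_review_count
          FROM us_business_monthly_review_growth
          WINDOW w AS (PARTITION BY business_id ORDER BY year, month)
         ) t
    WHERE t.business_id = g.business_id          -- correlate with the outer row
      AND t.review_count >= :threshold
      AND t.prev_review_count >= :threshold
      AND (t.year * 12 + t.month) = (t.prev_year * 12 + t.prev_month) + 1
)
ORDER BY g.business_id, g.year, g.month
"""

rows = conn.execute(QUERY, {"threshold": 2}).fetchall()
for row in rows:
    print(row)
```

All three biz_a rows come back, including the month with a count of 0, because the business qualifies as a whole. No biz_b rows are returned because its two months are not adjacent.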

【Solution 2】:

Please try...

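SELECT business_id
FROM
(
    -- LAG() with an offset of -1 reads the NEXT row in the window ordering
    SELECT business_id AS business_id,
           LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
           city,
           -- number of months between this row and the next row
           ( LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) ) - ( year * 12 + month ) AS diffInDates,
           review_count AS review_count,
           LAG( review_count, -1 ) OVER ( ORDER BY business_id, year, month ) AS next_review_count
    FROM us_business_monthly_review_growth
        order BY business_id,
                 year,
                 month
) tempTable
JOIN tblCityThresholds ON tblCityThresholds.city = tempTable.city
WHERE business_id = lag_in_business_id
  AND diffInDates = 1
  -- both this month and the next must meet the city's threshold
  AND tblCityThresholds.threshold <= review_count
  AND tblCityThresholds.threshold <= next_review_count
GROUP BY business_id;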

In developing this answer, I first used the following code to test whether LAG() performs as expected...

SELECT business_id,
       LAG( business_id, 1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       LAG( year, 1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, 1 ) OVER ( ORDER BY business_id, year, month ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;

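Here I was trying to get LAG() to reference the value from the next row, but the output showed that it was referencing the previous row. Since I wanted to compare the current value with the next one, to see whether the next record has the same business_id and so on, I changed the 1 in LAG() to -1, which gave me...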

SELECT business_id,
       LAG( business_id, -1 ) OVER ( ORDER BY business_id, year, month ) AS lag_in_business_id,
       year,
       month,
       LAG( year, -1 ) OVER ( ORDER BY business_id, year, month ) * 12 + LAG( month, -1 ) OVER ( ORDER BY business_id, year, month ) AS diffInDates
FROM mytable
ORDER BY business_id,
         year,
         month;

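Because this gave me the results I wanted, I added city to allow a JOIN between the results and an assumed table holding each city and its corresponding threshold. I chose the name tblCityThresholds as a suggestion, since I am not sure what you have or would call it. This completes the inner SELECT statement.

I then joined the results of the inner SELECT statement to tblCityThresholds and refined the output according to your criteria. Note: this assumes that every city value always has a corresponding entry in tblCityThresholds.

Finally, I used GROUP BY to make sure that no business_id is repeated.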
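
As a side note, the "look at the next row" step is more commonly written with LEAD() than with a negative LAG() offset, and adding PARTITION BY business_id makes the separate business_id comparison unnecessary, because the window never crosses into another business. A minimal sketch on in-memory SQLite (a stand-in for PostgreSQL) follows; the sample rows and the column alias `months_to_next` are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE mytable (business_id TEXT, year INTEGER, month INTEGER);
-- Made-up sample rows for illustration only.
INSERT INTO mytable VALUES
    ('biz_a', 2016, 12),
    ('biz_a', 2017,  1),
    ('biz_b', 2016,  8),
    ('biz_b', 2016, 10);
""")

QUERY = """
SELECT business_id, year, month,
       -- month index of the next row minus the month index of this row
       ((lead(year) OVER w) * 12 + (lead(month) OVER w)) - (year * 12 + month) AS months_to_next
FROM mytable
WINDOW w AS (PARTITION BY business_id ORDER BY year, month)
ORDER BY business_id, year, month
"""

rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
# ('biz_a', 2016, 12, 1)
# ('biz_a', 2017, 1, None)
# ('biz_b', 2016, 8, 2)
# ('biz_b', 2016, 10, None)
```

The last row of each business gets NULL (None in Python) because there is no next row inside its partition, so a filter such as `months_to_next = 1` discards it automatically.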

If you have any questions or comments, please feel free to post a comment accordingly.

Further reading

https://www.postgresql.org/docs/8.4/static/functions-window.html (on LAG())
