【Question title】: Get top N rows per day in Hive - rank()
【Posted】: 2017-02-21 20:54:54
【Question】:

I have this table, where each row represents one sale:

 sale_date  salesman  sale_item_id
 20170102   JohnSmith       309
 20170102   JohnSmith       292
 20170103   AlexHam          93

I'm trying to get the top 20 salesmen per day, and I came up with this:

SELECT sale_date, salesman, sale_count, row_num
FROM (
  SELECT sale_date, salesman,
         count(*) as sale_count,
         rank() over (partition by sale_date order by sale_count desc) as row_num
  from salesforce.sales_data
) T
WHERE sale_date between  '20170101' and '20170110'
 and row_num <= 20

But I get:

FAILED: SemanticException Failed to breakup Windowing invocations into Groups. At least 1 group must only depend on input columns. Also check for circular dependencies.
Underlying error: org.apache.hadoop.hive.ql.parse.SemanticException: Line 5:35 Expression not in GROUP BY key 'sale_date'

I'm not sure where the GROUP BY should come in. Can anyone help? Thanks!

【Comments】:

    Tags: sql hive rank


    【Answer 1】:

    You are missing the group by in the subquery:

    SELECT sale_date, salesman, sale_count, row_num
    FROM (SELECT sale_date, salesman,
                 count(*) as sale_count,
                 rank() over (partition by sale_date order by count(*) desc) as row_num
          FROM salesforce.sales_data
          GROUP BY sale_date, salesman
         ) T
    WHERE sale_date between '20170101' and '20170110' and row_num <= 20;
    

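    I don't think Hive will accept the column alias in the window's order by (as in order by sale_count desc), because the alias is defined in the same SELECT list. That is why the query above orders by count(*) instead.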
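One way to sidestep the alias question altogether is to aggregate in an inner subquery first, so that sale_count is an ordinary column by the time rank() runs. The following is only a sketch, assuming the table and columns are exactly those shown in the question; the subquery aliases agg and ranked are illustrative names, and moving the date filter into the innermost query is an added choice, not part of the answer.

```sql
-- Sketch: aggregate first, then rank, so the window function only
-- refers to real columns of its input (no alias from the same SELECT list).
SELECT sale_date, salesman, sale_count, row_num
FROM (
  SELECT sale_date, salesman, sale_count,
         rank() OVER (PARTITION BY sale_date ORDER BY sale_count DESC) AS row_num
  FROM (
    -- One row per salesman per day. Filtering by date before aggregating
    -- does not change the per-day ranks, because the window above is
    -- partitioned by sale_date.
    SELECT sale_date, salesman, count(*) AS sale_count
    FROM salesforce.sales_data
    WHERE sale_date BETWEEN '20170101' AND '20170110'
    GROUP BY sale_date, salesman
  ) agg
) ranked
WHERE row_num <= 20;
```

The extra nesting costs a few lines but keeps each level simple: the innermost query only counts, the middle one only ranks, and the outer one only filters on the rank.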

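    Also note that if there are ties, rank() can return more than 20 rows for a day, because every row tied at rank 20 is kept. If you need exactly 20 rows, you may want row_number() instead.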
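A minimal sketch of the row_number() variant, keeping the shape of this answer's query. The secondary sort key salesman is an assumption added here so that the choice among tied salesmen is repeatable; any other tie-breaker would do.

```sql
-- Sketch: the answer's query with row_number() in place of rank().
SELECT sale_date, salesman, sale_count, row_num
FROM (SELECT sale_date, salesman,
             count(*) AS sale_count,
             -- row_number() never repeats a value within a partition, so at
             -- most 20 rows per day pass the filter below. The salesman
             -- tie-breaker is an added assumption, not part of the answer.
             row_number() OVER (PARTITION BY sale_date
                                ORDER BY count(*) DESC, salesman) AS row_num
      FROM salesforce.sales_data
      GROUP BY sale_date, salesman
     ) T
WHERE sale_date BETWEEN '20170101' AND '20170110' AND row_num <= 20;
```

Without a tie-breaker, which of several tied salesmen gets row number 20 is not guaranteed to be the same from run to run.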

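    【Comments】:

    • Thanks @Gordon - I now get the same error, but with "Expression not in GROUP BY key 'sale_count'". AFAIK aliases can't be used in the GROUP BY clause, but for the sake of it I added it to the GROUP BY clause and got "Invalid table alias or column reference 'sale_count'".
    • @lake . . . You would do that if the rank is on an aggregate.
    • I only saw the sales, my bad.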
    【Answer 2】:

    Try this:

    SELECT sale_date, salesman, sale_count, row_num
    FROM (
      SELECT sale_date, salesman, sale_count,
             rank() over (partition by sale_date order by sale_count desc) as row_num
      FROM (
        -- count(*) over (...) keeps one row per sale; partitioning by salesman
        -- alone counts each salesman's sales across all dates
        SELECT sale_date, salesman,
               count(*) over (partition by salesman) as sale_count
        FROM employee
      ) t1
    ) t2
    WHERE sale_date between '20170101' and '20170110'
      and row_num <= 20;
    

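    Edited and tested. Your problem is essentially that you are trying to use the count inside the over clause before it has been computed. Computing the count in a subquery, with a window partitioned by salesman, solves that. If you group the sales query by salesman alone, you lose access to sale_date.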

    【Comments】:
