【Question title】: Getting percentage from count in Hive
【Posted】: 2014-09-07 08:21:09
【Question】:

I have a table like the following:

COL1    COL2    DATETIMESTAMP   CATEGORY1   CATEGORY2
e-12    1101    201408110525    Arts and Entertainment  Television
e-12    1101    201408110525    Arts and Entertainment  Television
e-12    1101    201408110525    Arts and Entertainment  Television
e-12    1101    201408110620    Technology and Computing    Internet Technology
e-12    1101    201408110705    Technology and Computing    Antivirus Software
e-12    1107    201408110510    Business    Advertising
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1107    201408110520    Business    Marketing
e-12    1109    201408110505    Technology and Computing    Web Search

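Ignoring COL1 (its value is the same in every row), each COL2 has several combinations of the remaining fields. I managed to get the count of each repeated combination, with the following result: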

COL1    COL2    DATETIMESTAMP   CATEGORY1   CATEGORY2   COUNT
e-12    1101    201408110525    Arts and Entertainment  Television  3
e-12    1101    201408110620    Technology and Computing    Internet Technology 1
e-12    1101    201408110705    Technology and Computing    Antivirus Software  1
e-12    1107    201408110510    Business    Advertising 1
e-12    1107    201408110520    Business    Marketing   7
e-12    1109    201408110505    Technology and Computing    Web Search  1

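How can I convert each count into a percentage of all the combinations for its COL2?

Sorry that I can't put it into words any better, but the output should look like this: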

COL1    COL2    DATETIMESTAMP   CATEGORY1   CATEGORY2   COUNT   PERCENTAGE
e-12    1101    201408110525    Arts and Entertainment  Television  3   60%
e-12    1101    201408110620    Technology and Computing    Internet Technology 1   20%
e-12    1101    201408110705    Technology and Computing    Antivirus Software  1   20%
e-12    1107    201408110510    Business    Advertising 1   12.5%
e-12    1107    201408110520    Business    Marketing   7   87.5%
e-12    1109    201408110505    Technology and Computing    Web Search  1   100%

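Note: at this point, the COUNT column itself is no longer needed.

Is this possible in Hive? How can I modify my count query (below) to output the last table?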

SELECT COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2, count(*) FROM temp_table GROUP BY CATEGORY1, CATEGORY2, DATETIMESTAMP, COL2, COL1 SORT BY COL2;

Thanks.

【Comments】:

  • You can use two select statements to compute the counts for col2 and category2 separately, and then use them in the main select statement

Tags: sql hadoop hive


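【Answer 1】:

I can think of a couple of ways to do this. You can compute the denominator of the percentage (the total row count per col2) in a subquery, join it back to the original data, then count each combination and divide by that total. Alternatively, if you have access to windowing functions in Hive (I believe they were introduced in 0.11), you can use an OVER (PARTITION BY ...) clause in the SELECT to avoid the join described in the first approach.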

#1:

select col2, cat1, cat2, datetimestamp
    ,(COUNT(cat2) / MAX(total_)) as perc
from (
    select n.col2, cat1, cat2, datetimestamp, x.total_
    from some_table as n
    JOIN (
        select col2, COUNT(col2) as total_
        from some_table
        group by col2
         ) x
    ON x.col2 = n.col2
     ) y
group by cat1, cat2, col2, datetimestamp

#2:

select col2, cat1, cat2, datetimestamp
    ,(COUNT(col2) / MAX(total)) as perc
from (
    -- datetimestamp must be selected here because the outer query groups by it
    select col2, cat1, cat2, datetimestamp
        ,COUNT(cat1) OVER (PARTITION BY col2) as total
    from some_table
     ) x
group by cat1, cat2, col2, datetimestamp
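
For reference, here is a sketch of the windowing approach using the table and column names from the question (temp_table, COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2). It assumes a Hive version with windowing functions. The names counts, cnt and percentage are illustrative and do not come from the original posts. It wraps the question's count query in a subquery and divides each count by the sum of the counts for the same COL2:

```sql
-- Sketch only: assumes windowing functions are available in your Hive version.
-- "counts", "cnt" and "percentage" are illustrative names.
SELECT col1, col2, datetimestamp, category1, category2,
       cnt,
       -- 100.0 turns the fraction into a percentage; SUM(...) OVER adds up
       -- the counts of every combination that shares the same col2
       ROUND(100.0 * cnt / SUM(cnt) OVER (PARTITION BY col2), 1) AS percentage
FROM (
    -- the count query from the question, unchanged apart from the alias
    SELECT col1, col2, datetimestamp, category1, category2, COUNT(*) AS cnt
    FROM temp_table
    GROUP BY col1, col2, datetimestamp, category1, category2
) counts;
```

On the sample data this should give 60, 20 and 20 for COL2 = 1101 (3, 1 and 1 out of 5 rows), 12.5 and 87.5 for 1107 (1 and 7 out of 8 rows), and 100 for 1109. Drop cnt from the select list if the count is not needed in the output.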

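【Comments】:

  • I used sample #2. I ran into a problem with datetimestamp, so I added it to the inner select statement. I also multiplied perc by 100 so that it looks more like a percentage. Will my edits affect accuracy? I tested your code on the sample data above and so far, so good.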