[Posted]: 2014-09-07 08:21:09
[Question]:
I have a table like the following:
COL1  COL2  DATETIMESTAMP  CATEGORY1                 CATEGORY2
e-12  1101  201408110525   Arts and Entertainment    Television
e-12  1101  201408110525   Arts and Entertainment    Television
e-12  1101  201408110525   Arts and Entertainment    Television
e-12  1101  201408110620   Technology and Computing  Internet Technology
e-12  1101  201408110705   Technology and Computing  Antivirus Software
e-12  1107  201408110510   Business                  Advertising
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1107  201408110520   Business                  Marketing
e-12  1109  201408110505   Technology and Computing  Web Search
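Ignoring COL1 (its value is the same in every row), each COL2 value appears with several combinations of the remaining fields. I managed to count how often each combination occurs, which gives: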
COL1  COL2  DATETIMESTAMP  CATEGORY1                 CATEGORY2            COUNT
e-12  1101  201408110525   Arts and Entertainment    Television           3
e-12  1101  201408110620   Technology and Computing  Internet Technology  1
e-12  1101  201408110705   Technology and Computing  Antivirus Software   1
e-12  1107  201408110510   Business                  Advertising          1
e-12  1107  201408110520   Business                  Marketing            7
e-12  1109  201408110505   Technology and Computing  Web Search           1
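How can I turn each count into a percentage of all rows that share the same COL2?
Sorry, I can't put it into words any better than that, but the output should look like this: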
COL1  COL2  DATETIMESTAMP  CATEGORY1                 CATEGORY2            COUNT  PERCENTAGE
e-12  1101  201408110525   Arts and Entertainment    Television           3      60%
e-12  1101  201408110620   Technology and Computing  Internet Technology  1      20%
e-12  1101  201408110705   Technology and Computing  Antivirus Software   1      20%
e-12  1107  201408110510   Business                  Advertising          1      12.5%
e-12  1107  201408110520   Business                  Marketing            7      87.5%
e-12  1109  201408110505   Technology and Computing  Web Search           1      100%
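Note: at that point the COUNT column itself is no longer needed.
Is this possible in Hive? How should I modify my count query (below) so that it outputs the last table?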
SELECT COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2, count(*) FROM temp_table GROUP BY CATEGORY1, CATEGORY2, DATETIMESTAMP, COL2, COL1 SORT BY COL2;
Thanks.
[Comments]:
-
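You could use two SELECT statements, one that counts rows per COL2 and one that counts rows per combination (down to CATEGORY2), and then use both in the main SELECT statement.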
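A minimal sketch of that two-SELECT idea, assuming the table and column names from the question. The aliases `c`, `t`, `cnt` and `total` are made up for illustration. One subquery counts rows per combination, the other counts rows per COL2, and the outer SELECT joins them on COL2 and divides:

```sql
-- Sketch only: aliases c, t, cnt and total are illustrative, not from the question.
SELECT c.COL1,
       c.COL2,
       c.DATETIMESTAMP,
       c.CATEGORY1,
       c.CATEGORY2,
       100.0 * c.cnt / t.total AS PERCENTAGE   -- share of this combination within its COL2
FROM (
    -- rows per (COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2) combination
    SELECT COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2, count(*) AS cnt
    FROM temp_table
    GROUP BY COL1, COL2, DATETIMESTAMP, CATEGORY1, CATEGORY2
) c
JOIN (
    -- total rows per COL2
    SELECT COL2, count(*) AS total
    FROM temp_table
    GROUP BY COL2
) t
ON c.COL2 = t.COL2;
```

With the sample data, COL2 = 1101 has 5 rows in total, so the Television combination comes out as 100.0 * 3 / 5 = 60. PERCENTAGE is a plain number here; if the literal "%" sign is wanted, the value could be cast to a string and passed through `concat`. The sketch has no ordering clause, so the row order is not guaranteed. On Hive versions with windowing support, the join can probably be replaced by `sum(cnt) OVER (PARTITION BY COL2)` applied to the first subquery.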