【Question title】: Word count using Hive
【Posted】: 2016-12-08 20:41:21
【Question】:

Suppose I have a table with the columns id and content:

id | content
________________________
1  | abc abr abc as abs
2  | abc arc cre arc
3  | agr ann agd agd agd 
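
To reproduce the problem, the sample rows above could be loaded roughly as follows. This is only a sketch: the name db.tbl is borrowed from the answer further down, and the column types (INT, STRING) are assumptions, since the question does not state them.

```sql
-- Hypothetical setup: db.tbl and the column types are assumptions.
CREATE DATABASE IF NOT EXISTS db;
CREATE TABLE IF NOT EXISTS db.tbl (id INT, content STRING);

-- INSERT ... VALUES requires Hive 0.14 or later.
INSERT INTO TABLE db.tbl VALUES
  (1, 'abc abr abc as abs'),
  (2, 'abc arc cre arc'),
  (3, 'agr ann agd agd agd');
```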

What I want is output like this:

{"abc":2,"abr":1,"as":1, "abs":1}  # for id 1
{"abc":1,"arc":2,"cre":1}          # for id 2
{"agr":1,"agd":3,"ann":1}          # for id 3

How can I do this with Hive?

【Comments】:

    Tags: count hive hdfs hql


    【Solution 1】:

    You need the Brickhouse library, which provides the collect UDAF used in the query. It is very easy to build.

    Query

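    -- Register the Brickhouse JAR and expose its collect UDAF as COLLECT
    ADD JAR /path/to/jar/brickhouse-0.7.1.jar;
    CREATE TEMPORARY FUNCTION COLLECT AS 'brickhouse.udf.collect.CollectUDAF';
    
    -- Outer query: fold each id's (word, count) pairs into one map
    SELECT id
      , COLLECT(words, c) AS count_map
    FROM (
      -- Middle query: count occurrences of each word per id
      SELECT id
        , words
        , COUNT(*) AS c
      FROM (
        -- Inner query: split content on spaces, one row per word
        SELECT id, words
        FROM db.tbl
        LATERAL VIEW EXPLODE(SPLIT(content, ' ')) exptbl AS words ) x
      GROUP BY id, words ) y
    GROUP BY id;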
    

    Output

    +----+---------------------------------+
    |id  |count_map                        |
    +----+---------------------------------+
    |1   |{"as":1,"abs":1,"abc":2,"abr":1} |
    +----+---------------------------------+
    |2   |{"cre":1,"arc":2,"abc":1}        |
    +----+---------------------------------+
    |3   |{"ann":1,"agr":1,"agd":3}        |
    +----+---------------------------------+
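
If adding the Brickhouse JAR is not an option, a similar result can probably be assembled from Hive built-ins alone. The sketch below is an alternative, not part of the original answer. It assumes Hive 0.13 or later (for COLLECT_LIST) and the same db.tbl(id, content) table. It joins each word:count pair into one string and parses that string back with STR_TO_MAP, so the resulting map is map<string,string> (counts come back as strings), and it would break if a word itself contained ',' or ':'.

```sql
-- Sketch only: per-id word-count map using Hive built-ins instead of Brickhouse.
-- Assumes Hive 0.13+ and that no word contains ',' or ':'.
SELECT id
  , STR_TO_MAP(
      CONCAT_WS(',', COLLECT_LIST(CONCAT(words, ':', CAST(c AS STRING)))),
      ',',   -- delimiter between pairs
      ':'    -- delimiter between key and value
    ) AS count_map
FROM (
  SELECT id
    , words
    , COUNT(*) AS c
  FROM db.tbl
  LATERAL VIEW EXPLODE(SPLIT(content, ' ')) exptbl AS words
  WHERE words <> ''   -- skip empty tokens from leading/trailing/double spaces
  GROUP BY id, words ) y
GROUP BY id;
```

The WHERE filter is an addition: SPLIT on a single space yields an empty string when content ends with a space, and that empty string would otherwise be counted as a word.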
    

    【Discussion】:
