【Question title】:Hive - Create map columns type by aggregating values across groups
【Posted】:2018-01-07 14:09:27
【Question】:

I have a table that looks like this:

|customer|category|room|date|
-----------------------------
|1       |   A    | aa | d1 |
|1       |   A    | bb | d2 |
|1       |   B    | cc | d3 |
|1       |   C    | aa | d1 |
|1       |   C    | bb | d2 |
|2       |   A    | aa | d3 |
|2       |   A    | bb | d4 |
|2       |   C    | bb | d4 |
|2       |   C    | ee | d5 |
|3       |   D    | ee | d6 |

I want to create two maps from the table:

1st. map_customer_room_date: grouped by customer, it collects all distinct rooms (key) and dates (value).

I am using the collect() UDF from Brickhouse.

This can be achieved with something like:

select customer, collect(room,date) as map_customer_room_date
from table
group by customer

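2nd. map_category_room_date is a bit more complex. It has the same map type, collect(room, date), but its keys are all the rooms in every category that customer X belongs to. This means that customer 1 picks up room ee even though that room belongs to customer 2, because customer 1 has category C and that category also appears for customer 2.

The final table, grouped by customer, would look like this: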

|customer| map_customer_room_date  |     map_category_room_date    |
-------------------------------------------------------------------|
|   1    |{aa: d1, bb: d2, cc: d3} |{aa: d1, bb: d2, cc: d3,ee: d6}|
|   2    |{aa: d3, bb: d4, ee: d6} |{aa: d3, bb: d4, ee: d6}       |
|   3    |{ee: d6}                 |{ee: d6}                       |  

I am having trouble building the second map and producing the final table. Any idea how to achieve this?

【Comments】:

    Tags: sql hadoop types hive collect


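    【Solution 1】:

    This can be done with a series of self-joins that find the other rooms in the same category before combining the results into the 2 maps.

    Code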

    CREATE TABLE `table` AS
    SELECT 1 AS customer, 'A' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
    SELECT 1 AS customer, 'A' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
    SELECT 1 AS customer, 'B' AS category, 'cc' AS room, 'd3' AS `date` UNION ALL
    SELECT 1 AS customer, 'C' AS category, 'aa' AS room, 'd1' AS `date` UNION ALL
    SELECT 1 AS customer, 'C' AS category, 'bb' AS room, 'd2' AS `date` UNION ALL
    SELECT 2 AS customer, 'A' AS category, 'aa' AS room, 'd3' AS `date` UNION ALL
    SELECT 2 AS customer, 'A' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
    SELECT 2 AS customer, 'C' AS category, 'bb' AS room, 'd4' AS `date` UNION ALL
    SELECT 2 AS customer, 'C' AS category, 'ee' AS room, 'd5' AS `date` UNION ALL
    SELECT 3 AS customer, 'D' AS category, 'ee' AS room, 'd6' AS `date`
    ;
    
    
    SELECT
        customer_rooms.customer,
        collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
        collect(
            COALESCE(customer_category_rooms.room, category_rooms.room),
            COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
    FROM `table` AS customer_rooms
    JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
    LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
    AND category_rooms.category = customer_category_rooms.category
    AND category_rooms.room = customer_category_rooms.room
    WHERE (
        customer_rooms.customer = customer_category_rooms.customer AND
        customer_rooms.category = customer_category_rooms.category AND
        customer_rooms.room = customer_category_rooms.room AND
        customer_rooms.date = customer_category_rooms.date
    )
    OR (
        customer_category_rooms.customer IS NULL AND
        customer_category_rooms.category IS NULL AND
        customer_category_rooms.room IS NULL AND
        customer_category_rooms.date IS NULL
    )
    GROUP BY
        customer_rooms.customer
    ;
    

    Result set

    1   {"aa":"d1","bb":"d2","cc":"d3"} {"aa":"d1","bb":"d2","cc":"d3","ee":"d5"}
    2   {"aa":"d3","bb":"d4","ee":"d5"} {"aa":"d3","bb":"d4","ee":"d5"}
    3   {"ee":"d6"} {"ee":"d6"}
    

    Explanation

    FROM `table` AS customer_rooms
    

    First, pull the rows from the original `table`. We name this relation customer_rooms. As you already noted in the question, this is enough to build map_customer_room_date.

    JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
    

    The first self-join identifies all rooms that have the same category as the room explicitly mentioned in the customer_rooms row. We name this relation category_rooms.
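
    To see what this join produces on its own, a query along these lines may help. It is a sketch for inspection only, not part of the original answer, and it assumes the sample `table` created above:

    ```sql
    -- List, for each customer, every room that appears in any category the customer has.
    SELECT DISTINCT
        customer_rooms.customer,
        category_rooms.category,
        category_rooms.room
    FROM `table` AS customer_rooms
    JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
    ;
    ```

    On the sample data, customer 1 should pick up room ee through category C here, even though only customer 2 holds it.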

    LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
    AND category_rooms.category = customer_category_rooms.category
    AND category_rooms.room = customer_category_rooms.room
    

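    The second self-join takes the rooms identified in category_rooms and checks whether each one is already held by the customer identified in customer_rooms. We name this relation customer_category_rooms. It is a LEFT OUTER JOIN because we want to keep every row from the previous join. The result is either 1) customer_rooms and customer_category_rooms hold the same values, because the customer already holds this room, or 2) the values in customer_category_rooms are all NULL, because the customer does not hold this room but it is a room in the same category. This distinction matters later, so that we can keep the customer's own date when they already hold the room.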
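
    The two outcomes can be made visible with a flag column. The query below is only an illustrative sketch: the already_held alias is a hypothetical name introduced here, and the sample `table` from above is assumed.

    ```sql
    -- For each customer and each room in one of their categories,
    -- show whether the LEFT OUTER JOIN found a matching row for that customer.
    SELECT DISTINCT
        customer_rooms.customer,
        category_rooms.category,
        category_rooms.room,
        customer_category_rooms.room IS NOT NULL AS already_held  -- hypothetical alias
    FROM `table` AS customer_rooms
    JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
    LEFT OUTER JOIN `table` AS customer_category_rooms ON customer_rooms.customer = customer_category_rooms.customer
    AND category_rooms.category = customer_category_rooms.category
    AND category_rooms.room = customer_category_rooms.room
    ;
    ```

    For customer 1, category C, room ee, the flag should be false, which is case 2) above.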

    Next, we need to filter.

    WHERE (
        customer_rooms.customer = customer_category_rooms.customer AND
        customer_rooms.category = customer_category_rooms.category AND
        customer_rooms.room = customer_category_rooms.room AND
        customer_rooms.date = customer_category_rooms.date
    )
    

    This keeps the rooms that the customer explicitly holds in the original `table`.

    OR (
        customer_category_rooms.customer IS NULL AND
        customer_category_rooms.category IS NULL AND
        customer_category_rooms.room IS NULL AND
        customer_category_rooms.date IS NULL
    )
    

    This keeps the rooms that the customer does not hold but that belong to the same category as a room the customer does hold.

        collect(customer_rooms.room, customer_rooms.date) AS map_customer_room_date,
    

    map_customer_room_date can be built by collecting the original data from the table, which we aliased as customer_rooms.

        collect(
            COALESCE(customer_category_rooms.room, category_rooms.room),
            COALESCE(customer_category_rooms.date, category_rooms.date)) AS map_category_room_date
    

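    Building map_category_room_date is more complex. If the customer explicitly holds a room, we want to keep that date. If the customer does not explicitly hold the room, we want to use the room and date from another row with an overlapping category. To do this, we use the Hive COALESCE function, which returns the first value that is not NULL. If the customer already holds the room (indicated by non-NULL values in customer_category_rooms), we use those values. Otherwise, we use the values from category_rooms.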
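
    As a small standalone illustration of that behavior (the literals are placeholders taken from the sample data, and the column aliases are hypothetical):

    ```sql
    -- COALESCE returns its first non-NULL argument.
    SELECT
        COALESCE('d2', 'd5') AS held_case,      -- customer's own date is present, so it is used
        COALESCE(NULL, 'd5') AS not_held_case   -- no own date, so the category row's date is used
    ;
    ```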

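    Note that some ambiguity may remain if the same category/room combination can map to more than one date value. If that matters, you may need to do more work to pick the right date according to some business rule (for example, using the earliest date), or map to multiple date values instead of a single one. If there are additional requirements like that, this should give you a good starting point.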
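
    One possible way to apply an "earliest date" rule is sketched below. This is a variation, not the query from the answer above: it builds only map_category_room_date, it reduces each customer/room pair to a single date before collecting, and it assumes the date values sort chronologically when compared (as the sample strings d1 to d6 do). The names per_room, own_date and any_date are hypothetical.

    ```sql
    SELECT
        customer,
        -- Prefer the customer's own date; fall back to the earliest date in the category.
        collect(room, COALESCE(own_date, any_date)) AS map_category_room_date
    FROM (
        SELECT
            customer_rooms.customer AS customer,
            category_rooms.room AS room,
            -- Earliest date on rows that belong to this customer (NULL if the room is not held).
            MIN(CASE WHEN category_rooms.customer = customer_rooms.customer
                     THEN category_rooms.`date` END) AS own_date,
            -- Earliest date for this room across all rows sharing a category with the customer.
            MIN(category_rooms.`date`) AS any_date
        FROM `table` AS customer_rooms
        JOIN `table` AS category_rooms ON customer_rooms.category = category_rooms.category
        GROUP BY customer_rooms.customer, category_rooms.room
    ) AS per_room
    GROUP BY customer
    ;
    ```

    Because every customer/room pair reaches collect exactly once, the date chosen for each key no longer depends on the order in which rows are aggregated.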

    【Comments】:
