【Title】: Hive query to map 3 array columns position-wise
【Posted】: 2020-04-16 23:39:05
【Question】:
Input:

c1                        c2                        c3
[[1,2,3],[4],[5,6]]       ['v1','v2','v3']          [['sam'], ['tam'], ['bam']] 

Output:

c1                        c2                        c3
[1,2,3]                   'v1'                      ['sam']
[4]                       'v2'                      ['tam']
[5,6]                     'v3'                      ['bam']

Can someone suggest how to write a query for the problem above?

    Tags: arrays hadoop hive hiveql explode


    【Solution 1】:

    Use posexplode(), which returns each array element together with its position:

    with your_data as (
    select array(array(1,2,3),array(4),array(5,6)) c1, array('v1','v2','v3') c2, array(array('sam'), array('tam'), array('bam')) c3
    --returns [[1,2,3],[4],[5,6]]  ["v1","v2","v3"]  [["sam"],["tam"],["bam"]]
    )
    
    select a1.c1, a2.c2, a3.c3
      from your_data d 
           lateral view posexplode(d.c1) a1 as p1, c1
           lateral view posexplode(d.c2) a2 as p2, c2
           lateral view posexplode(d.c3) a3 as p3, c3
     where a1.p1=a2.p2 and a1.p1=a3.p3 --match positions in the exploded arrays
     --without this where condition
     --the lateral views produce a cartesian product
     --alternatively, explode each array in its own subquery and join the subqueries
     --on position; that also allows a left join, not only an inner join
     ;
    

    Result:

    OK
    c1      c2      c3
    [1,2,3] v1      ["sam"]
    [4]     v2      ["tam"]
    [5,6]   v3      ["bam"]
    Time taken: 0.078 seconds, Fetched: 3 row(s)
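
The comment in the query above mentions exploding each array in its own subquery and joining on position. A minimal sketch of that variant follows, assuming the same single-row sample but with c2 deliberately shortened to two elements to show what the left join changes. The CTE names e1, e2, e3 and the column aliases p and v are illustrative, not from the original answer. For a real table with many rows, each subquery would also need to carry a row key (for example a hypothetical id column) and join on it as well as on the position.

```sql
with your_data as (
    -- c2 has only two elements here (an assumption for illustration)
    select array(array(1,2,3), array(4), array(5,6)) c1,
           array('v1','v2') c2,
           array(array('sam'), array('tam'), array('bam')) c3
),
-- one subquery per array: position p plus the element at that position
e1 as (select a.p, a.v as c1 from your_data d lateral view posexplode(d.c1) a as p, v),
e2 as (select a.p, a.v as c2 from your_data d lateral view posexplode(d.c2) a as p, v),
e3 as (select a.p, a.v as c3 from your_data d lateral view posexplode(d.c3) a as p, v)

select e1.c1, e2.c2, e3.c3
  from e1
       left join e2 on e1.p = e2.p  -- keeps every c1 element even if c2 is shorter
       left join e3 on e1.p = e3.p
;
```

With an inner join (or the where clause above), the position that has no c2 element would be dropped; with the left join it is kept and c2 is NULL for that row.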
    

    Simplified version, thanks to @GrzegorzSkibinski for the suggestion: explode only c1 and index c2 and c3 by its position:

    with your_data as (
    select array(array(1,2,3),array(4),array(5,6)) c1, array('v1','v2','v3') c2, array(array('sam'), array('tam'), array('bam')) c3
    --returns [[1,2,3],[4],[5,6]]  ["v1","v2","v3"]  [["sam"],["tam"],["bam"]]
    )
    
    select a1.c1, d.c2[a1.p1] as c2, d.c3[a1.p1] as c3
      from your_data d 
           lateral view posexplode(d.c1) a1 as p1, c1
     ;
    

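    【Comments】:

    • Thinking out loud: if you posexplode only the first column and then use its position to index the other columns, e.g. d.c2[p1] as c2, you can drop the whole where clause ;)
    • @GrzegorzSkibinski That makes sense. Even better: there is no need to explode c2 and c3 at all.
    【Solution 2】:

    Use explode: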

    select explode(c1) as c1 from tab;
    

    Or, if your use case is more complex, use it together with lateral view:

    select
        c1_exploded,
        a,b,c
    from
        tab t
    lateral view explode(t.c1) tf as c1_exploded
    ;
    

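    Reference: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

    【Comments】:

    • Using explode directly will not work, because "Only a single expression in the SELECT clause is supported with UDTF's". Using lateral view does a cross-join kind of thing, so it will not be a one-to-one mapping. With your second approach I would get 3^3 rows instead of 3.
    • Oh, sorry, I did not realize that c1, c2 and c3 are different columns of the same table. I thought they were 3 different tables.
    • In that case posexplode is indeed the best option.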
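
The row count mentioned in the comments can be checked with a small sketch, assuming the same sample data as Solution 1. The aliases x1, x2 and x3 are illustrative. Each lateral view multiplies the rows produced so far by the length of its array, so three arrays of three elements each give 3 × 3 × 3 combinations when nothing restricts the positions.

```sql
with your_data as (
    select array(array(1,2,3), array(4), array(5,6)) c1,
           array('v1','v2','v3') c2,
           array(array('sam'), array('tam'), array('bam')) c3
)
-- no position filter: every element of c1 is paired with
-- every element of c2 and every element of c3
select count(*) as row_cnt
  from your_data d
       lateral view explode(d.c1) a1 as x1
       lateral view explode(d.c2) a2 as x2
       lateral view explode(d.c3) a3 as x3
;
```

Switching to posexplode and filtering on equal positions, as in Solution 1, reduces this to one row per position.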