HiveQL join query - NVL not working in where clause (answers)

【Question title】: HiveQL join query - NVL not working in where clause
【Posted】: 2017-10-06 18:26:27
【Question】:

I have a HiveQL query that looks like this:

create table JOINED as select TABLEA.* from TABLEA join TABLEB on
TABLEA.key=TABLEB.key where nvl(TABLEA.attr, 0)=nvl(TABLEB.attr, 0);

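However, this query does not select rows where TABLEA.key=TABLEB.key and any of the following holds:

TABLEA.attr is NULL and TABLEB.attr is NULL, or
TABLEA.attr=0 and TABLEB.attr is NULL, or
TABLEA.attr is NULL and TABLEB.attr=0.

None of these cases is selected. Why does this happen? Am I misunderstanding how NVL() is used?

I want attr to default to 0 whenever it is NULL. What is the correct query?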

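【Comments】:

Have you tried using COALESCE?
What is the data type of the ATTR column?
Yes, I tried COALESCE as well. It didn't help.
The data type is BIGINT.
If you run a select on just one of the tables, what do nvl and coalesce return for the rows where you believe attr is NULL?

Tags: null hive hiveql nvl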

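【Answer 1】:

Thanks, I have just reported a bug:
Incorrect results for INNER JOIN ON clause / WHERE involving NVL / COALESCE

If you check the execution plan, you will see that an incorrect predicate, attr is not null, is applied to both tables. That filter discards the NULL rows before NVL is ever evaluated.
Selecting columns from both tables (e.g. select TABLEA.*,TABLEB.key) seems to avoid the problem.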

explain
select TABLEA.* from TABLEA join TABLEB on
TABLEA.key=TABLEB.key where nvl(TABLEA.attr, 0)=nvl(TABLEB.attr, 0);

STAGE DEPENDENCIES:
  Stage-4 is a root stage
  Stage-3 depends on stages: Stage-4
  Stage-0 depends on stages: Stage-3

STAGE PLANS:
  Stage: Stage-4
    Map Reduce Local Work
      Alias -> Map Local Tables:
        $hdt$_0:tablea 
          Fetch Operator
            limit: -1
      Alias -> Map Local Operator Tree:
        $hdt$_0:tablea 
          TableScan
            alias: tablea
            Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (key is not null and attr is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: key (type: int), attr (type: int)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
                HashTable Sink Operator
                  keys:
                    0 _col0 (type: int), NVL(_col1,0) (type: int)
                    1 _col0 (type: int), NVL(_col1,0) (type: int)

  Stage: Stage-3
    Map Reduce
      Map Operator Tree:
          TableScan
            alias: tableb
            Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
            Filter Operator
              predicate: (key is not null and attr is not null) (type: boolean)
              Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
              Select Operator
                expressions: key (type: int), attr (type: int)
                outputColumnNames: _col0, _col1
                Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats: NONE
                Map Join Operator
                  condition map:
                       Inner Join 0 to 1
                  keys:
                    0 _col0 (type: int), NVL(_col1,0) (type: int)
                    1 _col0 (type: int), NVL(_col1,0) (type: int)
                  outputColumnNames: _col0, _col1
                  Statistics: Num rows: 1 Data size: 15 Basic stats: COMPLETE Column stats: NONE
                  File Output Operator
                    compressed: false
                    Statistics: Num rows: 1 Data size: 15 Basic stats: COMPLETE Column stats: NONE
                    table:
                        input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                        output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                        serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      Local Work:
        Map Reduce Local Work

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
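To make the effect of that extra predicate concrete, here is a minimal sketch that reproduces it outside Hive. It is an illustration only and rests on several assumptions: it uses Python's built-in sqlite3 module as a stand-in engine, COALESCE in place of NVL (SQLite has no NVL), and made-up sample rows for keys 1 to 5. The first query is the one the asker intended; the second adds the "attr is not null" filters by hand to mimic what the plan above does on both table scans.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE TABLEA (key INTEGER, attr INTEGER)")
cur.execute("CREATE TABLE TABLEB (key INTEGER, attr INTEGER)")

# Hypothetical sample data (not from the original post):
# key 1: NULL / NULL, key 2: 0 / NULL, key 3: NULL / 0,
# key 4: 5 / 5 (control), key 5: 5 / NULL (should never match)
cur.executemany("INSERT INTO TABLEA VALUES (?, ?)",
                [(1, None), (2, 0), (3, None), (4, 5), (5, 5)])
cur.executemany("INSERT INTO TABLEB VALUES (?, ?)",
                [(1, None), (2, None), (3, 0), (4, 5), (5, None)])

# The query as the asker intended it, with COALESCE standing in for NVL.
INTENDED = """
SELECT TABLEA.key FROM TABLEA JOIN TABLEB ON TABLEA.key = TABLEB.key
WHERE COALESCE(TABLEA.attr, 0) = COALESCE(TABLEB.attr, 0)
ORDER BY TABLEA.key
"""

# Same query with the "attr is not null" filters from the plan added by hand.
WITH_EXTRA_FILTER = """
SELECT TABLEA.key FROM TABLEA JOIN TABLEB ON TABLEA.key = TABLEB.key
WHERE COALESCE(TABLEA.attr, 0) = COALESCE(TABLEB.attr, 0)
  AND TABLEA.attr IS NOT NULL AND TABLEB.attr IS NOT NULL
ORDER BY TABLEA.key
"""

intended_keys = [row[0] for row in cur.execute(INTENDED)]
filtered_keys = [row[0] for row in cur.execute(WITH_EXTRA_FILTER)]

print(intended_keys)  # [1, 2, 3, 4]
print(filtered_keys)  # [4]
```

Under these assumptions, keys 1 to 3 are exactly the three cases from the question: they match in the intended query and disappear once the not-null filters are applied, which is consistent with the behaviour the asker reported.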

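【Comments】:

Thanks for the reply. However, the workaround of selecting columns from both tables does not seem to fix the problem in my case. Thanks a lot anyway!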