【Question Title】: Hive: Cannot INSERT OVERWRITE TABLE from an unpartitioned external table into a new partitioned table
【Posted】: 2016-08-19 07:49:05
【Question】:

In summary, this is what I did:

raw data -> SELECT and save the filtered data in HDFS -> create an external table over the files saved in HDFS -> populate an empty table from the external table.

Looking at the exception, it seems to be related to the OUTPUT format of the two tables.

Details

1) I have a table `table_log` with a lot of data (in database A), structured as follows (with 3 partition columns):

CREATE TABLE `table_log`(
  `e_id` string, 
  `member_id` string, 
  .
  .
PARTITIONED BY ( 
  `dt` string, 
  `service_type` string, 
  `event_type` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
  COLLECTION ITEMS TERMINATED BY '\u0002' 
  MAP KEYS TERMINATED BY '\u0003' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

2) I filtered the data by (dt, service_type, event_type) and saved the result in HDFS like this:

INSERT OVERWRITE DIRECTORY '/user/atscale/filterd-ratlog'
SELECT * FROM table_log
WHERE dt >= '2016-05-01' AND dt <= '2016-05-31'
  AND service_type='xxxx_jp' AND event_type='vv';
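
A detail worth noting about this step: `INSERT OVERWRITE DIRECTORY` writes plain Ctrl-A ('\u0001') delimited text files by default, which is why the external table in the next step is declared as delimited text. On Hive 0.11 or later the output row format can also be spelled out explicitly; a sketch:

```sql
-- Sketch: making the export format explicit (Hive 0.11+).
-- By default INSERT OVERWRITE DIRECTORY emits '\u0001'-delimited text anyway.
INSERT OVERWRITE DIRECTORY '/user/atscale/filterd-ratlog'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0001'
SELECT * FROM table_log
WHERE dt BETWEEN '2016-05-01' AND '2016-05-31'
  AND service_type = 'xxxx_jp'
  AND event_type = 'vv';
```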

3) Then I created an external table (`table_log_filtered_ext`) over the result above (in database B). Note that this table has no partitions.

DROP TABLE IF EXISTS table_log_filtered_ext;
CREATE EXTERNAL TABLE `table_log_filtered_ext`(
  `e_id` string, 
  `member_id` string, 
  .
  .
  dt string,
  service_type string,
  event_type string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
  COLLECTION ITEMS TERMINATED BY '\u0002' 
  MAP KEYS TERMINATED BY '\u0003'
LOCATION '/user/atscale/filterd-ratlog'
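
A quick way to see which SerDe and formats a table actually got is `DESCRIBE FORMATTED`. Since this DDL has no `STORED AS` clause, the external table defaults to TextFile; the lines below sketch the expected output, not a capture from the actual cluster:

```sql
DESCRIBE FORMATTED table_log_filtered_ext;
-- Expected storage section for a delimited text table:
--   SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
--   InputFormat:   org.apache.hadoop.mapred.TextInputFormat
--   OutputFormat:  org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
```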

4) I created another new table (`table_log_filtered`) with the same structure as `table_log` (with 3 partition columns):

CREATE TABLE `table_log_filtered` (
  `e_id` string, 
  `member_id` string, 
  .
  .
PARTITIONED BY ( 
  `dt` string, 
  `service_type` string, 
  `event_type` string)
ROW FORMAT DELIMITED 
  FIELDS TERMINATED BY '\u0001' 
  COLLECTION ITEMS TERMINATED BY '\u0002' 
  MAP KEYS TERMINATED BY '\u0003' 
STORED AS INPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat' 
OUTPUTFORMAT 
  'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat'

5) Now I want to populate the `table_log_filtered` table (with the same 3 partition columns as `table_log`) from the data in the external table `table_log_filtered_ext`:

SET hive.exec.dynamic.partition.mode=nonstrict;
SET hive.execution.engine=tez; 

INSERT OVERWRITE TABLE table_log_filtered PARTITION(dt, service_type, event_type) 
SELECT * FROM table_log_filtered_ext;
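
One thing dynamic partitioning relies on here: the partition values are taken positionally from the last columns of the SELECT, in the order listed in the PARTITION clause. `SELECT *` works only because `dt`, `service_type` and `event_type` were declared last in the external table; an explicit column list (sketched below, with the data columns elided as in the DDL above) is more robust. On older Hive versions `hive.exec.dynamic.partition=true` may also need to be set:

```sql
SET hive.exec.dynamic.partition=true;           -- default true on recent versions
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns must come last, matching the PARTITION clause order:
INSERT OVERWRITE TABLE table_log_filtered PARTITION (dt, service_type, event_type)
SELECT e_id, member_id, -- ... remaining data columns ...
       dt, service_type, event_type
FROM table_log_filtered_ext;
```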

But I got this `java.lang.ClassCastException`. Looking at the exception, it is related to the OUTPUT format between the two tables. Any hints?

java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":
.
.
.
      at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:173)
      at org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:139)
      at org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:344)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:181)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable$1.run(TezTaskRunner.java:172)
      at java.security.AccessController.doPrivileged(Native Method)
      at javax.security.auth.Subject.doAs(Subject.java:422)
      at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:172)
      at org.apache.tez.runtime.task.TezTaskRunner$TaskRunnerCallable.callInternal(TezTaskRunner.java:168)
      at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)
      at java.util.concurrent.FutureTask.run(FutureTask.java:266)
      at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      at java.lang.Thread.run(Thread.java:745)
    Caused by: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0
      at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:370)
      at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource.pushRecord(ReduceRecordSource.java:292)
      ... 16 more
    Caused by: java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.hive.ql.io.orc.OrcSerde$OrcSerdeRow
      at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:81)
      at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
      at org.apache.hadoop.hive.ql.exec.LimitOperator.process(LimitOperator.java:54)
      at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
      at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
      at org.apache.hadoop.hive.ql.exec.tez.ReduceRecordSource$GroupIterator.next(ReduceRecordSource.java:361)
      ... 17 more
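
The last `Caused by` frame pinpoints the actual mismatch: `ROW FORMAT DELIMITED` makes Hive attach `LazySimpleSerDe`, which serializes each row as `Text`, while the declared `OrcOutputFormat` accepts only `OrcSerdeRow` objects. Running `DESCRIBE FORMATTED` on the target table would surface the inconsistent combination (expected output sketched, not captured):

```sql
DESCRIBE FORMATTED table_log_filtered;
-- The inconsistent trio behind the ClassCastException:
--   SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe  <-- writes Text
--   InputFormat:   org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
--   OutputFormat:  org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat    <-- expects OrcSerdeRow
```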

【Comments】:

  • Could you please try SET hive.execution.engine=mr; and check whether the error still occurs?
  • Yes.. the stack trace with MR is shorter, but it looks like the same error. I think this is related to the OUTPUT format between the two tables?
  • ORC is a sophisticated columnar format, and the CREATE script should simply say STORED AS ORC. The ROW FORMAT DELIMITED clause makes no sense at all (it applies only to text-based formats such as TextFile and SequenceFile), and only a masochist would spell out the INPUTFORMAT and OUTPUTFORMAT clauses when the SerDe is completely defined by its alias.

Tags: hadoop hive hdfs hiveql


【Solution 1】:

In case anyone else runs into this issue, the fix is, as @Samson Scharfrichter mentioned, to specify STORED AS ORC for `table_log_filtered`:

CREATE TABLE `table_log_filtered` (
  `e_id` string, 
  `member_id` string, 
  .
  .
PARTITIONED BY ( 
  `dt` string, 
  `service_type` string, 
  `event_type` string)
STORED AS ORC
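
`STORED AS ORC` is shorthand that sets the SerDe, InputFormat and OutputFormat as one consistent trio. A sketch of the equivalent long form, which shows why the original DDL (LazySimpleSerDe implied by `ROW FORMAT DELIMITED`, combined with the ORC output format) was self-contradictory:

```sql
-- Sketch: what STORED AS ORC expands to (columns elided as above).
CREATE TABLE `table_log_filtered` (
  `e_id` string,
  `member_id` string
  -- ...
)
PARTITIONED BY (`dt` string, `service_type` string, `event_type` string)
ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.orc.OrcSerde'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat';
```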

【Discussion】:

  • Pleeeease, also remove the ROW FORMAT DELIMITED part, which makes no sense for ORC. At best it is ignored completely; at worst it can have nasty side effects, as you experienced on your first attempt with OUTPUTFORMAT.