【Question title】: ORC Split Generation issue with Hive Table
【Posted】: 2022-12-23 14:38:44
【Question description】:

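I am using Hive 3.1.3 on Hadoop 3.3.4 with Tez 0.9.2. When I create an ORC table that contains splits and try to query it, I get an "ORC split generation failed" exception. Concatenating the table resolves the problem in some cases, but in others the problem persists.

First I create the table like this, then try to query it: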

CREATE TABLE ClaimsOrc STORED AS ORC
AS
SELECT *
FROM ClaimsImport;

SELECT COUNT(*) FROM ClaimsOrc WHERE ClaimID LIKE '%8%';

I then get the following exception:

Vertex failed, vertexName=Map 1, vertexId=vertex_1667735849290_0008_6_00, diagnostics=[Vertex vertex_1667735849290_0008_6_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: claimsorc initializer failed, vertex=vertex_1667735849290_0008_6_00 [Map 1], java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.addSplitsForGroup(HiveInputFormat.java:519)
        at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:765)
        at org.apache.hadoop.hive.ql.exec.tez.HiveSplitGenerator.initialize(HiveSplitGenerator.java:243)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:278)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable$1.run(RootInputInitializerManager.java:269)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:269)
        at org.apache.tez.dag.app.dag.RootInputInitializerManager$InputInitializerCallable.call(RootInputInitializerManager.java:253)
        at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108)
        at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41)
        at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:750)
Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
        at java.util.concurrent.FutureTask.report(FutureTask.java:122)
        at java.util.concurrent.FutureTask.get(FutureTask.java:192)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1790)
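
A note on reading this trace: a `NoSuchMethodError` is raised when the calling code was compiled against a method that the class actually loaded at runtime does not have. The descriptor `(Lorg/apache/hadoop/fs/FileStatus;)I` denotes a `compareTo` that takes a `FileStatus` and returns an `int`. One way to check what a given classpath provides is a small reflection probe. The sketch below is only an illustration: the class name `MethodProbe` and its helpers are hypothetical, and it defaults to a JDK class so that it runs without Hadoop on the classpath.

```java
import java.security.CodeSource;

public class MethodProbe {
    /** True if cls exposes a public method with this name and exactly these parameter types. */
    public static boolean hasPublicMethod(Class<?> cls, String name, Class<?>... params) {
        try {
            cls.getMethod(name, params);
            return true;
        } catch (NoSuchMethodException e) {
            return false;
        }
    }

    /** Location (usually a jar) the class was loaded from, or a placeholder if unknown. */
    public static String loadedFrom(Class<?> cls) {
        CodeSource src = cls.getProtectionDomain().getCodeSource();
        if (src == null || src.getLocation() == null) {
            return "(bootstrap or unknown)";
        }
        return src.getLocation().toString();
    }

    public static void main(String[] args) throws Exception {
        // Hypothetical usage: pass org.apache.hadoop.fs.FileStatus as the first argument
        // when the Hadoop jars are on the classpath. The default is a JDK class.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        Class<?> cls = Class.forName(className);

        System.out.println(className + " loaded from: " + loadedFrom(cls));
        // The variant named in the error: compareTo taking the class's own type.
        System.out.println("compareTo(" + cls.getSimpleName() + "): "
                + hasPublicMethod(cls, "compareTo", cls));
        // The untyped variant: compareTo taking java.lang.Object.
        System.out.println("compareTo(Object): "
                + hasPublicMethod(cls, "compareTo", Object.class));
    }
}
```

Assuming the compiled class is in the current directory, it could be run as `java -cp ".:$(hadoop classpath)" MethodProbe org.apache.hadoop.fs.FileStatus`. That only inspects the client-side classpath. The frames above come from `org.apache.tez.dag.app`, so the jars that matter are the ones visible to the Tez application master, which may differ.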

However, if I first concatenate the table, which combines the output files into a smaller number of files, the table works fine:

ALTER TABLE ClaimsOrc CONCATENATE;
OK
Time taken: 11.673 seconds

SELECT COUNT(*) FROM ClaimsOrc WHERE ClaimID LIKE '%8%';
OK
1463419
Time taken: 7.446 seconds, Fetched: 1 row(s)

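It seems something is wrong with how the initial CTAS query computes the splits, and CONCATENATE fixes it in some cases. In other cases it does not, and there is no workaround. How can I fix this?

Some other things worth noting:

  • DESCRIBE EXTENDED ClaimsOrc; shows that ClaimsOrc is an ORC table.
  • The source table ClaimsImport contains about 24 gzipped, pipe-delimited files.
  • Before CONCATENATE, the ClaimsOrc table contains about 24 files.
  • After CONCATENATE, the ClaimsOrc table contains only 3 file splits.
  • Before the CONCATENATE command, the ORC files appear to be valid. Using the orcfiledump command, I saw no errors in the few files I spot-checked.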

【Comments】:

    Tags: hadoop hive orc apache-tez


    【Answer 1】:

    I am facing the same issue when running COUNT(*) on an ORC table. Please advise.

    Error: Error while processing statement: FAILED: Execution Error, return code 2 from org.apache.hadoop.hive.ql.exec.tez.TezTask. Vertex failed, vertexName=Map 1, vertexId=vertex_1670915386694_0168_1_00, diagnostics=[Vertex vertex_1670915386694_0168_1_00 [Map 1] killed/failed due to:ROOT_INPUT_INIT_FAILURE, Vertex Input: jio_ar_consumer_events initializer failed, vertex=vertex_1670915386694_0168_1_00 [Map 1], java.lang.RuntimeException: ORC split generation failed with exception: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.generateSplitsInfo(OrcInputFormat.java:1851)
        at org.apache.hadoop.hive.ql.io.orc.OrcInputFormat.getSplits(OrcInputFormat.java:1939)
    
    Caused by: java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError: org.apache.hadoop.fs.FileStatus.compareTo(Lorg/apache/hadoop/fs/FileStatus;)I
    

    【Comments】:
