Hive: Create Table and Partition By (Answers)

【Question Title】: Hive: Create Table and Partition By
【Posted】: 2012-12-11 06:51:51
【Question Description】:

I have a table that I created and loaded with data as follows:

create table xyzlogTable (
  dateC string,
  hours string,
  minutes string,
  seconds string,
  TimeTaken string,
  Method string,
  UriQuery string,
  ProtocolStatus string
)
row format serde 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
with serdeproperties (
  "input.regex" = "(\\S+)\\t(\\d+):(\\d+):(\\d+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)",
  "output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s"
)
stored as textfile;

load data local inpath '/home/hadoop/hive/xyxlogData/' into table xyzlogTable;

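The table turned out to hold more than 3 million rows. Some queries work fine, while others appear to run forever.

After seeing that select and group by queries take a very long time, and sometimes never return a result, I decided to partition the table.

But both of the following statements fail: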

create table xyzlogTable (datenonQuery string , hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string) partitioned by (dateC string);

FAILED: Error in metadata: AlreadyExistsException(message:Table xyzlogTable already exists)
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

Alter table xyzlogTable (datenonQuery string , hours string, minutes string, seconds string, TimeTaken string, Method string, UriQuery string, ProtocolStatus string) partitioned by (dateC string);

FAILED: Parse Error: line 1:12 cannot identify input 'xyzlogTable' in alter table statement

Any idea what the issue is?

【Question Comments】:

Tags: hadoop hive

【Solution 1】:

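This is exactly why I prefer to use external tables in Hive. The table you created is not external (you used create table rather than create external table). For a non-external table, dropping the table deletes both the metadata (table name, column names, types, and so on) and the table's data in HDFS. By contrast, dropping an external table deletes only the metadata; the data in HDFS is left in place.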
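As a rough illustration of that difference, the sketch below declares the same columns and SerDe as the question's table, but as an external table. The table name xyzlogTable_ext and the HDFS directory in LOCATION are hypothetical; substitute wherever your log files actually live.

```sql
-- Sketch only: xyzlogTable_ext and the LOCATION path are assumed names.
CREATE EXTERNAL TABLE xyzlogTable_ext (
  dateC STRING,
  hours STRING,
  minutes STRING,
  seconds STRING,
  TimeTaken STRING,
  Method STRING,
  UriQuery STRING,
  ProtocolStatus STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "(\\S+)\\t(\\d+):(\\d+):(\\d+)\\t(\\S+)\\t(\\S+)\\t(\\S+)\\t(\\S+)"
)
STORED AS TEXTFILE
LOCATION '/user/hadoop/xyzlog/';  -- hypothetical HDFS directory holding the log files

-- Dropping an external table removes only its metadata;
-- the files under LOCATION stay in HDFS.
DROP TABLE xyzlogTable_ext;
```

Because the files survive the drop, the table definition can be recreated later (for example with a different layout) without re-importing the data.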

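You have a few options:

If the import is expensive and the data is not yet partitioned: keep this table, but create a new table, say xyzlogTable_partitioned, that is a partitioned version of it. You can use Dynamic Partitioning in Hive to populate the new table.
If the import is expensive but the data is already partitioned (for example, you already have the data in a separate file per partition in HDFS): create a new partitioned table and use a bash script (or equivalent) to move the files (or copy and then delete, if you want to be conservative) from the HDFS directory of the unpartitioned table into the directories of the matching partitions of the new table.
If the import is cheap: drop the whole table, recreate it as a partitioned table, and re-import. Often, when the import process is not aware of the partitioning scheme (in other words, when it cannot push data directly into the right partitions), a common pattern is to keep an unpartitioned table (like the one you already have) as a staging table, and then use Hive queries or dynamic partitioning to populate a new partitioned table that later queries in the workflow will use.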
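The first and third options both come down to filling a partitioned table from the existing unpartitioned one. A minimal sketch of that step, assuming the existing xyzlogTable serves as the staging table and the new table is named xyzlogTable_partitioned as suggested above:

```sql
-- The partition column (dateC) is declared only in PARTITIONED BY,
-- not in the regular column list.
CREATE TABLE xyzlogTable_partitioned (
  hours STRING,
  minutes STRING,
  seconds STRING,
  TimeTaken STRING,
  Method STRING,
  UriQuery STRING,
  ProtocolStatus STRING
)
PARTITIONED BY (dateC STRING);

-- Enable dynamic partitioning for this session.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- The dynamic partition column must come last in the SELECT list;
-- Hive creates one partition per distinct dateC value.
INSERT OVERWRITE TABLE xyzlogTable_partitioned PARTITION (dateC)
SELECT hours, minutes, seconds, TimeTaken, Method, UriQuery, ProtocolStatus, dateC
FROM xyzlogTable;
```

If the data covers many distinct dates, the limits hive.exec.max.dynamic.partitions and hive.exec.max.dynamic.partitions.pernode may need to be raised. Afterwards, queries that filter on dateC only read the matching partitions.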

【Comments】:

【Solution 2】:

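You should first drop the table you already created and then create the partitioned table. Alternatively, use a different table name.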
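A sketch of the second variant, which keeps the loaded data by renaming the existing table before creating the partitioned one. The name xyzlogTable_unpartitioned is hypothetical; the column list is taken from the question.

```sql
-- Free up the name while keeping the already loaded data.
ALTER TABLE xyzlogTable RENAME TO xyzlogTable_unpartitioned;

-- Alternative to the rename: DROP TABLE xyzlogTable;
-- For a non-external table this also deletes the loaded data.

-- The original name is now available for the partitioned definition.
CREATE TABLE xyzlogTable (
  datenonQuery STRING,
  hours STRING,
  minutes STRING,
  seconds STRING,
  TimeTaken STRING,
  Method STRING,
  UriQuery STRING,
  ProtocolStatus STRING
)
PARTITIONED BY (dateC STRING);
```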

【Comments】: