Hive | Create partition on a date: answers

【Question title】: Hive | Create partition on a date
【Posted】: 2020-08-26 17:15:42
【Question】:

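I need to create an external Hive table on top of a csv file. The CSV has col1, col2, col3 and col4.

My external Hive table should be partitioned by month, but the csv file does not have a month field. col1 is a date field. How can I do this?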

【Comments】:

Tags: hive hiveql hive-partitions

【Answer 1】:

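You need to reload the data into a partitioned table.

Create a non-partitioned table (mytable) on top of the folder containing the CSV.
Create the partitioned table (mytable_part):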

create table mytable_part(
  --columns specification here for col1, col2, col3, col4
)
partitioned by (part_month string)
...
stored as textfile --you can choose any format you need
Load the data into the partitioned table using dynamic partitioning, computing the partition column in the query:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

insert overwrite table mytable_part partition (part_month)
select
  col1, col2, col3, col4,
  substr(col1, 1, 7) as part_month --partition column in yyyy-MM format
from mytable
distribute by substr(col1, 1, 7) --to reduce the number of files
;
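
The first step above (the non-partitioned table on top of the CSV folder) is described but not shown. A minimal sketch of it might look like the following. It assumes comma-separated text files, that col1 holds dates as text beginning with yyyy-MM (which the substr(col1, 1, 7) expression relies on), and a hypothetical folder hdfs://somepath/csv_folder that is not part of the original answer.

```sql
-- Sketch only: staging table over the raw CSV files.
-- The location below is a hypothetical placeholder; point it at the folder
-- (not the file) that holds the CSV.
create external table mytable (
  col1 string,  -- date kept as text, e.g. 2020-08-26, so substr(col1, 1, 7) gives 2020-08
  col2 string,
  col3 string,
  col4 string
)
row format delimited fields terminated by ','
stored as textfile
location 'hdfs://somepath/csv_folder';

-- Quick sanity check of the derived partition value before running the insert
select col1, substr(col1, 1, 7) as part_month
from mytable
limit 10;
```

Because the table is external, dropping it afterwards removes only the table definition and leaves the CSV files in place.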

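【Comments】:

Thanks, but is there any way to do this without an intermediate table?
@ArpitaMishra You need to repartition the data either way. You can use Spark to read the files and write into a partitioned table. But you cannot create a partitioned table on top of non-partitioned data and expect Hive to repartition it for you. Each partition in Hive has its own location with files inside, and those files should contain only data belonging to that partition. An intermediate table is not a big problem, though.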
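
To see the point about each partition having its own location, the partitions of the loaded table can be inspected after the insert. This is a sketch that assumes the mytable_part table from Answer 1 has been populated; the partition value 2020-08 is a hypothetical example and should be replaced with one that actually exists in your data.

```sql
-- List the partitions created by the dynamic-partition insert
show partitions mytable_part;

-- Show the metadata of one partition, including its storage location
-- ('2020-08' is a hypothetical value)
describe formatted mytable_part partition (part_month = '2020-08');
```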

【Answer 2】:

Try this approach.

Copy the csv data into a folder at the HDFS location hdfs://somepath/5 and add that path as a partition to the external table.

create external table ext1(
    col1   string
    ,col2  string
    ,col3  string
    ,col4  string
)
partitioned by (mm int)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
STORED AS TEXTFILE; -- the files are plain CSV, so TEXTFILE rather than ORC

alter table ext1 add partition(mm = 5) location 'hdfs://yourpath/5';
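
Note that this approach only works if each folder already contains rows for a single month: Hive does not check the file contents against the partition value, it simply labels every row in that folder with mm = 5. A sketch of how further months could be registered and queried follows. The folder hdfs://yourpath/6 is a hypothetical example, assumed to hold only June rows.

```sql
-- Register another month; the folder must contain only that month's rows.
-- 'hdfs://yourpath/6' is a hypothetical path following the pattern above.
alter table ext1 add if not exists partition (mm = 6) location 'hdfs://yourpath/6';

-- Confirm which partitions the table knows about
show partitions ext1;

-- Filtering on the partition column lets Hive read only the matching folder
select col1, col2, col3, col4
from ext1
where mm = 5;
```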

【Comments】: