How to load CSV with different delimiter to a single Hadoop table: answers

【Question title】: How to load CSV with different delimiter to a single Hadoop table
【Posted】: 2017-06-02 07:42:37
【Question description】:

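I want to populate a Hive table from multiple CSV files. The problem is that not all of the files use the same delimiter. When creating the table I can specify only one delimiter, e.g. ~: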

create table status (type string, ...) 
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
with serdeproperties ("separatorChar" = "~")
STORED AS TEXTFILE

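Does Hive have a built-in feature that allows multiple CSV delimiters? I know the files could be normalized by a Hadoop job before loading, or, based on https://stackoverflow.com/a/26356592/2207078, I could use Pig to do it, but I am looking for something built in. Ideally, I would like to create the status table without specifying a delimiter and tell Hive how to split the columns at LOAD time.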

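【Comments】:

I'm not sure how your answer is supposed to help me or anyone else. What is the added value here?
It is not an answer but a comment. I can think of 2 "solutions", but are you starting to realize how bad this is?
Actually, I don't. I can imagine having two vendors that supply me with exactly the same information in a slightly different format (more precisely, with a different delimiter). If Hive stores the loaded data internally anyway, why would that be so bad?
(1) In this imaginary use case we can see two basic concepts of data management missing: integration and ETL. Data does not miraculously appear in one's system. (2) The Pig "solution" you mentioned is destined to work only in very specific use cases. (3) There is no such thing as "stored...internally" in Hive, and in any case the issue here is not how you store the data but how you read it correctly in the first place.

Tags: csv hadoop hive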

【Solution 1】:

Demo

Data files

comma.txt

|Now|,I've,heard,there,was
a,secret,chord;,That,David 
played,||and||,it,,pleased
the,,,Lord;,

semicolon.txt

But;;you;don't;really 
|care|;for;music;do;||||| you |||||?

pipeline.txt

,It,|,goes,|,like,|,this,|,the, 
fourth|the|fifth|The|;minor
fall|the|;major|lift|The
baffled|king||composing|hallelujah

DDL

create external table mytable 
(c1 string,c2 string,c3 string,c4 string,c5 string)
partitioned by (delim string)
;

alter table mytable set serdeproperties ('field.delim'=',');
alter table mytable add partition (delim='comma');  

alter table mytable set serdeproperties ('field.delim'=';');
alter table mytable add partition (delim='semicolon');

alter table mytable set serdeproperties ('field.delim'='|');
alter table mytable add partition (delim='pipeline');
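
The DDL above appears to rely on each partition keeping the SerDe properties that were in effect on the table when the partition was added, so later table-level changes to 'field.delim' only affect partitions created afterwards. A minimal sketch for checking this, and for correcting a single partition if it was added with the wrong delimiter (using the partition names from the demo; the exact layout of the describe output is not shown here because it varies by Hive version):

```sql
-- Show the storage information of one partition; its 'field.delim'
-- should be listed among the partition's storage descriptor parameters.
describe formatted mytable partition (delim='semicolon');

-- If a partition was added while the table carried the wrong delimiter,
-- the property can be set on that partition alone.
alter table mytable partition (delim='semicolon')
  set serdeproperties ('field.delim'=';');
```

Setting the property at partition level avoids touching the table-level default, which matters only for partitions added later.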

Put the files in the matching directories

mytable
├── delim=comma
│   └── comma.txt
├── delim=pipeline
│   └── pipeline.txt
└── delim=semicolon
    └── semicolon.txt
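
Since the question asks about LOAD, the same layout can presumably also be reached with Hive's LOAD DATA statement instead of copying files by hand. This is a sketch only: the '/staging/...' HDFS paths are hypothetical, and it assumes the three partitions already exist with their delimiters set, because a partition created implicitly by the load would pick up whatever 'field.delim' the table carries at that moment.

```sql
-- Hypothetical staging paths; replace them with wherever the files actually land.
-- LOAD DATA INPATH moves each file into the directory of the named partition.
load data inpath '/staging/comma.txt'
  into table mytable partition (delim='comma');

load data inpath '/staging/semicolon.txt'
  into table mytable partition (delim='semicolon');

load data inpath '/staging/pipeline.txt'
  into table mytable partition (delim='pipeline');
```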

select * from mytable
;

+---------+---------+--------+-----------+------------------+-----------+
|   c1    |   c2    |   c3   |    c4     |        c5        |   delim   |
+---------+---------+--------+-----------+------------------+-----------+
| |Now|   | I've    | heard  | there     | was              | comma     |
| a       | secret  | chord; | That      | David            | comma     |
| played  | ||and|| | it     |           | pleased          | comma     |
| the     |         |        | Lord;     |                  | comma     |
| But     |         | you    | don't     | really           | semicolon |
| |care|  | for     | music  | do        | ||||| you |||||? | semicolon |
| ,It,    | ,goes,  | ,like, | ,this,    | ,the,            | pipeline  |
| fourth  | the     | fifth  | The       | ;minor           | pipeline  |
| fall    | the     | ;major | lift      | The              | pipeline  |
| baffled | king    |        | composing | hallelujah       | pipeline  |
+---------+---------+--------+-----------+------------------+-----------+

【Comments】: