【Posted】: 2019-12-27 23:26:23
【Problem Description】:
I'm on Cloudera 5.16 with Hadoop 2.6.
I'm using ImportTsv to load large CSV files into HBase:
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=';' \
    -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:age \
    mynamespace:mytable /path/to/csv/dir/*.csv
My problem is that the operation takes 20 to 30 seconds regardless of file size (one of my files has 300k rows, while others have only 1k rows), which makes me think the time is dominated by a fixed per-job overhead rather than by the data itself.
19/08/22 15:11:56 INFO mapreduce.Job: Job job_1566288518023_0335 running in uber mode : false
19/08/22 15:11:56 INFO mapreduce.Job: map 0% reduce 0%
19/08/22 15:12:06 INFO mapreduce.Job: map 67% reduce 0%
19/08/22 15:12:08 INFO mapreduce.Job: map 100% reduce 0%
19/08/22 15:12:08 INFO mapreduce.Job: Job job_1566288518023_0335 completed successfully
19/08/22 15:12:08 INFO mapreduce.Job: Counters: 34
File System Counters
        FILE: Number of bytes read=0
        FILE: Number of bytes written=801303
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=2709617
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=6
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        HDFS: Number of bytes read erasure-coded=0
Job Counters
        Launched map tasks=3
        Data-local map tasks=3
        Total time spent by all maps in occupied slots (ms)=25662
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=25662
        Total vcore-milliseconds taken by all map tasks=25662
        Total megabyte-milliseconds taken by all map tasks=26277888
Map-Reduce Framework
        Map input records=37635
        Map output records=37635
        Input split bytes=531
        Spilled Records=0
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=454
        CPU time spent (ms)=14840
        Physical memory (bytes) snapshot=1287696384
        Virtual memory (bytes) snapshot=8280121344
        Total committed heap usage (bytes)=2418540544
        Peak Map Physical memory (bytes)=439844864
        Peak Map Virtual memory (bytes)=2776657920
ImportTsv
        Bad Lines=0
File Input Format Counters
        Bytes Read=2709086
File Output Format Counters
        Bytes Written=0
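One thing I notice in the log is "running in uber mode : false". I wonder whether forcing these small jobs into uber mode (so the maps run inside the ApplicationMaster's JVM instead of launching separate containers) would cut the startup overhead; something like the following, using the standard Hadoop 2 mapreduce.job.ubertask.enable switch (just a guess on my part):

# Same import, but asking YARN to run the tiny job in a single JVM (uber mode)
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dmapreduce.job.ubertask.enable=true \
    -Dimporttsv.separator=';' \
    -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:age \
    mynamespace:mytable /path/to/csv/dir/*.csv

I haven't verified that these jobs fit within the uber-mode limits (maxmaps/maxbytes), so this may be a dead end.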
I also pre-split the table into several regions by row key to spread the puts across region servers, but it didn't change anything:
create 'mynamespace:mytable', {NAME => 'data', COMPRESSION => 'SNAPPY'}, {SPLITS => ['0','1','2','3','4','5']}
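I've also read that ImportTsv can emit HFiles instead of issuing puts, and that the HFiles can then be bulk-loaded into the table. A sketch of what I mean (the /tmp/hfiles staging directory is just a placeholder):

# Step 1: run the same import, but write HFiles instead of puts
hbase org.apache.hadoop.hbase.mapreduce.ImportTsv \
    -Dimporttsv.separator=';' \
    -Dimporttsv.columns=HBASE_ROW_KEY,data:name,data:age \
    -Dimporttsv.bulk.output=/tmp/hfiles \
    mynamespace:mytable /path/to/csv/dir/*.csv

# Step 2: hand the generated HFiles over to the table's regions
hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles /tmp/hfiles mynamespace:mytable

I'm not sure this helps with the fixed startup time, though, since it still runs a MapReduce job.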
Does anyone know how to optimize this operation?
Thanks.
【Discussion】:
Tags: import hbase cloudera bulkinsert