How to limit the number of mappers in a Hive job? Answers

【Question title】: How to limit the number of mappers in Hive job?
【Posted】: 2017-10-23 13:05:34
【Question description】:

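On my three-node cluster I have tuned all the parameters I thought were needed for performance, but it has not helped much. All of our Hive tables are created in Parquet format. My team loads data from an external table into an internal table with the script below: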

ksh -c 'hadoop fs -rm -R hdfs:///user/hive/warehouse/bistore_sit_cycle2.db/wt_consumer/d_partition_number=0;
        hive -e  "set hive.exec.dynamic.partition.mode=nonstrict;
        insert into bistore_sit_cycle2.wt_consumer
        partition(d_partition_number)
        select * from bistore_sit_cycle2.ext_wt_consumer;
        set hive.exec.dynamic.partition.mode=strict;"'

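The load takes more than 2 hours. The Hive job is created with 718 mappers and runs 2 containers on each node, and only 5 mappers run concurrently for this job. The load is 85M records, about 35 GB.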

How can I run a job like this with fewer mappers, and how can I increase the number of mappers that run concurrently?

These are my complete cluster and YARN configuration details:

CPU: Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (16 physical cores)(32 hyper threaded) 
RAM:256GB 
DISK:1.2TB x 16
MapR 5.0.0 - Community Edition
        mapreduce.map.memory.mb=10g
        mapreduce.reduce.memory.mb=16g
        yarn.app.mapreduce.am.resource.mb=16g
        yarn.app.mapreduce.am.command-opts=15g
        yarn.app.mapreduce.am.resource.cpu-vcores=16
        mapreduce.map.cpu.vcores=12
        mapreduce.reduce.cpu.vcores=16
        mapreduce.map.disk=1.5
        mapreduce.reduce.disk=3.99
        mapreduce.map.java.opts=9g
        mapreduce.reduce.java.opts=15g
        mapreduce.task.io.sort.mb=1024
        mapreduce.task.io.sort.factor=1024
        mapreduce.reduce.shuffle.parallelcopies=48
        yarn.nodemanager.resource.memory-mb=180g
        yarn.scheduler.maximum-allocation-mb=180g
        yarn.scheduler.minimum-allocation-mb=4g
        yarn.nodemanager.resource.cpu-vcores=32
        yarn.nodemanager.vmem-pmem-ratio=3.2
        HADOOP_CLIENT_OPTS=32g

【Comments】:

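"All of our Hive tables are created in Parquet format" >> even the EXTERNAL files???
If your input is CSV files, you can tell Hive to process multiple small files per mapper; cf. the comments below.
You could also try reducing the container size; for a specific case, the defaults may be too high even with CombineInputFormat.
To check the default container sizes: set mapreduce.map.memory.mb ; set mapreduce.reduce.memory.mb ; set yarn.app.mapreduce.am.resource.mb ; cf. hadoop.apache.org/docs/r2.7.2/hadoop-mapreduce-client/… >> and also set hive.tez.container.size ; if you are using TEZ, cf. cwiki.apache.org/confluence/display/Hive/…
Duh... yarn.app.mapreduce.am.command-opts does not look like anything you would pass on a Java command line; and I wonder why the YARN AppMaster would need 16 GB and 16 (virtual) cores to do, well, nothing except start and monitor the Map & Reduce containers. Is that what you call "optimized"?!?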

Tags: hadoop hive mapr

【Solution 1】:

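The number of mappers spawned for a Hive query depends on the input splits. You have 35 GB of data and you get 718 mappers, which means your split size is roughly 50 MB (35*1024/718 ≈ 49.9). Your cluster has only three nodes, and with your YARN container size settings it can probably spawn at most 5 containers. To increase parallelism you need more containers, i.e. you need to scale the cluster nodes vertically or horizontally. You need more mappers, not fewer, to improve performance; fewer mappers means less parallelism.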
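The arithmetic in this answer can be reproduced with a few lines of shell. This is only a sketch using the figures quoted in the question (35 GB, 718 mappers, 5 concurrent mappers); the variable names are made up for illustration, and the "waves" figure is simply the ceiling of mappers divided by concurrency, not something reported by the asker.

```shell
#!/bin/sh
# Figures taken from the question; variable names are illustrative only.
INPUT_GB=35        # total input size
MAPPERS=718        # mappers created for the job
CONCURRENT=5       # mappers observed running at the same time

# Average split size in MB (integer division truncates 49.9 to 49).
SPLIT_MB=$((INPUT_GB * 1024 / MAPPERS))

# Number of sequential "waves" of mappers: ceil(MAPPERS / CONCURRENT).
WAVES=$(( (MAPPERS + CONCURRENT - 1) / CONCURRENT ))

echo "average split size: ~${SPLIT_MB} MB"
echo "mapper waves at ${CONCURRENT} concurrent mappers: ${WAVES}"
```

With only 5 mappers running at a time, the job has to work through well over a hundred waves, which is why the concurrency limit matters more here than the raw mapper count.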

【Discussion】:

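Objection: the Hadoop CombineFileInputFormat is specifically designed to buffer multiple small splits per mapper; it is used in Hive via the hive.hadoop.supports.splittable.combineinputformat property: "Whether to combine small input files so that fewer mappers are spawned" -- see also hive.input.format in cwiki.apache.org/confluence/display/Hive/… >> Caveat: this only applies to TextFile or SequenceFile.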
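For the case described in the comment above, a session-level sketch might look like the following. It assumes the input format supports combining (per the caveat, TextFile or SequenceFile), and the 256 MB target split size is a hypothetical value chosen for illustration, not a recommendation from this thread. The script only prints the settings so they can be reviewed before being prepended to a query.

```shell
#!/bin/sh
# Hypothetical target split size: 256 MB, expressed in bytes.
SPLIT_BYTES=$((256 * 1024 * 1024))

# Session-level Hive settings asking for combined, larger input splits.
# Whether they take effect depends on the table's input format.
HIVE_SETTINGS="set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set mapreduce.input.fileinputformat.split.maxsize=${SPLIT_BYTES};
set mapreduce.input.fileinputformat.split.minsize=${SPLIT_BYTES};"

# Print the settings for review. To apply them, prepend them to the
# statement passed to the Hive CLI, e.g.
#   hive -e "${HIVE_SETTINGS} <your insert statement>;"
printf '%s\n' "$HIVE_SETTINGS"
```

Larger splits mean fewer mappers, but they do not by themselves raise the number of mappers that run concurrently.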
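I agree. But I don't think karthee is using CombineInputFormat. Assuming the nodes are low-end commodity hardware, getting 5 containers out of three nodes is about the best he can do. If it is server-grade hardware, he can tune the YARN container settings to spawn the maximum number of containers.
Hi Despicable Me and Samson Scharfrichter... please see my newly edited question; I have added the complete configuration details. Thanks
I have also tried these properties... mapreduce.job.maps=6, mapreduce.job.maps=3, mapreduce.tasktracker.map.tasks.maximum=10, mapreduce.tasktracker.reduce.tasks.maximum=6
My suggestion is that you try to figure out why you are only getting a maximum of 5 containers; for example, you could check the queue allocation and the YARN container minimum size setting. I don't think you should reduce the number of mappers, since you have 35 GB of Parquet data.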
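One possible way to approach that check is to work out how many map containers fit on a node from the settings posted in the question. The sketch below assumes the scheduler enforces vcores as well as memory, which depends on the scheduler configuration and is not stated anywhere in the thread; the variable names are illustrative. Under that assumption the vcore requests, not memory, would be the limiting factor.

```shell
#!/bin/sh
# Values copied from the configuration in the question.
NODES=3
NODE_VCORES=32      # yarn.nodemanager.resource.cpu-vcores
NODE_MEM_GB=180     # yarn.nodemanager.resource.memory-mb
MAP_VCORES=12       # mapreduce.map.cpu.vcores
MAP_MEM_GB=10       # mapreduce.map.memory.mb
AM_VCORES=16        # yarn.app.mapreduce.am.resource.cpu-vcores
AM_MEM_GB=16        # yarn.app.mapreduce.am.resource.mb

# Smaller of two integers.
min() { if [ "$1" -lt "$2" ]; then echo "$1"; else echo "$2"; fi; }

# Mappers that fit on a node that does not host the ApplicationMaster.
PER_NODE=$(min $((NODE_VCORES / MAP_VCORES)) $((NODE_MEM_GB / MAP_MEM_GB)))

# Mappers that fit on the node that also hosts the ApplicationMaster.
ON_AM_NODE=$(min $(( (NODE_VCORES - AM_VCORES) / MAP_VCORES )) \
                 $(( (NODE_MEM_GB - AM_MEM_GB) / MAP_MEM_GB )))

# ASSUMPTION: vcores are enforced by the scheduler.
CONCURRENT=$(( (NODES - 1) * PER_NODE + ON_AM_NODE ))

echo "mappers per node without AM: ${PER_NODE}"
echo "mappers on the AM node:      ${ON_AM_NODE}"
echo "concurrent mappers overall:  ${CONCURRENT}"
```

If this assumption holds, the result matches the 2 containers per node and 5 concurrent mappers reported in the question, which would suggest lowering mapreduce.map.cpu.vcores and the AppMaster's vcores before touching the mapper count.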