1. Optimize Joins
We can improve join performance by enabling auto-convert map joins and by optimizing skew joins.
Auto Map Joins
An auto map join is a very useful feature when joining a big table with a small table. If we enable this feature, the small table is cached in local memory on every node and then joined with the big table in the map phase. Enabling auto map joins offers two benefits. First, loading the small table into cache saves read time on each data node. Second, it avoids skew joins in the Hive query, since the join operation is already done in the map phase for each block of data.
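As an illustration, the feature can be switched on per session: hive.auto.convert.join controls the conversion, and hive.mapjoin.smalltable.filesize sets the size threshold (in bytes) under which a table is treated as "small". The table and column names below are hypothetical.

```sql
-- Enable automatic conversion of common joins into map joins.
SET hive.auto.convert.join=true;
-- Tables below this size (bytes) are broadcast to every mapper.
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical example: `orders` is large, `dim_country` is small,
-- so the join runs in the map phase against the cached small table.
SELECT o.order_id, c.country_name
FROM orders o
JOIN dim_country c ON o.country_code = c.code;
```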
Skew Joins
We can enable the optimization of skew joins, i.e. imbalanced joins, by setting the hive.optimize.skewjoin property to true, either with the SET command in the hive shell or in the hive-site.xml file.
<property>
<name>hive.optimize.skewjoin</name>
<value>true</value>
<description>
Whether to enable skew join optimization.
The algorithm is as follows: At runtime, detect the keys with a large skew. Instead of
processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce
job, process those skewed keys. The same key need not be skewed for all the tables, and so,
the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a
map-join.
</description>
</property>
<property>
<name>hive.skewjoin.key</name>
<value>100000</value>
<description>
Determine if we get a skew key in join. If we see more than the specified number of rows with the same key in join operator,
we think the key as a skew join key.
</description>
</property>
<property>
<name>hive.skewjoin.mapjoin.map.tasks</name>
<value>10000</value>
<description>
Determine the number of map task used in the follow up map join job for a skew join.
It should be used together with hive.skewjoin.mapjoin.min.split to perform a fine grained control.
</description>
</property>
<property>
<name>hive.skewjoin.mapjoin.min.split</name>
<value>33554432</value>
<description>
Determine the number of map task at most used in the follow up map join job for a skew join by specifying
the minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks to perform a fine grained control.
</description>
</property>
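For a session-level sketch, the same properties can be set from the hive shell; Hive DDL also lets you declare known skewed values at table-creation time. The table and the skewed values below are hypothetical.

```sql
-- Handle skewed keys detected at runtime.
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;

-- If the skewed keys are known up front, they can be declared in the DDL
-- (hypothetical table; 0 and -1 stand in for overrepresented user ids):
CREATE TABLE clicks (user_id BIGINT, url STRING)
SKEWED BY (user_id) ON (0, -1)
STORED AS ORC;
```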
2. Enable Bucketed Map Joins
If the tables are bucketed by a particular column and those tables are being used in joins, we can enable bucketed map joins to improve the performance.
<property>
<name>hive.optimize.bucketmapjoin</name>
<value>true</value>
<description>Whether to try bucket mapjoin</description>
</property>
<property>
<name>hive.optimize.bucketmapjoin.sortedmerge</name>
<value>true</value>
<description>Whether to try sorted bucket merge map join</description>
</property>
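A bucket map join requires both tables to be bucketed on the join key, with one table's bucket count a multiple of the other's; adding SORTED BY on the same key additionally enables the sorted-merge variant. A minimal sketch with hypothetical tables:

```sql
CREATE TABLE users (id BIGINT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE purchases (user_id BIGINT, amount DOUBLE)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Joining on the bucketing column lets each mapper read only the
-- matching bucket of the other table.
SELECT u.name, SUM(p.amount) AS total
FROM purchases p
JOIN users u ON p.user_id = u.id
GROUP BY u.name;
```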
3. Enable Tez Execution Engine
We can improve the performance of Hive queries by 100% to 300% by running them on the Tez execution engine instead of the older MapReduce engine. We can enable the Tez engine with the following property from the hive shell.
hive> set hive.execution.engine=tez;
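To make the choice persistent rather than per session, the same setting can go into the hive-site.xml file (the description text here is a paraphrase, not Hive's official wording):

```xml
<property>
<name>hive.execution.engine</name>
<value>tez</value>
<description>Chooses the execution engine; tez replaces the default mr.</description>
</property>
```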
4. Enable Parallel Execution
Hive converts a query into one or more stages. Stages could be a MapReduce stage, a sampling stage, a merge stage, or a limit stage. By default, Hive executes these stages one at a time. A particular job may consist of some stages that do not depend on each other and could be executed in parallel, possibly allowing the overall job to finish sooner. Parallel execution can be enabled by setting the properties below.
<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
<description>How many jobs at most can be executed in parallel</description>
</property>
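As a sketch, a query whose branches do not depend on each other benefits directly: the two COUNT branches below (hypothetical tables) compile into independent stages that can run side by side once parallel execution is on.

```sql
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;

-- Each branch of the UNION ALL is an independent stage.
SELECT 'web' AS src, COUNT(*) AS n FROM web_logs
UNION ALL
SELECT 'app' AS src, COUNT(*) AS n FROM app_logs;
```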
5. Enable Vectorization
Vectorization was first introduced in Hive 0.13.1. With vectorized query execution, we can improve the performance of operations like scans, aggregations, filters, and joins by processing batches of 1024 rows at a time instead of a single row at a time.
We can enable vectorized query execution by setting the following three properties in either the hive shell or the hive-site.xml file.
hive> set hive.vectorized.execution.enabled = true;
hive> set hive.vectorized.execution.reduce.enabled = true;
hive> set hive.vectorized.execution.reduce.groupby.enabled = true;
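Note that vectorized execution in these Hive versions operates on ORC-backed tables; a minimal sketch with a hypothetical table:

```sql
-- Vectorization reads and processes rows in batches of 1024.
CREATE TABLE events_orc (event_id BIGINT, ts TIMESTAMP, kind STRING)
STORED AS ORC;

SET hive.vectorized.execution.enabled=true;

-- The scan and the group-by can both run vectorized on this table.
SELECT kind, COUNT(*) FROM events_orc GROUP BY kind;
```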
6. Enable Cost Based Optimization
Recent Hive releases provide cost-based optimization (CBO), which can apply further optimizations based on query cost, potentially leading to different decisions: how to order joins, which type of join to perform, the degree of parallelism, and so on.
Cost-based optimization can be enabled by setting the following properties in the hive-site.xml file.
<property>
<name>hive.cbo.enable</name>
<value>true</value>
<description>Flag to control enabling Cost Based Optimizations using Calcite framework.</description>
</property>
<property>
<name>hive.compute.query.using.stats</name>
<value>true</value>
<description>
When set to true Hive will answer a few queries like count(1) purely using stats
stored in metastore. For basic stats collection turn on the config hive.stats.autogather to true.
For more advanced stats collection need to run analyze table queries.
</description>
</property>
<property>
<name>hive.stats.fetch.partition.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires partition level basic
statistics like number of rows, data size and file size. Partition statistics are fetched from
metastore. Fetching partition statistics for each needed partition can be expensive when the
number of partitions is high. This flag can be used to disable fetching of partition statistics
from metastore. When this flag is disabled, Hive will make calls to filesystem to get file sizes
and will estimate the number of rows from row schema.
</description>
</property>
<property>
<name>hive.stats.fetch.column.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires column statistics.
Column statistics are fetched from metastore. Fetching column statistics for each needed column
can be expensive when the number of columns is high. This flag can be used to disable fetching
of column statistics from metastore.
</description>
</property>
<property>
<name>hive.stats.autogather</name>
<value>true</value>
<description>A flag to gather statistics automatically during the INSERT OVERWRITE command.</description>
</property>
<property>
<name>hive.stats.dbclass</name>
<value>fs</value>
<description>
Expects one of the pattern in [jdbc(:.*), hbase, counter, custom, fs].
The storage that stores temporary Hive statistics. In filesystem based statistics collection ('fs'),
each task writes statistics it has collected in a file on the filesystem, which will be aggregated
after the job has finished. Supported values are fs (filesystem), jdbc:database (where database
can be derby, mysql, etc.), hbase, counter, and custom as defined in StatsSetupConst.java.
</description>
</property>
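Since the optimizer's decisions are only as good as the statistics it sees, it also helps to collect table and column statistics explicitly; the table name below is hypothetical.

```sql
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Gather basic and column-level statistics for the optimizer.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
```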