1. Optimize Joins
We can improve join performance by enabling auto-convert map joins and by optimizing skew joins.
Auto Map Joins
An auto map join is a very useful feature when joining a big table with a small table. If we enable this feature, the small table is cached in local memory on every node and then joined with the big table in the map phase. Enabling auto map joins offers two benefits. First, loading the small table into cache saves read time on each data node. Second, it avoids skew joins in the Hive query, since the join operation is already done in the map phase for each block of data.
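As an illustration, the feature can be switched on per session: hive.auto.convert.join controls the conversion, and hive.mapjoin.smalltable.filesize sets the size threshold (in bytes) under which a table is treated as "small". The table and column names below are hypothetical.

```sql
-- Enable automatic conversion of common joins into map joins.
SET hive.auto.convert.join=true;
-- Tables below this size (bytes) are broadcast to every mapper.
SET hive.mapjoin.smalltable.filesize=25000000;

-- Hypothetical example: `orders` is large, `dim_country` is small,
-- so the join runs in the map phase against the cached small table.
SELECT o.order_id, c.country_name
FROM orders o
JOIN dim_country c ON o.country_code = c.code;
```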
Skew Joins
We can enable the optimization of skew joins, i.e. imbalanced joins, by setting the hive.optimize.skewjoin property to true, either with the SET command in the hive shell or in the hive-site.xml file.
<property>
<name>hive.optimize.skewjoin</name>
<value>true</value>
<description>
Whether to enable skew join optimization.
The algorithm is as follows: At runtime, detect the keys with a large skew. Instead of
processing those keys, store them temporarily in an HDFS directory. In a follow-up map-reduce
job, process those skewed keys. The same key need not be skewed for all the tables, and so,
the follow-up map-reduce job (for the skewed keys) would be much faster, since it would be a
map-join.
</description>
</property>
<property>
<name>hive.skewjoin.key</name>
<value>100000</value>
<description>
Determine if we get a skew key in join. If we see more than the specified number of rows with the same key in join operator,
we think the key as a skew join key.
</description>
</property>
<property>
<name>hive.skewjoin.mapjoin.map.tasks</name>
<value>10000</value>
<description>
Determine the number of map task used in the follow up map join job for a skew join.
It should be used together with hive.skewjoin.mapjoin.min.split to perform a fine grained control.
</description>
</property>
<property>
<name>hive.skewjoin.mapjoin.min.split</name>
<value>33554432</value>
<description>
Determine the number of map task at most used in the follow up map join job for a skew join by specifying
the minimum split size. It should be used together with hive.skewjoin.mapjoin.map.tasks to perform a fine grained control.
</description>
</property>
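For a session-level sketch, the same properties can be set from the hive shell; Hive DDL also lets you declare known skewed values at table-creation time. The table and the skewed values below are hypothetical.

```sql
-- Handle skewed keys detected at runtime.
SET hive.optimize.skewjoin=true;
SET hive.skewjoin.key=100000;

-- If the skewed keys are known up front, they can be declared in the DDL
-- (hypothetical table; 0 and -1 stand in for overrepresented user ids):
CREATE TABLE clicks (user_id BIGINT, url STRING)
SKEWED BY (user_id) ON (0, -1)
STORED AS ORC;
```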
2. Enable Bucketed Map Joins
If the tables are bucketed by a particular column and those tables are being used in joins, we can enable bucketed map joins to improve the performance.
<property>
<name>hive.optimize.bucketmapjoin</name>
<value>true</value>
<description>Whether to try bucket mapjoin</description>
</property>
<property>
<name>hive.optimize.bucketmapjoin.sortedmerge</name>
<value>true</value>
<description>Whether to try sorted bucket merge map join</description>
</property>
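A bucket map join requires both tables to be bucketed on the join key, with one table's bucket count a multiple of the other's; adding SORTED BY on the same key additionally enables the sorted-merge variant. A minimal sketch with hypothetical tables:

```sql
CREATE TABLE users (id BIGINT, name STRING)
CLUSTERED BY (id) SORTED BY (id) INTO 32 BUCKETS
STORED AS ORC;

CREATE TABLE purchases (user_id BIGINT, amount DOUBLE)
CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
STORED AS ORC;

SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- Joining on the bucketing column lets each mapper read only the
-- matching bucket of the other table.
SELECT u.name, SUM(p.amount) AS total
FROM purchases p
JOIN users u ON p.user_id = u.id
GROUP BY u.name;
```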
3. Enable Tez Execution Engine
We can improve the performance of Hive queries by 100% to 300% by running them on the Tez execution engine instead of the older MapReduce engine. We can enable the Tez engine with the following property from the hive shell.
hive> set hive.execution.engine=tez;
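To make the choice persistent rather than per session, the same setting can go into the hive-site.xml file (the description text here is a paraphrase, not Hive's official wording):

```xml
<property>
<name>hive.execution.engine</name>
<value>tez</value>
<description>Chooses the execution engine; tez replaces the default mr.</description>
</property>
```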
4. Enable Parallel Execution
Hive converts a query into one or more stages. Stages could be a MapReduce stage, a sampling stage, a merge stage, or a limit stage. By default, Hive executes these stages one at a time. A particular job may consist of some stages that do not depend on each other and could be executed in parallel, possibly allowing the overall job to finish sooner. Parallel execution can be enabled by setting the properties below.
<property>
<name>hive.exec.parallel</name>
<value>true</value>
<description>Whether to execute jobs in parallel</description>
</property>
<property>
<name>hive.exec.parallel.thread.number</name>
<value>8</value>
<description>How many jobs at most can be executed in parallel</description>
</property>
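As a sketch, a query whose branches do not depend on each other benefits directly: the two COUNT branches below (hypothetical tables) compile into independent stages that can run side by side once parallel execution is on.

```sql
SET hive.exec.parallel=true;
SET hive.exec.parallel.thread.number=8;

-- Each branch of the UNION ALL is an independent stage.
SELECT 'web' AS src, COUNT(*) AS n FROM web_logs
UNION ALL
SELECT 'app' AS src, COUNT(*) AS n FROM app_logs;
```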
5. Enable Vectorization
Vectorization was first introduced in Hive 0.13.1. With vectorized query execution, we can improve the performance of operations like scans, aggregations, filters, and joins by processing batches of 1024 rows at a time instead of a single row at a time.
We can enable vectorized query execution by setting the following three properties in either the hive shell or the hive-site.xml file.
hive> set hive.vectorized.execution.enabled = true;
hive> set hive.vectorized.execution.reduce.enabled = true;
hive> set hive.vectorized.execution.reduce.groupby.enabled = true;
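Note that vectorized execution in these Hive versions operates on ORC-backed tables; a minimal sketch with a hypothetical table:

```sql
-- Vectorization reads and processes rows in batches of 1024.
CREATE TABLE events_orc (event_id BIGINT, ts TIMESTAMP, kind STRING)
STORED AS ORC;

SET hive.vectorized.execution.enabled=true;

-- The scan and the group-by can both run vectorized on this table.
SELECT kind, COUNT(*) FROM events_orc GROUP BY kind;
```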
6. Enable Cost Based Optimization
Recent Hive releases provide cost-based optimization (CBO), which can apply further optimizations based on query cost, potentially leading to different decisions: how to order joins, which type of join to perform, the degree of parallelism, and so on.
Cost-based optimization can be enabled by setting the following properties in the hive-site.xml file.
<property>
<name>hive.cbo.enable</name>
<value>true</value>
<description>Flag to control enabling Cost Based Optimizations using Calcite framework.</description>
</property>
<property>
<name>hive.compute.query.using.stats</name>
<value>true</value>
<description>
When set to true Hive will answer a few queries like count(1) purely using stats
stored in metastore. For basic stats collection turn on the config hive.stats.autogather to true.
For more advanced stats collection need to run analyze table queries.
</description>
</property>
<property>
<name>hive.stats.fetch.partition.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires partition level basic
statistics like number of rows, data size and file size. Partition statistics are fetched from
metastore. Fetching partition statistics for each needed partition can be expensive when the
number of partitions is high. This flag can be used to disable fetching of partition statistics
from metastore. When this flag is disabled, Hive will make calls to filesystem to get file sizes
and will estimate the number of rows from row schema.
</description>
</property>
<property>
<name>hive.stats.fetch.column.stats</name>
<value>true</value>
<description>
Annotation of operator tree with statistics information requires column statistics.
Column statistics are fetched from metastore. Fetching column statistics for each needed column
can be expensive when the number of columns is high. This flag can be used to disable fetching
of column statistics from metastore.
</description>
</property>
<property>
<name>hive.stats.autogather</name>
<value>true</value>
<description>A flag to gather statistics automatically during the INSERT OVERWRITE command.</description>
</property>
<property>
<name>hive.stats.dbclass</name>
<value>fs</value>
<description>
Expects one of the pattern in [jdbc(:.*), hbase, counter, custom, fs].
The storage that stores temporary Hive statistics. In filesystem based statistics collection ('fs'),
each task writes statistics it has collected in a file on the filesystem, which will be aggregated
after the job has finished. Supported values are fs (filesystem), jdbc:database (where database
can be derby, mysql, etc.), hbase, counter, and custom as defined in StatsSetupConst.java.
</description>
</property>
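Since the optimizer's decisions are only as good as the statistics it sees, it also helps to collect table and column statistics explicitly; the table name below is hypothetical.

```sql
SET hive.cbo.enable=true;
SET hive.compute.query.using.stats=true;
SET hive.stats.fetch.column.stats=true;
SET hive.stats.fetch.partition.stats=true;

-- Gather basic and column-level statistics for the optimizer.
ANALYZE TABLE sales COMPUTE STATISTICS;
ANALYZE TABLE sales COMPUTE STATISTICS FOR COLUMNS;
```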