SPARK: YARN kills containers for exceeding memory limits

【Question title】: SPARK: YARN kills containers for exceeding memory limits
【Posted】: 2017-04-04 15:38:58
【Question description】:

We are currently running into a problem where Spark jobs running on YARN see many containers killed for exceeding memory limits.

16/11/18 17:58:52 WARN TaskSetManager: Lost task 53.0 in stage 49.0 (TID 32715, XXXXXXXXXX): 
  ExecutorLostFailure (executor 23 exited caused by one of the running tasks) 
  Reason: Container killed by YARN for exceeding memory limits. 12.4 GB of 12 GB physical memory used. 
    Consider boosting spark.yarn.executor.memoryOverhead.

The following parameters are being passed via spark-submit:

--executor-memory=6G
--driver-memory=4G
--conf "spark.yarn.executor.memoryOverhead=6G"

I am using Spark 2.0.1.

We increased memoryOverhead to this value after reading several posts about YARN killing containers (for example, How to avoid Spark executor from getting lost and yarn container killing it due to memory limit?).

Given my parameters and the log message, it does seem that "YARN kills executors when their memory usage is larger than (executor-memory + executor.memoryOverhead)".
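The arithmetic behind that observation can be sketched as follows. This is only an illustration of the numbers in the log above, not Spark's or YARN's actual code: the object and method names are hypothetical, and it assumes the container limit is simply the executor memory plus the overhead, ignoring any rounding YARN may apply to the request.

```scala
// Hypothetical helper: reproduces the numbers from the log message above.
object ContainerLimit {
  // Assumed limit in MiB: executor heap plus the memoryOverhead allowance.
  def containerLimitMiB(executorMemoryMiB: Long, memoryOverheadMiB: Long): Long =
    executorMemoryMiB + memoryOverheadMiB

  // True when the observed physical memory is above the container limit.
  def exceedsLimit(usedMiB: Double, limitMiB: Long): Boolean =
    usedMiB > limitMiB.toDouble

  def main(args: Array[String]): Unit = {
    // --executor-memory=6G plus spark.yarn.executor.memoryOverhead=6G
    val limit = containerLimitMiB(6L * 1024, 6L * 1024)
    println(s"container limit: $limit MiB") // 12288 MiB, i.e. 12 GB
    // The log reports 12.4 GB of physical memory in use.
    println(s"over the limit: ${exceedsLimit(12.4 * 1024, limit)}") // true
  }
}
```

With these settings the limit works out to 12 GB, which matches the "12.4 GB of 12 GB physical memory used" line in the log.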

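It is not practical to keep increasing this overhead in the hope that we eventually find a value at which these errors stop occurring. We are seeing this issue in several different jobs. I would appreciate any suggestions on which parameters I should change, what I should check, where I should start debugging, and so on. I can provide further configuration options if needed.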

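【Question discussion】:

Are you using Spark SQL?
When you work with large datasets, you can try increasing spark.default.parallelism and spark.sql.shuffle.partitions in spark-defaults.conf to higher values. This will reduce memory usage.
OK, I'll give it a try.
Well, Spark 2.x certainly uses more off-heap memory than before, so memoryOverhead needs to be set much higher than we were used to. The rule I used to follow was that if memoryOverhead had to be set to more than 1/3 of the available memory, something was wrong and I needed to repartition my data. But now our jobs run with memoryOverhead taking up 2/3 of the available memory, leaving only a small portion for executor memory. I'm afraid all of our memory settings are to some degree based on trial and error and gut feeling...
We also ran into this issue, and we resolved it by using Spark dynamic allocation. Just add 2G as overhead.

Tags: apache-spark hadoop-yarn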

【Solution 1】:

You can reduce memory usage by raising the following settings in spark-defaults.conf:

spark.default.parallelism
spark.sql.shuffle.partitions
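As a rough illustration of why more partitions can lower memory pressure per task, the sketch below estimates the average shuffle partition size. The object name, the method, and the 500 GiB total are hypothetical, and the estimate assumes the data is spread evenly across partitions (no skew), which real jobs often violate.

```scala
// Hypothetical back-of-the-envelope estimate, not part of Spark.
object PartitionSizing {
  // Average bytes per shuffle partition, assuming evenly distributed data.
  def avgPartitionBytes(totalShuffleBytes: Long, numPartitions: Int): Long = {
    require(numPartitions > 0, "numPartitions must be positive")
    totalShuffleBytes / numPartitions
  }

  def main(args: Array[String]): Unit = {
    val totalBytes = 500L * 1024 * 1024 * 1024 // assumed 500 GiB of shuffle data
    // 200 is the default for spark.sql.shuffle.partitions; 2001 is just above 2000.
    for (n <- Seq(200, 2001)) {
      val mib = avgPartitionBytes(totalBytes, n) / (1024L * 1024L)
      println(s"$n partitions -> about $mib MiB per partition")
    }
  }
}
```

Each task then has to hold a smaller slice of the data at a time, at the cost of scheduling more tasks.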

Behavior changes when you use more than 2000 partitions for spark.sql.shuffle.partitions. You can see this in Spark's source code on GitHub:

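// Excerpt from Spark's MapStatus companion object (not standalone code).
private[spark] object MapStatus {

  def apply(loc: BlockManagerId, uncompressedSizes: Array[Long]): MapStatus = {
    // With more than 2000 shuffle partitions, Spark switches to the more compact status format.
    if (uncompressedSizes.length > 2000) {
      HighlyCompressedMapStatus(loc, uncompressedSizes)
    } else {
      new CompressedMapStatus(loc, uncompressedSizes)
    }
  }
}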

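I suggest testing with more than 2000 partitions. With very large datasets this can sometimes be faster. According to the linked source, your tasks can be as short as 200 ms. The right configuration is not easy to find, but depending on your workload it can make a difference of hours.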

【Discussion】: