Why does Prometheus consume so much memory? Answers

【Question Title】: Why does Prometheus consume so much memory?
【Posted】: 2019-09-30 14:59:19
【Question Description】:

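I am using Prometheus 2.9.2 to monitor a large environment of nodes. As part of testing the maximum scale Prometheus can handle in our environment, I simulated a large number of metrics in our test environment.

My management server has 16GB of RAM and 100GB of disk space.

During the scale testing, I noticed that the Prometheus process consumed more and more memory until it crashed.

I also noticed that, while Prometheus's memory usage was rising, the WAL directory was filling up quickly with a large number of data files.

The management server scrapes its nodes every 15 seconds, and the storage parameters are all left at their defaults.

I would like to know why this happens, and how (or whether) the process can be prevented from crashing.

Thank you!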

【Question Comments】:

You can monitor your Prometheus by scraping its "/metrics" endpoint. It will give you useful metrics.

Tags: memory prometheus

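【Solution 1】:

Out-of-memory crashes are usually the result of an excessively heavy query. Such a query may be set in one of your rules. (It may even be running from a Grafana page rather than in Prometheus itself.)

If you have a very large number of metrics, the rule may be querying all of them. A quick fix is to specify exactly which metrics to query, using specific labels instead of a regex.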
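To make the difference concrete, the Go sketch below builds two instant-query URLs for the Prometheus HTTP API (`/api/v1/query`): one with a catch-all regex matcher and one with exact label matchers. The server address, the helper name `queryURL`, and the choice of `node_cpu_seconds_total` with its label values are assumptions for illustration only, so substitute the metric and labels your rule actually uses.

```go
package main

import (
	"fmt"
	"net/url"
)

// queryURL (hypothetical helper) builds an instant-query URL for the
// Prometheus HTTP API from a base address and a PromQL expression.
func queryURL(base, promQL string) string {
	v := url.Values{}
	v.Set("query", promQL)
	return base + "/api/v1/query?" + v.Encode()
}

func main() {
	base := "http://localhost:9090" // assumed address

	// Broad: the regex matcher selects every series of the metric.
	broad := `node_cpu_seconds_total{instance=~".*"}`

	// Narrow: exact matchers restrict the query to one target and one mode.
	narrow := `node_cpu_seconds_total{instance="node-1:9100",mode="idle"}`

	fmt.Println(queryURL(base, broad))
	fmt.Println(queryURL(base, narrow))
}
```

The narrow selector touches far fewer series, so the query has to load far less data into memory while it is evaluated.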

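【Comments】:

Also, Prometheus has a bunch of pprof request handlers that expose profiling information about CPU usage, memory usage, total memory allocated since startup, and so on. You can get an overview at http://your.prometheus.host:9090/debug/pprof. So, if you have Go installed, you can simply run go tool pprof http://your.prometheus.host:9090/debug/pprof/heap, then type web and press Enter at the command-line prompt that appears. Otherwise, you can get pprof from github.com/google/pprof (or by installing Golang).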

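【Solution 2】:

Because the combinations of labels depend on your business, the number of combinations, and therefore of blocks, may be unbounded. With the current design of Prometheus there is no way to solve the memory problem! However, I suggest compacting small blocks into bigger ones, which reduces the number of blocks.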
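To see how quickly label combinations add up, here is a small Go sketch that computes the upper bound on the number of series for a single metric as the product of the distinct values of each label. The label names and counts are made up for illustration, and `maxSeries` is a hypothetical helper. Real series counts are usually lower, because not every combination occurs.

```go
package main

import "fmt"

// maxSeries (hypothetical helper) returns the upper bound on the number of
// time series for one metric: the product of the number of distinct values
// of each label.
func maxSeries(valuesPerLabel map[string]int) int {
	total := 1
	for _, n := range valuesPerLabel {
		total *= n
	}
	return total
}

func main() {
	// Made-up label set for a single HTTP request metric.
	labels := map[string]int{
		"instance": 500,
		"path":     200,
		"status":   10,
		"method":   5,
	}
	fmt.Println(maxSeries(labels)) // 500 * 200 * 10 * 5
}
```

Adding one more label with many values multiplies the bound again, which is why labels with unbounded values (user IDs, request IDs, and the like) are the usual suspects.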

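There are two reasons for the huge memory consumption:

The Prometheus TSDB has an in-memory block named "head". Because the head stores all the series of the most recent hours, it eats a lot of memory.
Each block on disk also eats memory, because each block on disk has an index reader in memory. Frustratingly, all the labels, postings and symbols of a block are cached in the index reader struct, so the more blocks there are on disk, the more memory is occupied.

In index/index.go, you will see: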

type Reader struct {
    b ByteSlice

    // Close that releases the underlying resources of the byte slice.
    c io.Closer

    // Cached hashmaps of section offsets.
    labels map[string]uint64
    // LabelName to LabelValue to offset map.
    postings map[string]map[string]uint64
    // Cache of read symbols. Strings that are returned when reading from the
    // block are always backed by true strings held in here rather than
    // strings that are backed by byte slices from the mmap'd index file. This
    // prevents memory faults when applications work with read symbols after
    // the block has been unmapped. The older format has sparse indexes so a map
    // must be used, but the new format is not so we can use a slice.
    symbolsV1        map[uint32]string
    symbolsV2        []string
    symbolsTableSize uint64

    dec *Decoder

    version int
}

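【Comments】:

【Solution 3】:

We used Prometheus version 2.19 and saw significantly better memory performance. This blog highlights how this release tackles memory problems. I strongly recommend using it to reduce your instance's resource consumption.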

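【Comments】:

【Solution 4】:

This article explains why Prometheus may use large amounts of memory during data ingestion. If you need to reduce Prometheus's memory usage, the following actions can help:

Increase scrape_interval in the Prometheus configs.
Reduce the number of scrape targets and/or the number of metrics scraped per target.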

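P.S. Also take a look at the project I work on, VictoriaMetrics. It can use less memory than Prometheus. See this benchmark for details.

【Comments】:

Please state which links point to your own blog and projects.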