Elasticsearch and Fluentd optimisation for a log cluster: answers

【Question title】: Elasticsearch and Fluentd optimisation for log cluster
【Posted】: 2021-07-02 08:06:09
【Question description】:

We use Elasticsearch and Fluentd for our central logging platform. Our configuration details are below. Elasticsearch cluster:

Master nodes: 64 GB RAM, 8 CPUs, 9 instances
Data nodes: 64 GB RAM, 8 CPUs, 40 instances
Coordinator nodes: 64 GB RAM, 8 CPUs, 20 instances

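Fluentd: at any given time we have more than 1000 Fluentd instances writing logs to the Elasticsearch coordinator nodes. We create about 700-800 indices per day, which adds up to about 4K shards per day, and we keep at most 40K shards on the cluster. We have started to hit performance problems on the Fluentd side, where Fluentd instances fail to write logs. The common errors are: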

 1. read time out
 2. request time out
 3. {"time":"2021-07-02","level":"warn","message":"failed to flush the buffer. retry_time=9 next_retry_seconds=2021-07-02 07:23:08 265795215088800420057/274877906944000000000 +0000 chunk=\"5c61e5fa4909c276a58b2efd158b832d\" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error=\"could not push logs to Elasticsearch cluster ({:host=>\\\"logs-es-data.internal.tech\\\", :port=>9200, :scheme=>\\\"http\\\"}): [429] {\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"}],\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"},\\\"status\\\":429}\"","worker_id":0}

We are looking for guidance on this: how can we optimise our logging cluster?

【Question discussion】:

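Does this answer your question? "[circuit_breaking_exception] [parent]" Data too large, data for "[<http_request>]" would be error
@Azeem No, we already have a 31 GB heap on servers with 64 GB of RAM.
Could you share your Elasticsearch configuration? Also, what does the "read timeout" error mean?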

Tags: elasticsearch fluentd

【Solution 1】:

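Well, by the look of it, you have exhausted the parent circuit breaker limit, which is 95% of heap memory. The error you posted is covered in the Elasticsearch documentation: https://www.elastic.co/guide/en/elasticsearch/reference/current/fix-common-cluster-issues.html#diagnose-circuit-breaker-errors. That page also lists several steps you can take to reduce JVM memory pressure, which should help reduce this error.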
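The figures in the error message can be checked with a little arithmetic. The sketch below is only an illustration: the byte counts are copied from the log in the question, and the heap size is an assumption (31 GiB, based on the "31 GB heap" mentioned in the comments). Under that assumption the limit works out to 95% of the heap, in-flight requests alone hold a bit more than half of the heap, and the rejected request itself is only about 5.4 MB.

```python
# Byte counts copied from the circuit_breaking_exception in the question.
bytes_limit = 31_621_696_716    # parent breaker limit ("29.4gb")
bytes_wanted = 32_274_168_710   # usage the new request would have caused
in_flight = 17_598_408_008      # in_flight_requests ("16.3gb")
new_bytes = 5_663_718           # size of the rejected request ("5.4mb")

# Assumption: the 31 GB heap from the comments means 31 GiB.
heap_bytes = 31 * 1024**3

limit_fraction = bytes_limit / heap_bytes      # share of heap the breaker allows
in_flight_fraction = in_flight / heap_bytes    # share of heap held by in-flight requests
overshoot = bytes_wanted - bytes_limit         # bytes over the limit

print(f"breaker limit:      {limit_fraction:.1%} of heap")
print(f"in-flight requests: {in_flight_fraction:.1%} of heap")
print(f"over the limit by:  {overshoot / 1024**2:.0f} MiB")
print(f"rejected request:   {new_bytes / 1024**2:.1f} MiB")
```

So the individual bulk request is small; what fills the heap is the total volume of requests in flight at the same time.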

You can also try raising this limit to 98% with a dynamic cluster settings update:

PUT /_cluster/settings
{
  "persistent" : {
    "indices.breaker.total.limit" : "98%" 
  }
}

However, I recommend performance-testing this before applying it to production.

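The error shows about 30 GB of heap in use, with 16.3 GB of that held by in-flight requests, which is too much. For a more reliable fix, I suggest collecting and flushing logs more frequently, so that data is posted to ES more often and in smaller chunks.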
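To illustrate what "smaller chunks" means, here is a minimal Python sketch that splits a stream of log records into `_bulk` request bodies capped at a byte size. It is not how Fluentd is implemented; the function name `split_into_bulk_bodies`, the index name `logs-example` and the `max_bytes` value are all hypothetical, chosen only for the example.

```python
import json
from typing import Iterable, Iterator


def split_into_bulk_bodies(records: Iterable[dict], index: str,
                           max_bytes: int) -> Iterator[str]:
    """Yield newline-delimited _bulk bodies of at most max_bytes each.

    A single record larger than max_bytes is still sent, alone in its own body.
    """
    lines = []
    size = 0
    # The action line is the same for every record in this sketch.
    action = json.dumps({"index": {"_index": index}})
    for record in records:
        entry = action + "\n" + json.dumps(record) + "\n"
        entry_size = len(entry.encode("utf-8"))
        # Close the current body before it would exceed the cap.
        if lines and size + entry_size > max_bytes:
            yield "".join(lines)
            lines, size = [], 0
        lines.append(entry)
        size += entry_size
    if lines:
        yield "".join(lines)


# Hypothetical sample data: 1000 small log records, 4 KiB cap per body.
records = [{"message": f"line {i}"} for i in range(1000)]
bodies = list(split_into_bulk_bodies(records, "logs-example", max_bytes=4096))
print(len(bodies), "bulk bodies, largest is",
      max(len(b.encode("utf-8")) for b in bodies), "bytes")
```

In Fluentd itself the equivalent knobs live in the output plugin's `<buffer>` section: `chunk_limit_size` caps the size of each chunk and `flush_interval` controls how often chunks are flushed. Suitable values depend on your traffic and should be tested.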

【Discussion】: