【Posted】: 2021-07-02 08:06:09
【Question】:
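We use Elasticsearch and Fluentd for our central logging platform. Our configuration is as follows.
Elasticsearch cluster:
Master Nodes: 64 GB RAM, 8 CPUs, 9 instances
Data Nodes: 64 GB RAM, 8 CPUs, 40 instances
Coordinator Nodes: 64 GB RAM, 8 CPUs, 20 instances
Fluentd: At any given time, roughly 1,000+ fluentd instances are writing logs to the Elasticsearch coordinator nodes.
We create about 700-800 indices per day, for a total of about 4K new shards per day, and we keep at most 40K shards on the cluster.
We have started to see performance problems on the Fluentd side, where fluentd instances fail to write logs. The most common errors are: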
1. read time out
2. request time out
3. {"time":"2021-07-02","level":"warn","message":"failed to flush the buffer. retry_time=9 next_retry_seconds=2021-07-02 07:23:08 265795215088800420057/274877906944000000000 +0000 chunk=\"5c61e5fa4909c276a58b2efd158b832d\" error_class=Fluent::Plugin::ElasticsearchOutput::RecoverableRequestFailure error=\"could not push logs to Elasticsearch cluster ({:host=>\\\"logs-es-data.internal.tech\\\", :port=>9200, :scheme=>\\\"http\\\"}): [429] {\\\"error\\\":{\\\"root_cause\\\":[{\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"}],\\\"type\\\":\\\"circuit_breaking_exception\\\",\\\"reason\\\":\\\"[parent] Data too large, data for [<http_request>] would be [32274168710/30gb], which is larger than the limit of [31621696716/29.4gb], real usage: [32268504992/30gb], new bytes reserved: [5663718/5.4mb], usages [request=0/0b, fielddata=0/0b, in_flight_requests=17598408008/16.3gb, model_inference=0/0b, accounting=0/0b]\\\",\\\"bytes_wanted\\\":32274168710,\\\"bytes_limit\\\":31621696716,\\\"durability\\\":\\\"TRANSIENT\\\"},\\\"status\\\":429}\"","worker_id":0}
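The byte counts in the circuit-breaker message above can be pulled apart to see what is filling the heap. The following is only a minimal sketch: `parse_breaker` is a hypothetical helper written for this illustration, its regular expressions assume the exact message format shown in the log line, and the 95% figure is an assumption (the Elasticsearch default for the parent breaker when real-memory accounting is enabled).

```python
import re

# Reason string copied from the 429 response above (one occurrence only).
REASON = (
    "[parent] Data too large, data for [<http_request>] would be "
    "[32274168710/30gb], which is larger than the limit of "
    "[31621696716/29.4gb], real usage: [32268504992/30gb], "
    "new bytes reserved: [5663718/5.4mb], usages [request=0/0b, "
    "fielddata=0/0b, in_flight_requests=17598408008/16.3gb, "
    "model_inference=0/0b, accounting=0/0b]"
)

GIB = 1024 ** 3


def parse_breaker(reason: str) -> dict:
    """Hypothetical helper: extract byte counts from a parent-breaker message."""
    patterns = {
        "wanted": r"would be \[(\d+)/",
        "limit": r"limit of \[(\d+)/",
        "real_usage": r"real usage: \[(\d+)/",
        "reserved": r"new bytes reserved: \[(\d+)/",
        "in_flight": r"in_flight_requests=(\d+)/",
    }
    return {
        name: int(re.search(pattern, reason).group(1))
        for name, pattern in patterns.items()
    }


stats = parse_breaker(REASON)

# Share of the reported heap usage held by in-flight HTTP request bodies.
in_flight_share = stats["in_flight"] / stats["real_usage"]

# How far the rejected request would have pushed usage past the limit.
over_limit = stats["wanted"] - stats["limit"]

# Assumption: the parent breaker limit is 95% of the JVM heap.
implied_heap = stats["limit"] / 0.95

print(f"in-flight share of real usage: {in_flight_share:.0%}")
print(f"over the limit by: {over_limit / GIB:.2f} GiB")
print(f"implied heap size: {implied_heap / GIB:.1f} GiB")
```

With the figures from this one log line, in-flight requests make up a bit over half of the reported real usage, and the implied heap size is consistent with the 31 GB heap mentioned in the comments. That points toward request bodies held in memory on the node that returned the 429, although a single log line is not enough to confirm it.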
We are looking for guidance on this: how can we optimize our logging cluster?
【Discussion】:
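For context, the shard figures in the description can be turned into rough per-node numbers. This is back-of-the-envelope arithmetic using only the values stated in the question; the even spread of shards across data nodes and the midpoint of the 700-800 range are assumptions, and the question does not say whether the shard counts include replicas.

```python
# Figures taken from the question.
data_nodes = 40
shards_per_day = 4_000
max_shards = 40_000

# Assumption: use the midpoint of the stated 700-800 indices per day.
indices_per_day = (700 + 800) / 2

# Assumption: shards are spread evenly across the data nodes.
shards_per_data_node = max_shards / data_nodes

# Days of indices that fit under the stated shard ceiling.
retention_days = max_shards / shards_per_day

# Average number of shards created per new index.
shards_per_index = shards_per_day / indices_per_day

print(f"shards per data node: {shards_per_data_node:.0f}")
print(f"implied retention: {retention_days:.0f} days")
print(f"average shards per index: {shards_per_index:.1f}")
```

Under these assumptions each data node holds on the order of a thousand shards, which is a useful number to have at hand when discussing index and shard sizing.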
-
@Azeem No, we already have a 31 GB heap on servers with 64 GB of RAM.
-
Could you share your Elasticsearch configuration? Also, what exactly is the "read time out" error?
Tags: elasticsearch fluentd