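[Posted]: 2021-09-28 00:35:45
[Question]:
We regularly run into Elasticsearch crashes. Sometimes it also spikes RAM and CPU usage, and the server becomes unresponsive.
We left most settings at their defaults, but had to give the JVM heap more RAM (48 GB) so that it would crash less often.
I started digging, and apparently 32 GB is the maximum heap you should use. We will adjust that.
The server is: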
CentOS 7
RAM: 125GB
CPU: 40 threads
Hard Drive: 2x Raid 1 NVME
^^^ That is more than enough hardware for a workload like this, but something tells me more configuration is needed to handle this much data.
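As a rough sketch of the sizing rule mentioned above (heap no larger than about half of RAM, and kept under the roughly 32 GB compressed-object-pointer threshold), the following Python computes matching flags for this server. The function names and the 31 GB ceiling are illustrative assumptions, not part of Elasticsearch.

```python
def recommended_heap_gb(total_ram_gb: float, ceiling_gb: int = 31) -> int:
    """Return a heap size in whole GB: at most half of RAM, capped at ceiling_gb.

    ceiling_gb=31 is an assumed safe value below the ~32 GB compressed-oops
    threshold; the exact cutoff depends on the JVM.
    """
    half_ram = int(total_ram_gb // 2)
    return max(1, min(half_ram, ceiling_gb))


def heap_flags(total_ram_gb: float) -> list:
    """Render -Xms/-Xmx lines; min and max heap are set to the same value."""
    heap = recommended_heap_gb(total_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


# The server described above has 125 GB of RAM.
print(heap_flags(125))
```

The remaining RAM is not wasted: the operating system uses it for the filesystem cache, which Lucene relies on heavily.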
We run a Magento 2.4.3 CE store with about 400,000 products.
Here are all of our config files:
**jvm.options**
## JVM configuration
################################################################
## IMPORTANT: JVM heap size
################################################################
##
## You should always set the min and max JVM heap
## size to the same value. For example, to set
## the heap to 4 GB, set:
##
## -Xms4g
## -Xmx4g
##
## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
## for more information
##
################################################################
# Xms represents the initial size of total heap space
# Xmx represents the maximum size of total heap space
-Xms48g
-Xmx48g
################################################################
## Expert settings
################################################################
##
## All settings below this section are considered
## expert settings. Don't tamper with them unless
## you understand what you are doing
##
################################################################
## GC configuration
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
## G1GC Configuration
# NOTE: G1 GC is only supported on JDK version 10 or later
# to use G1GC, uncomment the next two lines and update the version on the
# following three lines to your version of the JDK
# 10-13:-XX:-UseConcMarkSweepGC
# 10-13:-XX:-UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
## DNS cache policy
# cache ttl in seconds for positive DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.ttl; set to -1 to cache forever
-Des.networkaddress.cache.ttl=60
# cache ttl in seconds for negative DNS lookups noting that this overrides the
# JDK security property networkaddress.cache.negative ttl; set to -1 to cache
# forever
-Des.networkaddress.cache.negative.ttl=10
## optimizations
# pre-touch memory pages used by the JVM during initialization
-XX:+AlwaysPreTouch
## basic
# explicitly set the stack size
-Xss1m
# set to headless, just in case
-Djava.awt.headless=true
# ensure UTF-8 encoding by default (e.g. filenames)
-Dfile.encoding=UTF-8
# use our provided JNA always versus the system one
-Djna.nosys=true
# turn off a JDK optimization that throws away stack traces for common
# exceptions because stack traces are important for debugging
-XX:-OmitStackTraceInFastThrow
# enable helpful NullPointerExceptions (https://openjdk.java.net/jeps/358), if
# they are supported
14-:-XX:+ShowCodeDetailsInExceptionMessages
# flags to configure Netty
-Dio.netty.noUnsafe=true
-Dio.netty.noKeySetOptimization=true
-Dio.netty.recycler.maxCapacityPerThread=0
# log4j 2
-Dlog4j.shutdownHookEnabled=false
-Dlog4j2.disable.jmx=true
-Djava.io.tmpdir=${ES_TMPDIR}
## heap dumps
# generate a heap dump when an allocation from the Java heap fails
# heap dumps are created in the working directory of the JVM
-XX:+HeapDumpOnOutOfMemoryError
# specify an alternative path for heap dumps; ensure the directory exists and
# has sufficient space
-XX:HeapDumpPath=/var/lib/elasticsearch
# specify an alternative path for JVM fatal error logs
-XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
## JDK 8 GC logging
8:-XX:+PrintGCDetails
8:-XX:+PrintGCDateStamps
8:-XX:+PrintTenuringDistribution
8:-XX:+PrintGCApplicationStoppedTime
8:-Xloggc:/var/log/elasticsearch/gc.log
8:-XX:+UseGCLogFileRotation
8:-XX:NumberOfGCLogFiles=32
8:-XX:GCLogFileSize=64m
# JDK 9+ GC logging
9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
# due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
# time/date parsing will break in an incompatible way for some date patterns and locals
9-:-Djava.locale.providers=COMPAT
# temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
10-:-XX:UseAVX=2
**elasticsearch.yml**
# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
# Before you set out to tweak and tune the configuration, make sure you
# understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
#cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
#node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /var/lib/elasticsearch
#
# Path to log files:
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
#network.host: 192.168.0.1
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
#discovery.zen.minimum_master_nodes:
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
xpack.security.enabled: true
From my research, the RAM + CPU spikes may be caused by these settings not being set:
gateway.expected_nodes: 10
gateway.recover_after_time: 5m
Here is some additional data from Elasticsearch:
curl -XGET --user username:password http://localhost:9200/
{
  "name" : "web1.example.com",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "S8fFQ993QDWkLY8lZtp_mQ",
  "version" : {
    "number" : "7.13.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "4d960a0733be83dd2543ca018aa4ddc42e956800",
    "build_date" : "2021-06-10T21:01:55.251515791Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}
curl --user username:password -sS http://localhost:9200/_cluster/health?pretty
{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 55.55555555555556
}
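To make the health response above easier to read, here is a small Python sketch that derives the reported percentage from the shard counts. The `summarize_health` helper and its output keys are hypothetical names chosen for illustration; only the numbers come from the response.

```python
import json

# Counts copied from the _cluster/health response above.
HEALTH = json.loads("""
{
  "status": "yellow",
  "number_of_nodes": 1,
  "active_primary_shards": 5,
  "active_shards": 5,
  "initializing_shards": 0,
  "unassigned_shards": 4
}
""")


def summarize_health(health: dict) -> dict:
    """Derive a few figures that explain a yellow status."""
    active = health["active_shards"]
    missing = health["unassigned_shards"] + health["initializing_shards"]
    total = active + missing
    return {
        # Share of shard copies that are assigned and started.
        "active_percent": 100.0 * active / total if total else 100.0,
        # Active copies that are replicas rather than primaries.
        "active_replicas": active - health["active_primary_shards"],
        # Copies still waiting for a node.
        "missing_copies": missing,
    }


print(summarize_health(HEALTH))
```

With 5 active copies out of 9, the share works out to about 55.6 %, matching `active_shards_percent_as_number`. All 5 active copies are primaries, so the data is fully searchable; the unassigned copies are what keep the status at yellow.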
curl --user username:password -sS http://localhost:9200/_cluster/allocation/explain?pretty
{
  "index" : "example-amasty_product_1_v156",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-09-14T16:52:28.854Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "2THEUTSaQdmOJAAhTTN71g",
      "node_name" : "web1.example.com",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "134622244864",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "51539607552"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node"
        }
      ]
    }
  ]
}
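^^^ The problem is that I don't know how to set up multiple nodes on a single machine.
As I understand it, the misconfiguration is that we are running only one node. From what I have read, a green status requires 3 master nodes.
How do I set up multiple nodes on a single machine, and do I need to add more data nodes?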
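The `same_shard` decision in the allocation output above can be illustrated with a toy model. Green status means every primary and every replica is assigned, and two copies of the same shard are never placed on the same node, so the number of nodes needed is the replica count plus one rather than a fixed number of master nodes. The Python below is a simplified sketch of that rule with hypothetical function names, not Elasticsearch's actual allocator.

```python
def nodes_needed_for_green(number_of_replicas: int) -> int:
    """One node for the primary plus one per replica."""
    return number_of_replicas + 1


def allocate_copies(node_names: list, number_of_replicas: int):
    """Place 1 primary + N replicas of one shard, at most one copy per node.

    Returns (assigned, unassigned): a node -> copy mapping and the list of
    copies that found no node.
    """
    copies = ["primary"] + [f"replica-{i}" for i in range(1, number_of_replicas + 1)]
    assigned = dict(zip(node_names, copies))   # zip stops when nodes run out
    unassigned = copies[len(node_names):]
    return assigned, unassigned


# Single node, one replica per shard: the replica has nowhere to go (yellow).
print(allocate_copies(["web1.example.com"], 1))
# Single node, zero replicas: nothing is left unassigned (green).
print(allocate_copies(["web1.example.com"], 0))
```

On a single-node setup, the two ways to get to green are therefore to add a node or to set the replica count to 0.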
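My main suspicions:
- Not enough master/data nodes
- A problem with the newer garbage collector (G1GC). I was not sure how to tell from the config which collector is currently enabled, but I have since figured it out: G1 is in use.
- No recovery settings in place for when a crash happens (gateway.expected_nodes, gateway.recover_after_time)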
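On the question of telling from the config which collector is enabled: lines in jvm.options can carry a JDK version prefix, where `8:` means exactly JDK 8, `8-13:` means JDK 8 through 13, and `14-:` means JDK 14 or later. The following Python sketch filters the GC lines from the file above for a given JDK major version. The helper names are hypothetical, and the version numbers passed in are only examples.

```python
import re

# Optional prefix: "8:", "8-13:" or "14-:", followed by the JVM flag itself.
_PREFIX = re.compile(r"^(\d+)(-(\d+)?)?:(.*)$")


def applies(line: str, jdk_major: int):
    """Return the flag if this jvm.options line applies to jdk_major, else None."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None                      # blank line or comment
    match = _PREFIX.match(line)
    if match is None:
        return line                      # unprefixed flags apply to every JDK
    low = int(match.group(1))
    if match.group(2) is None:
        high = low                       # "8:"   -> exactly this version
    elif match.group(3) is None:
        high = None                      # "14-:" -> this version or later
    else:
        high = int(match.group(3))       # "8-13:" -> inclusive range
    if jdk_major < low or (high is not None and jdk_major > high):
        return None
    return match.group(4)


def active_flags(lines: list, jdk_major: int) -> list:
    """Collect every flag that takes effect on the given JDK major version."""
    flags = (applies(line, jdk_major) for line in lines)
    return [flag for flag in flags if flag is not None]


GC_LINES = [
    "8-13:-XX:+UseConcMarkSweepGC",
    "8-13:-XX:CMSInitiatingOccupancyFraction=75",
    "8-13:-XX:+UseCMSInitiatingOccupancyOnly",
    "14-:-XX:+UseG1GC",
    "14-:-XX:G1ReservePercent=25",
    "14-:-XX:InitiatingHeapOccupancyPercent=30",
]

print(active_flags(GC_LINES, 11))   # CMS flags only
print(active_flags(GC_LINES, 16))   # G1 flags only
```

In other words, with this file the collector is decided by the JDK that runs Elasticsearch: CMS on JDK 8 to 13, G1 on JDK 14 and later.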
Update:
Here is the error log from elasticsearch.log:
https://privfile.com/download.php?fid=6141256d5e639-MTAwNTM=
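Sorry, the log file is too large to fit in a Stack Overflow post :)
Pastebin:
Part 1: https://pastebin.com/86sLM9BD Part 2: https://pastebin.com/1VEn63TQ
Update:
Output of: _cluster/stats?pretty&human
Update:
Figured out how to limit the number of replicas.
This can be done with a template: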
PUT _template/all
{
  "template": "*",
  "settings": {
    "number_of_replicas": 0
  }
}
Tomorrow I will test whether it has an effect and turns the status green.
I don't think it will make any difference performance-wise, but we'll see.
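One caveat about the template above: index templates are applied only when an index is created, so indices that already exist keep their current replica count until their settings are updated directly. The sketch below builds (but does not send) the corresponding update-settings request in Python. The `build_replica_update` helper is a hypothetical name, and the URL and credentials are the placeholders used in the curl commands earlier.

```python
import base64
import json
import urllib.request


def build_replica_update(base_url, username, password, replicas=0, target="_all"):
    """Build a PUT <target>/_settings request that changes number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/{target}/_settings",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            # xpack.security is enabled in elasticsearch.yml, so send Basic auth.
            "Authorization": f"Basic {token}",
        },
    )


request = build_replica_update("http://localhost:9200", "username", "password")
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# To actually apply the change: urllib.request.urlopen(request)
```

Keeping the request construction separate from sending makes it easy to inspect what would be changed before touching a production cluster.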
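I am working through the other suggestions:
- RAM usage limited to 31 GB
- File descriptors are already set to 65535
- Maximum number of threads is already set to 4096
- Maximum size virtual memory check: limit increased and configured
- Maximum map count raised to 262144
- G1GC is disabled (the default)
One thing I am trying is lowering: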
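The numeric limits in the list above can be compared against a running system with a few lines of Python. This is a sketch under assumptions: the thresholds are the ones quoted in the list, the helper names are hypothetical, and `read_linux_limits` reports the limits of the process that runs the script, which may differ from those of the Elasticsearch service user.

```python
# Minimum values quoted in the list above.
REQUIRED = {
    "max_file_descriptors": 65535,
    "max_threads": 4096,
    "vm_max_map_count": 262144,
}


def failing_limits(observed: dict, required: dict = REQUIRED) -> dict:
    """Return {name: (observed, minimum)} for every limit below its minimum."""
    return {
        name: (observed.get(name, 0), minimum)
        for name, minimum in required.items()
        if observed.get(name, 0) < minimum
    }


def read_linux_limits() -> dict:
    """Read the current process's limits (Linux only)."""
    import resource

    def soft(limit):
        value, _hard = resource.getrlimit(limit)
        # RLIM_INFINITY means "unlimited"; treat it as always sufficient.
        return float("inf") if value == resource.RLIM_INFINITY else value

    with open("/proc/sys/vm/max_map_count") as handle:
        map_count = int(handle.read().strip())
    return {
        "max_file_descriptors": soft(resource.RLIMIT_NOFILE),
        "max_threads": soft(resource.RLIMIT_NPROC),
        "vm_max_map_count": map_count,
    }


# Example with made-up values resembling an untuned Linux host.
print(failing_limits({
    "max_file_descriptors": 1024,
    "max_threads": 4096,
    "vm_max_map_count": 65530,
}))
```

Calling `failing_limits(read_linux_limits())` on the server itself would list whichever of the three limits is still too low for the current user.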
8-13:-XX:CMSInitiatingOccupancyFraction=75
to
8-13:-XX:CMSInitiatingOccupancyFraction=70
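I believe this will make garbage collection kick in sooner and prevent out-of-memory errors. If it helps, we will try adjusting the value up and down to see the effect.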
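For a sense of scale: `CMSInitiatingOccupancyFraction` is the percentage of old-generation occupancy at which a CMS cycle starts, so the change above moves the trigger point by 5 % of the old generation. The Python sketch below works through the arithmetic. The 20 GB old-generation size is a purely hypothetical figure, since the real split between young and old generations is not given in the post. Note also that lines prefixed with `8-13:` only take effect on JDK 8 through 13.

```python
def cms_trigger_gb(old_gen_gb: float, occupancy_fraction: int) -> float:
    """Old-generation occupancy (in GB) at which a CMS cycle is started."""
    return old_gen_gb * occupancy_fraction / 100.0


OLD_GEN_GB = 20.0  # hypothetical old-generation size, for illustration only

for fraction in (75, 70):
    print(f"fraction={fraction}: CMS starts at {cms_trigger_gb(OLD_GEN_GB, fraction)} GB")
```

Under that assumption the collector would start 1 GB earlier, which buys some headroom but does not reduce how much memory the workload actually needs.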
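Switching to G1GC
I realize this is not really encouraged, but there are some articles discussing similar out-of-memory problems where switching to G1GC helped resolve the issue: https://medium.com/naukri-engineering/garbage-collection-in-elasticsearch-and-the-g1gc-16b79a447181
This will be the last thing I try.
Update:
After all of these changes, the indices finally turned green (the template fix worked).
There were no problems overnight either. It is not as snappy as it was with 50 GB of RAM, but at least it is stable.
General advice for future Elasticsearch troubleshooters: go through the bootstrap checks. That will at least get you to a performance baseline.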
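Update: found a problem where the JVM was picking up settings from two locations and using them for different purposes.
It looks like the sysadmin had put a heap_size.options file into
/etc/elasticsearch/jvm.options.d
That file set the JVM heap to 31 GB, but the main jvm.options file said 8 GB. As a result, the GC collector threads were running with only 8 GB of RAM, while all 31 GB of RAM was still being occupied.
I deleted that file and set 31 GB in the jvm.options file instead.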
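A conflict like the one described above can be spotted by scanning every options file for heap flags. The Python below is a minimal sketch: the helper names are hypothetical, `/etc/elasticsearch/jvm.options` is assumed to be the location of the main file, and the sample contents mirror the 8 GB versus 31 GB situation from this update.

```python
import re
from pathlib import Path

# -Xms/-Xmx with an optional JDK version prefix such as "8-13:" or "14-:".
HEAP_FLAG = re.compile(r"^(?:\d+(?:-\d*)?:)?(-Xm[sx])(\d+[kKmMgG]?)\s*$")


def find_heap_settings(files: dict) -> list:
    """Return (file name, flag, size) for every heap flag in {name: text}."""
    found = []
    for name, text in files.items():
        for line in text.splitlines():
            match = HEAP_FLAG.match(line.strip())
            if match:                      # commented-out lines do not match
                found.append((name, match.group(1), match.group(2)))
    return found


def conflicting_heap_flags(files: dict) -> dict:
    """Return {flag: sorted sizes} for flags set to more than one value."""
    values = {}
    for _name, flag, size in find_heap_settings(files):
        values.setdefault(flag, set()).add(size.lower())
    return {flag: sorted(sizes) for flag, sizes in values.items() if len(sizes) > 1}


def read_option_files(config_dir: str = "/etc/elasticsearch") -> dict:
    """Load jvm.options and jvm.options.d/*.options (assumed layout)."""
    root = Path(config_dir)
    paths = [root / "jvm.options"] + sorted((root / "jvm.options.d").glob("*.options"))
    return {str(path): path.read_text() for path in paths if path.is_file()}


SAMPLE = {
    "/etc/elasticsearch/jvm.options": "## -Xms4g\n-Xms8g\n-Xmx8g\n",
    "/etc/elasticsearch/jvm.options.d/heap_size.options": "-Xms31g\n-Xmx31g\n",
}
print(conflicting_heap_flags(SAMPLE))
```

On the server, `conflicting_heap_flags(read_option_files())` would return an empty dict once the heap is defined in only one place.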
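This stabilized the situation somewhat, but GC was still collecting at a high rate.
As soon as I add any attribute to the list of attributes to be indexed, garbage collection runs out of memory again.
The only way to recover is to delete the index and reindex.
I am considering nuking the whole Elasticsearch installation and setting it up myself.
It shouldn't be that hard.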
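[Discussion]:
- A yellow cluster means that some replicas cannot be allocated. Because you have only 1 node, you need to set the number of replicas to 0, which will turn the cluster green. As for why your cluster is crashing, you need to provide more information; can you share the logs from the time of the crash? Also try setting the heap to 30 GB instead of 48 GB (see the documentation).
- @leandrojmp Thanks! I updated the post with elasticsearch.log. I am currently stuck without a DEV server (it is being used for something else), so I need to sort this out on the Docker setup on my local machine.
Tags: elasticsearch magento2 elasticsearch-7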