Elasticsearch crashing: answers

【Question title】: Elasticsearch crashing
【Posted】: 2021-09-28 00:35:45
【Question】:

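We keep running into Elasticsearch crashes. Sometimes it also spikes RAM and CPU usage, and the server becomes unresponsive.

We left most settings at their defaults, but had to give the JVM heap more RAM (48GB) so that it would crash less often.

I started digging, and apparently 32GB is the maximum you should use. We will adjust that.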
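
The 32GB figure comes from the JVM's compressed object pointers, which stop being used once the heap grows past roughly that size. Elastic's guidance is also to give the heap no more than half of the machine's RAM. A minimal sketch of that rule of thumb follows. The helper names `recommended_heap_gb` and `heap_flags` are made up for this illustration, and the 31GB ceiling is an assumed conservative cap (the exact cutoff depends on the JVM):

```python
from typing import List


def recommended_heap_gb(total_ram_gb: int, ceiling_gb: int = 31) -> int:
    """Rule-of-thumb heap size: at most half of RAM, capped below ~32GB.

    ceiling_gb is an assumed conservative cap meant to stay under the
    compressed-oops threshold; it is not an official constant.
    """
    half_of_ram = total_ram_gb // 2
    return max(1, min(half_of_ram, ceiling_gb))


def heap_flags(total_ram_gb: int) -> List[str]:
    """Return -Xms/-Xmx lines with min and max set to the same value."""
    size = recommended_heap_gb(total_ram_gb)
    return [f"-Xms{size}g", f"-Xmx{size}g"]


if __name__ == "__main__":
    # The server in this question has 125GB of RAM.
    print(heap_flags(125))
```

The RAM left over is not wasted: Lucene relies on the operating system's file cache for the index files, which is why the heap is deliberately kept to a fraction of the machine.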

The server is:

CentOS 7
RAM: 125GB
CPU: 40 threads
Hard Drive: 2x Raid 1 NVME

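^^^ That is more than enough hardware for something like this, but something tells me more configuration is needed to handle this much data.

We run a Magento 2.4.3 CE store with about 400,000 products.

Here are all of our config files:

jvm.options file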

    ## JVM configuration
    
    ################################################################
    ## IMPORTANT: JVM heap size
    ################################################################
    ##
    ## You should always set the min and max JVM heap
    ## size to the same value. For example, to set
    ## the heap to 4 GB, set:
    ##
    ## -Xms4g
    ## -Xmx4g
    ##
    ## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
    ## for more information
    ##
    ################################################################
    
    # Xms represents the initial size of total heap space
    # Xmx represents the maximum size of total heap space
    
    -Xms48g
    -Xmx48g
    
    ################################################################
    ## Expert settings
    ################################################################
    ##
    ## All settings below this section are considered
    ## expert settings. Don't tamper with them unless
    ## you understand what you are doing
    ##
    ################################################################
    
    ## GC configuration
    8-13:-XX:+UseConcMarkSweepGC
    8-13:-XX:CMSInitiatingOccupancyFraction=75
    8-13:-XX:+UseCMSInitiatingOccupancyOnly
    
    ## G1GC Configuration
    # NOTE: G1 GC is only supported on JDK version 10 or later
    # to use G1GC, uncomment the next two lines and update the version on the
    # following three lines to your version of the JDK
    # 10-13:-XX:-UseConcMarkSweepGC
    # 10-13:-XX:-UseCMSInitiatingOccupancyOnly
    14-:-XX:+UseG1GC
    14-:-XX:G1ReservePercent=25
    14-:-XX:InitiatingHeapOccupancyPercent=30
    
    ## DNS cache policy
    # cache ttl in seconds for positive DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.ttl; set to -1 to cache forever
    -Des.networkaddress.cache.ttl=60
    # cache ttl in seconds for negative DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.negative ttl; set to -1 to cache
    # forever
    -Des.networkaddress.cache.negative.ttl=10
    
    ## optimizations
    
    # pre-touch memory pages used by the JVM during initialization
    -XX:+AlwaysPreTouch
    
    ## basic
    
    # explicitly set the stack size
    -Xss1m
    
    # set to headless, just in case
    -Djava.awt.headless=true
    
    # ensure UTF-8 encoding by default (e.g. filenames)
    -Dfile.encoding=UTF-8
    
    # use our provided JNA always versus the system one
    -Djna.nosys=true
    
    # turn off a JDK optimization that throws away stack traces for common
    # exceptions because stack traces are important for debugging
    -XX:-OmitStackTraceInFastThrow
    
    # enable helpful NullPointerExceptions (https://openjdk.java.net/jeps/358), if
    # they are supported
    14-:-XX:+ShowCodeDetailsInExceptionMessages
    
    # flags to configure Netty
    -Dio.netty.noUnsafe=true
    -Dio.netty.noKeySetOptimization=true
    -Dio.netty.recycler.maxCapacityPerThread=0
    
    # log4j 2
    -Dlog4j.shutdownHookEnabled=false
    -Dlog4j2.disable.jmx=true
    
    -Djava.io.tmpdir=${ES_TMPDIR}
    
    ## heap dumps
    
    # generate a heap dump when an allocation from the Java heap fails
    # heap dumps are created in the working directory of the JVM
    -XX:+HeapDumpOnOutOfMemoryError
    
    # specify an alternative path for heap dumps; ensure the directory exists and
    # has sufficient space
    -XX:HeapDumpPath=/var/lib/elasticsearch
    
    # specify an alternative path for JVM fatal error logs
    -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
    
    ## JDK 8 GC logging
    
    8:-XX:+PrintGCDetails
    8:-XX:+PrintGCDateStamps
    8:-XX:+PrintTenuringDistribution
    8:-XX:+PrintGCApplicationStoppedTime
    8:-Xloggc:/var/log/elasticsearch/gc.log
    8:-XX:+UseGCLogFileRotation
    8:-XX:NumberOfGCLogFiles=32
    8:-XX:GCLogFileSize=64m
    
    # JDK 9+ GC logging
    9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
    # due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
    # time/date parsing will break in an incompatible way for some date patterns and locals
    9-:-Djava.locale.providers=COMPAT
    
    # temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
    10-:-XX:UseAVX=2


**elasticsearch.yml**

# ======================== Elasticsearch Configuration =========================
#
# NOTE: Elasticsearch comes with reasonable defaults for most settings.
#       Before you set out to tweak and tune the configuration, make sure you
#       understand what are you trying to accomplish and the consequences.
#
# The primary way of configuring a node is via this file. This template lists
# the most important settings you may want to configure for a production cluster.
#
# Please consult the documentation for further information on configuration options:
# https://www.elastic.co/guide/en/elasticsearch/reference/index.html
#
# ---------------------------------- Cluster -----------------------------------
#
# Use a descriptive name for your cluster:
#
#cluster.name: my-application
#
# ------------------------------------ Node ------------------------------------
#
# Use a descriptive name for the node:
#
#node.name: node-1
#
# Add custom attributes to the node:
#
#node.attr.rack: r1
#
# ----------------------------------- Paths ------------------------------------
#
# Path to directory where to store the data (separate multiple locations by comma):
#
path.data: /var/lib/elasticsearch
#
# Path to log files:
#
path.logs: /var/log/elasticsearch
#
# ----------------------------------- Memory -----------------------------------
#
# Lock the memory on startup:
#
#bootstrap.memory_lock: true
#
# Make sure that the heap size is set to about half the memory available
# on the system and that the owner of the process is allowed to use this
# limit.
#
# Elasticsearch performs poorly when the system is swapping the memory.
#
# ---------------------------------- Network -----------------------------------
#
# Set the bind address to a specific IP (IPv4 or IPv6):
#
#network.host: 192.168.0.1
#
# Set a custom port for HTTP:
#
#http.port: 9200
#
# For more information, consult the network module documentation.
#
# --------------------------------- Discovery ----------------------------------
#
# Pass an initial list of hosts to perform discovery when new node is started:
# The default list of hosts is ["127.0.0.1", "[::1]"]
#
#discovery.zen.ping.unicast.hosts: ["host1", "host2"]
#
# Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
#
#discovery.zen.minimum_master_nodes: 
#
# For more information, consult the zen discovery module documentation.
#
# ---------------------------------- Gateway -----------------------------------
#
# Block initial recovery after a full cluster restart until N nodes are started:
#
#gateway.recover_after_nodes: 3
#
# For more information, consult the gateway module documentation.
#
# ---------------------------------- Various -----------------------------------
#
# Require explicit names when deleting indices:
#
#action.destructive_requires_name: true
xpack.security.enabled: true

From my research, the RAM + CPU spikes might be caused by these settings not being set:

gateway.expected_nodes: 10
gateway.recover_after_time: 5m

Here is some other data from Elasticsearch:

curl -XGET --user username:password http://localhost:9200/

{
  "name" : "web1.example.com",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "S8fFQ993QDWkLY8lZtp_mQ",
  "version" : {
    "number" : "7.13.2",
    "build_flavor" : "default",
    "build_type" : "rpm",
    "build_hash" : "4d960a0733be83dd2543ca018aa4ddc42e956800",
    "build_date" : "2021-06-10T21:01:55.251515791Z",
    "build_snapshot" : false,
    "lucene_version" : "8.8.2",
    "minimum_wire_compatibility_version" : "6.8.0",
    "minimum_index_compatibility_version" : "6.0.0-beta1"
  },
  "tagline" : "You Know, for Search"
}

curl --user username:password -sS http://localhost:9200/_cluster/health?pretty

{
  "cluster_name" : "elasticsearch",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 5,
  "active_shards" : 5,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 4,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 55.55555555555556
}
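
The `active_shards_percent_as_number` value above follows directly from the shard counts: 5 active copies out of 9 in total. A small sketch of that arithmetic, assuming the percentage is active shards divided by all shard copies; `active_percent` is a name invented for this example:

```python
def active_percent(health: dict) -> float:
    """Percentage of shard copies that are active.

    Assumes the total is active + initializing + unassigned copies
    (relocating shards are already counted as active).
    """
    total = (
        health["active_shards"]
        + health["initializing_shards"]
        + health["unassigned_shards"]
    )
    return 100.0 * health["active_shards"] / total if total else 100.0


# Counts copied from the _cluster/health output above.
HEALTH = {"active_shards": 5, "initializing_shards": 0, "unassigned_shards": 4}

if __name__ == "__main__":
    print(round(active_percent(HEALTH), 2))
```

Since `active_primary_shards` equals `active_shards`, every primary is active and the 4 unassigned copies are replicas, which is why the status is yellow rather than red.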

curl --user username:password -sS http://localhost:9200/_cluster/allocation/explain?pretty

{
  "index" : "example-amasty_product_1_v156",
  "shard" : 0,
  "primary" : false,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "INDEX_CREATED",
    "at" : "2021-09-14T16:52:28.854Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
  "node_allocation_decisions" : [
    {
      "node_id" : "2THEUTSaQdmOJAAhTTN71g",
      "node_name" : "web1.example.com",
      "transport_address" : "127.0.0.1:9300",
      "node_attributes" : {
        "ml.machine_memory" : "134622244864",
        "xpack.installed" : "true",
        "transform.node" : "true",
        "ml.max_open_jobs" : "512",
        "ml.max_jvm_size" : "51539607552"
      },
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "same_shard",
          "decision" : "NO",
          "explanation" : "a copy of this shard is already allocated to this node"
        }
      ]
    }
  ]
}
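
The part of this response that matters is the list of deciders that answered `NO`. As a rough sketch, the reasons can be pulled out of the parsed JSON like this; `blocking_reasons` is a hypothetical helper and `EXPLAIN` is a trimmed copy of the output above:

```python
from typing import List, Tuple


def blocking_reasons(explain: dict) -> List[Tuple[str, str, str]]:
    """Collect (node, decider, explanation) for every decider that said NO."""
    reasons = []
    for node in explain.get("node_allocation_decisions", []):
        for decider in node.get("deciders", []):
            if decider.get("decision") == "NO":
                reasons.append(
                    (node["node_name"], decider["decider"], decider["explanation"])
                )
    return reasons


# Trimmed copy of the _cluster/allocation/explain output above.
EXPLAIN = {
    "index": "example-amasty_product_1_v156",
    "primary": False,
    "node_allocation_decisions": [
        {
            "node_name": "web1.example.com",
            "deciders": [
                {
                    "decider": "same_shard",
                    "decision": "NO",
                    "explanation": "a copy of this shard is already allocated to this node",
                }
            ],
        }
    ],
}

if __name__ == "__main__":
    for node, decider, why in blocking_reasons(EXPLAIN):
        print(f"{node}: {decider}: {why}")
```

Here the only blocker is `same_shard`: the unassigned copy is a replica (`"primary" : false`) and the single node already holds its primary.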

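^^^ The problem is that I don't know how to set up multiple nodes on a single machine.

As I understand it, the misconfiguration is that we are running only one node. From what I have read, green status requires 3 master nodes.

How do I set up multiple nodes on a single machine, and do I need to add data nodes?

My main suspicions:

Not enough master/data nodes
A problem with the newer garbage collector (G1GC is enabled; I was not sure how to tell from the config which collector is currently active). Edit: figured it out, G1 is in use.
No recovery settings for when a crash happens (gateway.expected_nodes, gateway.recover_after_time)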
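
On the question of which collector the config selects: lines in jvm.options can carry a JDK version prefix, where `8:` means exactly JDK 8, `8-13:` means JDK 8 through 13, and `14-:` means JDK 14 or later. The sketch below evaluates those prefixes for the GC section of the file shown earlier. The function name `applies_to` is invented for this example, and the parser is a simplification rather than Elasticsearch's own implementation:

```python
import re
from typing import List, Optional

# Optional "N:", "N-:" or "N-M:" prefix followed by the JVM flag.
_PREFIX = re.compile(r"^(\d+)(-)?(\d+)?:(.*)$")


def applies_to(line: str, jdk_major: int) -> Optional[str]:
    """Return the JVM flag if this line applies to jdk_major, else None."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    match = _PREFIX.match(stripped)
    if match is None:
        return stripped  # no version prefix: applies to every JDK
    low, dash, high, flag = match.groups()
    lower = int(low)
    if dash is None:
        upper = lower  # "8:" means exactly JDK 8
    elif high is None:
        upper = None  # "14-:" means JDK 14 and later
    else:
        upper = int(high)  # "8-13:" means JDK 8 through 13
    if jdk_major < lower or (upper is not None and jdk_major > upper):
        return None
    return flag


def effective_flags(lines: List[str], jdk_major: int) -> List[str]:
    """Keep only the flags that apply to the given JDK major version."""
    flags = (applies_to(line, jdk_major) for line in lines)
    return [flag for flag in flags if flag is not None]


# GC section copied from the jvm.options file above.
GC_SECTION = [
    "8-13:-XX:+UseConcMarkSweepGC",
    "8-13:-XX:CMSInitiatingOccupancyFraction=75",
    "8-13:-XX:+UseCMSInitiatingOccupancyOnly",
    "# 10-13:-XX:-UseConcMarkSweepGC",
    "14-:-XX:+UseG1GC",
    "14-:-XX:G1ReservePercent=25",
    "14-:-XX:InitiatingHeapOccupancyPercent=30",
]

if __name__ == "__main__":
    for major in (11, 16):
        print(major, effective_flags(GC_SECTION, major))
```

With this file, any JDK from 14 onward gets the G1 flags and ignores the `8-13:` CMS lines, so editing `CMSInitiatingOccupancyFraction` has no effect on such a JVM.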

Update:

Here is the error log from elasticsearch.log

https://privfile.com/download.php?fid=6141256d5e639-MTAwNTM=

Sorry, the log file does not fit in a Stack Overflow post :)

Pastebin:

Part 1: https://pastebin.com/86sLM9BD Part 2: https://pastebin.com/1VEn63TQ

Update:

Output of: _cluster/stats?pretty&human

https://pastebin.com/EM8ZMVst

Update:

Figured out how to limit the number of replicas.

This can be done with a template:

PUT _template/all
{
  "template": "*",
  "settings": {
    "number_of_replicas": 0
  }
}

Tomorrow I will test whether it takes effect and turns the status green.

I don't think it will make any difference performance-wise, but we will see.
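
One caveat: index templates are applied only when an index is created, so indices that already exist keep their replica count until it is changed through the index settings API (or the index is recreated, as a full reindex does). A hedged sketch of that settings request using only the Python standard library follows. `build_replica_update` is a made-up helper, the index name is the one from the allocation explanation above, and the host and credentials are placeholders:

```python
import base64
import json
import urllib.request


def build_replica_update(base_url: str, index: str, replicas: int,
                         username: str, password: str) -> urllib.request.Request:
    """Build a PUT <index>/_settings request that sets number_of_replicas."""
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
    )


if __name__ == "__main__":
    request = build_replica_update(
        "http://localhost:9200",          # placeholder host
        "example-amasty_product_1_v156",  # index from the explain output
        0,
        "username",                       # placeholder credentials
        "password",
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request)
```

The request is only built here, not sent, so the sketch can be run without a cluster.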

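I am working through the other suggestions:

RAM usage limited to 31GB
File descriptors set to 65535
Maximum number of threads set to 4096
Maximum size virtual memory check: limit raised and configured
Max map count increased to 262144
G1GC disabled (the default)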
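
The numeric limits in this list can be checked with a few lines of code. The sketch below compares observed values against the thresholds quoted above; the names `failed_checks` and `read_linux_limits` are invented for this example, and the reader function is Linux-only and reports the limits of the user running the script, which may differ from the limits of the elasticsearch service user:

```python
# Minimums taken from the list above.
REQUIRED_MINIMUMS = {
    "max_file_descriptors": 65535,
    "max_threads": 4096,
    "vm_max_map_count": 262144,
}


def failed_checks(observed: dict) -> list:
    """Return a message for every limit that is missing or too low."""
    failures = []
    for name, minimum in REQUIRED_MINIMUMS.items():
        value = observed.get(name)
        if value is None or value < minimum:
            failures.append(f"{name}: have {value}, need at least {minimum}")
    return failures


def read_linux_limits() -> dict:
    """Read the current process limits (Linux only, current user only)."""
    import resource  # Unix-only module, imported lazily on purpose

    def soft_limit(which: int) -> float:
        soft, _hard = resource.getrlimit(which)
        # RLIM_INFINITY means "unlimited"; treat it as passing any minimum.
        return float("inf") if soft == resource.RLIM_INFINITY else soft

    with open("/proc/sys/vm/max_map_count") as handle:
        max_map_count = int(handle.read().strip())
    return {
        "max_file_descriptors": soft_limit(resource.RLIMIT_NOFILE),
        "max_threads": soft_limit(resource.RLIMIT_NPROC),
        "vm_max_map_count": max_map_count,
    }


if __name__ == "__main__":
    # Example values only; call read_linux_limits() on the actual server.
    example = {
        "max_file_descriptors": 1024,
        "max_threads": 4096,
        "vm_max_map_count": 65530,
    }
    for message in failed_checks(example):
        print(message)
```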

One thing I am trying is lowering:

8-13:-XX:CMSInitiatingOccupancyFraction=75

to

8-13:-XX:CMSInitiatingOccupancyFraction=70

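I believe this will make garbage collection kick in sooner and prevent out-of-memory errors. If it helps, we will try tuning it up and down to see the effect.

Switching to G1GC

I realize this is not really encouraged, but there are some articles discussing similar out-of-memory problems where switching to G1GC helped: https://medium.com/naukri-engineering/garbage-collection-in-elasticsearch-and-the-g1gc-16b79a447181

That will be the last thing I try.

Update:

After all these changes, the indices finally turned green (the template fix worked).

There were no problems overnight either. It is not as snappy as it was with 50GB of RAM, but at least it is stable.

General advice for future Elasticsearch troubleshooters: go through the bootstrap checks. That will at least put you at a performance baseline.

Update: found a problem where the JVM was picking up settings from two locations and using them for different purposes.

It looks like the sysadmin put heap_size.options into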

/etc/elasticsearch/jvm.options.d

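There the JVM heap was set to 31GB, but the main jvm.options file said 8GB. As a result, the GC collection threads were working with only 8GB of RAM (while all 31GB of RAM was still taken).

I deleted that file and set 31GB in the jvm.options file.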
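
A quick way to spot this kind of conflict is to list every heap flag across jvm.options and the files in jvm.options.d. The sketch below works on file contents passed in as strings, so it can be tried without touching a real install. `find_heap_flags` and `conflicting` are hypothetical helpers, and the assumption that the main file is read first, followed by the `.options` files in sorted order, should be checked against the documentation for your version:

```python
import re
from typing import Dict, List, Tuple

# -Xms/-Xmx, optionally behind a JDK version prefix such as "8-13:".
HEAP_FLAG = re.compile(r"^(?:\d+(?:-\d*)?:)?(-Xm[sx])(\S+)$")


def find_heap_flags(files: Dict[str, str]) -> List[Tuple[str, str, str]]:
    """Return (file name, flag, size) for each heap flag, in assumed read order."""
    main = [name for name in files if name == "jvm.options"]
    extra = sorted(
        name for name in files
        if name != "jvm.options" and name.endswith(".options")
    )
    found = []
    for name in main + extra:
        for raw in files[name].splitlines():
            line = raw.strip()
            if line.startswith("#"):
                continue
            match = HEAP_FLAG.match(line)
            if match:
                found.append((name, match.group(1), match.group(2)))
    return found


def conflicting(found: List[Tuple[str, str, str]]) -> Dict[str, List[str]]:
    """Map each heap flag to its distinct sizes when more than one is set."""
    sizes: Dict[str, set] = {}
    for _name, flag, size in found:
        sizes.setdefault(flag, set()).add(size.lower())
    return {flag: sorted(values) for flag, values in sizes.items() if len(values) > 1}


# Contents modelled on the situation described above.
FILES = {
    "jvm.options": "# main file\n-Xms8g\n-Xmx8g\n",
    "jvm.options.d/heap_size.options": "-Xms31g\n-Xmx31g\n",
}

if __name__ == "__main__":
    for entry in find_heap_flags(FILES):
        print(entry)
    print(conflicting(FILES and find_heap_flags(FILES)))
```

Keeping the heap size in exactly one place, whichever file that is, avoids having to reason about which value takes effect.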

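That stabilized things somewhat, but the GC was still collecting at a high rate.

As soon as I add any attribute to the list of attributes to be indexed, GC runs out of memory again.

The only way to recover is to delete the index and reindex.

I am considering nuking the entire Elasticsearch installation and setting it up myself.

It shouldn't be that hard.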

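【Comments】:

A yellow cluster means that some replicas cannot be allocated. Because you only have 1 node, you need to set the number of replicas to 0, which will turn the cluster green. As for why your cluster is crashing, you need to provide more information; can you share the logs from the time of the crash? Also try setting the heap to 30 GB instead of 48 GB (see the documentation).
@leandrojmp Thanks! I updated the post with elasticsearch.log. I am currently stuck without a DEV server (it is being used for something else), so I need to work this out on the Docker setup on my local machine.

Tags: elasticsearch magento2 elasticsearch-7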

【Solution 1】:

A few things:

High CPU or memory usage will not be caused by those gateway settings being unset, and for a single-node cluster they are somewhat irrelevant
We recommend keeping the heap below 32GB, see https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-heap-size
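You can never allocate a replica shard on the same node as its primary. So for a single-node cluster you either need to remove the replicas (risky) or add another node (ideally 2) to the cluster
Setting up a multi-node cluster on the same host is somewhat pointless. Sure, your replicas will be allocated, but if you lose the host you lose all of the data anyway

I recommend looking at https://www.elastic.co/guide/en/elasticsearch/reference/7.14/bootstrap-checks.html and applying the settings it discusses, because even if you are running a single node, these are what we refer to as production-ready settings

Other than that, do you have monitoring enabled? What do your Elasticsearch logs show? What about hot threads? Or slow logs?

(By the way, it's Elasticsearch, the s is not camel-cased ;))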

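【Comments】:

Thanks! I updated the post with elasticsearch.log and the camel-case fix :)
Thanks. As an FYI for the future, gist/pastebin/etc. is better for logs; it is easier to share than uploading and downloading a file :)
OK, there is a lot of GC going on that is not achieving much. What is the output of the _cluster/stats?pretty&human API (in a gist/pastebin/etc., please)?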
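
The "lot of GC not achieving much" observation can be put into numbers using the node stats API (`GET _nodes/stats/jvm`), which reports cumulative collection counts and times per collector. A minimal sketch, assuming the response has already been fetched and parsed, and that the dict passed in is the `jvm` object of a single node. `gc_overhead` is a name invented here, and the sample numbers are made up for illustration, not taken from this cluster:

```python
def gc_overhead(node_jvm: dict) -> dict:
    """Share of JVM uptime spent in each collector, as a percentage."""
    uptime_ms = node_jvm["uptime_in_millis"]
    result = {}
    for name, stats in node_jvm["gc"]["collectors"].items():
        spent_ms = stats["collection_time_in_millis"]
        result[name] = {
            "collections": stats["collection_count"],
            "percent_of_uptime": round(100.0 * spent_ms / uptime_ms, 2) if uptime_ms else 0.0,
        }
    return result


# Made-up sample shaped like nodes.<node_id>.jvm in the stats response.
SAMPLE_JVM = {
    "uptime_in_millis": 3_600_000,
    "gc": {
        "collectors": {
            "young": {"collection_count": 1200, "collection_time_in_millis": 90_000},
            "old": {"collection_count": 40, "collection_time_in_millis": 540_000},
        }
    },
}

if __name__ == "__main__":
    for collector, figures in gc_overhead(SAMPLE_JVM).items():
        print(collector, figures)
```

A collector that takes a large share of uptime while heap usage barely drops afterwards is the pattern described in the comment above.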

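【Solution 2】:

We have resolved the issue. The problem was a bad installation.

Something was not working properly (we still don't know what the exact problem was).

Both ES and Java were reinstalled. I matched ES to the specific version running in my development environment.

You can see here that GC is finally working properly.

We also got ES directly from the source. The previous installation came from some random repository.

I added all the attributes the company needs and it did not even notice: stable and fast.

Thanks to everyone who helped me through these steps, because I would not have nuked the ES installation without knowing that I had done everything possible to stabilize it.

This also taught me a lesson in configuring ES :)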
