【Question title】: Elasticsearch crashing
【Posted】: 2021-09-28 00:35:45
【Question】:

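We are having issues with Elasticsearch crashing from time to time. It also sometimes causes RAM + CPU to spike, and the server becomes unresponsive.

We have left most of the settings as they were, but had to give the JVM heap more RAM (48GB) to keep it from crashing so frequently.

I started digging, and apparently 32GB is the maximum heap you should use. We will adjust that.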
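
The sizing rule mentioned above can be sketched as follows. This is a rough illustration, assuming the commonly cited guidance of giving the heap at most half of the physical RAM while staying below the roughly 32GB point where the JVM stops using compressed object pointers. The function names and the 31GB cap are assumptions made for illustration, not values taken from this server's configuration.

```python
def recommended_heap_gb(total_ram_gb, cap_gb=31):
    """Return a heap size in GB: half of RAM, but never above cap_gb.

    cap_gb=31 is an assumed safety margin under the ~32GB threshold
    for compressed object pointers.
    """
    half_of_ram = total_ram_gb // 2
    return max(1, min(half_of_ram, cap_gb))


def heap_flag_lines(total_ram_gb):
    """Build matching -Xms/-Xmx lines (min and max heap should be equal)."""
    heap = recommended_heap_gb(total_ram_gb)
    return [f"-Xms{heap}g", f"-Xmx{heap}g"]


# The server in the question has 125GB of RAM.
print(heap_flag_lines(125))
```

For the 125GB server described here, half of RAM would be far above the cap, so the cap decides the result.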

The server is:

CentOS 7
RAM: 125GB
CPU: 40 threads
Hard Drive: 2x Raid 1 NVME

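^^^ That is more than enough hardware to handle something like this, but something tells me more configuration is needed to handle this much data.

We are running a Magento 2.4.3 CE store with about 400,000 products.

Here are all of our config files:

jvm.options file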

    ## JVM configuration
    
    ################################################################
    ## IMPORTANT: JVM heap size
    ################################################################
    ##
    ## You should always set the min and max JVM heap
    ## size to the same value. For example, to set
    ## the heap to 4 GB, set:
    ##
    ## -Xms4g
    ## -Xmx4g
    ##
    ## See https://www.elastic.co/guide/en/elasticsearch/reference/current/heap-size.html
    ## for more information
    ##
    ################################################################
    
    # Xms represents the initial size of total heap space
    # Xmx represents the maximum size of total heap space
    
    -Xms48g
    -Xmx48g
    
    ################################################################
    ## Expert settings
    ################################################################
    ##
    ## All settings below this section are considered
    ## expert settings. Don't tamper with them unless
    ## you understand what you are doing
    ##
    ################################################################
    
    ## GC configuration
    8-13:-XX:+UseConcMarkSweepGC
    8-13:-XX:CMSInitiatingOccupancyFraction=75
    8-13:-XX:+UseCMSInitiatingOccupancyOnly
    
    ## G1GC Configuration
    # NOTE: G1 GC is only supported on JDK version 10 or later
    # to use G1GC, uncomment the next two lines and update the version on the
    # following three lines to your version of the JDK
    # 10-13:-XX:-UseConcMarkSweepGC
    # 10-13:-XX:-UseCMSInitiatingOccupancyOnly
    14-:-XX:+UseG1GC
    14-:-XX:G1ReservePercent=25
    14-:-XX:InitiatingHeapOccupancyPercent=30
    
    ## DNS cache policy
    # cache ttl in seconds for positive DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.ttl; set to -1 to cache forever
    -Des.networkaddress.cache.ttl=60
    # cache ttl in seconds for negative DNS lookups noting that this overrides the
    # JDK security property networkaddress.cache.negative ttl; set to -1 to cache
    # forever
    -Des.networkaddress.cache.negative.ttl=10
    
    ## optimizations
    
    # pre-touch memory pages used by the JVM during initialization
    -XX:+AlwaysPreTouch
    
    ## basic
    
    # explicitly set the stack size
    -Xss1m
    
    # set to headless, just in case
    -Djava.awt.headless=true
    
    # ensure UTF-8 encoding by default (e.g. filenames)
    -Dfile.encoding=UTF-8
    
    # use our provided JNA always versus the system one
    -Djna.nosys=true
    
    # turn off a JDK optimization that throws away stack traces for common
    # exceptions because stack traces are important for debugging
    -XX:-OmitStackTraceInFastThrow
    
    # enable helpful NullPointerExceptions (https://openjdk.java.net/jeps/358), if
    # they are supported
    14-:-XX:+ShowCodeDetailsInExceptionMessages
    
    # flags to configure Netty
    -Dio.netty.noUnsafe=true
    -Dio.netty.noKeySetOptimization=true
    -Dio.netty.recycler.maxCapacityPerThread=0
    
    # log4j 2
    -Dlog4j.shutdownHookEnabled=false
    -Dlog4j2.disable.jmx=true
    
    -Djava.io.tmpdir=${ES_TMPDIR}
    
    ## heap dumps
    
    # generate a heap dump when an allocation from the Java heap fails
    # heap dumps are created in the working directory of the JVM
    -XX:+HeapDumpOnOutOfMemoryError
    
    # specify an alternative path for heap dumps; ensure the directory exists and
    # has sufficient space
    -XX:HeapDumpPath=/var/lib/elasticsearch
    
    # specify an alternative path for JVM fatal error logs
    -XX:ErrorFile=/var/log/elasticsearch/hs_err_pid%p.log
    
    ## JDK 8 GC logging
    
    8:-XX:+PrintGCDetails
    8:-XX:+PrintGCDateStamps
    8:-XX:+PrintTenuringDistribution
    8:-XX:+PrintGCApplicationStoppedTime
    8:-Xloggc:/var/log/elasticsearch/gc.log
    8:-XX:+UseGCLogFileRotation
    8:-XX:NumberOfGCLogFiles=32
    8:-XX:GCLogFileSize=64m
    
    # JDK 9+ GC logging
    9-:-Xlog:gc*,gc+age=trace,safepoint:file=/var/log/elasticsearch/gc.log:utctime,pid,tags:filecount=32,filesize=64m
    # due to internationalization enhancements in JDK 9 Elasticsearch need to set the provider to COMPAT otherwise
    # time/date parsing will break in an incompatible way for some date patterns and locals
    9-:-Djava.locale.providers=COMPAT
    
    # temporary workaround for C2 bug with JDK 10 on hardware with AVX-512
    10-:-XX:UseAVX=2


**elasticsearch.yml**

    # ======================== Elasticsearch Configuration =========================
    #
    # NOTE: Elasticsearch comes with reasonable defaults for most settings.
    #       Before you set out to tweak and tune the configuration, make sure you
    #       understand what are you trying to accomplish and the consequences.
    #
    # The primary way of configuring a node is via this file. This template lists
    # the most important settings you may want to configure for a production cluster.
    #
    # Please consult the documentation for further information on configuration options:
    # https://www.elastic.co/guide/en/elasticsearch/reference/index.html
    #
    # ---------------------------------- Cluster -----------------------------------
    #
    # Use a descriptive name for your cluster:
    #
    #cluster.name: my-application
    #
    # ------------------------------------ Node ------------------------------------
    #
    # Use a descriptive name for the node:
    #
    #node.name: node-1
    #
    # Add custom attributes to the node:
    #
    #node.attr.rack: r1
    #
    # ----------------------------------- Paths ------------------------------------
    #
    # Path to directory where to store the data (separate multiple locations by comma):
    #
    path.data: /var/lib/elasticsearch
    #
    # Path to log files:
    #
    path.logs: /var/log/elasticsearch
    #
    # ----------------------------------- Memory -----------------------------------
    #
    # Lock the memory on startup:
    #
    #bootstrap.memory_lock: true
    #
    # Make sure that the heap size is set to about half the memory available
    # on the system and that the owner of the process is allowed to use this
    # limit.
    #
    # Elasticsearch performs poorly when the system is swapping the memory.
    #
    # ---------------------------------- Network -----------------------------------
    #
    # Set the bind address to a specific IP (IPv4 or IPv6):
    #
    #network.host: 192.168.0.1
    #
    # Set a custom port for HTTP:
    #
    #http.port: 9200
    #
    # For more information, consult the network module documentation.
    #
    # --------------------------------- Discovery ----------------------------------
    #
    # Pass an initial list of hosts to perform discovery when new node is started:
    # The default list of hosts is ["127.0.0.1", "[::1]"]
    #
    #discovery.zen.ping.unicast.hosts: ["host1", "host2"]
    #
    # Prevent the "split brain" by configuring the majority of nodes (total number of master-eligible nodes / 2 + 1):
    #
    #discovery.zen.minimum_master_nodes:
    #
    # For more information, consult the zen discovery module documentation.
    #
    # ---------------------------------- Gateway -----------------------------------
    #
    # Block initial recovery after a full cluster restart until N nodes are started:
    #
    #gateway.recover_after_nodes: 3
    #
    # For more information, consult the gateway module documentation.
    #
    # ---------------------------------- Various -----------------------------------
    #
    # Require explicit names when deleting indices:
    #
    #action.destructive_requires_name: true
    xpack.security.enabled: true

From what I have researched, the RAM + CPU spikes could be caused by these settings not being set:

    gateway.expected_nodes: 10
    gateway.recover_after_time: 5m

Here is some other data from Elasticsearch:

    curl -XGET --user username:password http://localhost:9200/

    {
      "name" : "web1.example.com",
      "cluster_name" : "elasticsearch",
      "cluster_uuid" : "S8fFQ993QDWkLY8lZtp_mQ",
      "version" : {
        "number" : "7.13.2",
        "build_flavor" : "default",
        "build_type" : "rpm",
        "build_hash" : "4d960a0733be83dd2543ca018aa4ddc42e956800",
        "build_date" : "2021-06-10T21:01:55.251515791Z",
        "build_snapshot" : false,
        "lucene_version" : "8.8.2",
        "minimum_wire_compatibility_version" : "6.8.0",
        "minimum_index_compatibility_version" : "6.0.0-beta1"
      },
      "tagline" : "You Know, for Search"
    }

    curl --user username:password -sS http://localhost:9200/_cluster/health?pretty

    {
      "cluster_name" : "elasticsearch",
      "status" : "yellow",
      "timed_out" : false,
      "number_of_nodes" : 1,
      "number_of_data_nodes" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 5,
      "relocating_shards" : 0,
      "initializing_shards" : 0,
      "unassigned_shards" : 4,
      "delayed_unassigned_shards" : 0,
      "number_of_pending_tasks" : 0,
      "number_of_in_flight_fetch" : 0,
      "task_max_waiting_in_queue_millis" : 0,
      "active_shards_percent_as_number" : 55.55555555555556
    }

    curl --user username:password -sS http://localhost:9200/_cluster/allocation/explain?pretty

    {
      "index" : "example-amasty_product_1_v156",
      "shard" : 0,
      "primary" : false,
      "current_state" : "unassigned",
      "unassigned_info" : {
        "reason" : "INDEX_CREATED",
        "at" : "2021-09-14T16:52:28.854Z",
        "last_allocation_status" : "no_attempt"
      },
      "can_allocate" : "no",
      "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes",
      "node_allocation_decisions" : [
        {
          "node_id" : "2THEUTSaQdmOJAAhTTN71g",
          "node_name" : "web1.example.com",
          "transport_address" : "127.0.0.1:9300",
          "node_attributes" : {
            "ml.machine_memory" : "134622244864",
            "xpack.installed" : "true",
            "transform.node" : "true",
            "ml.max_open_jobs" : "512",
            "ml.max_jvm_size" : "51539607552"
          },
          "node_decision" : "no",
          "weight_ranking" : 1,
          "deciders" : [
            {
              "decider" : "same_shard",
              "decision" : "NO",
              "explanation" : "a copy of this shard is already allocated to this node"
            }
          ]
        }
      ]
    }

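^^^ The problem here is that I don't know how to set up multiple nodes on a single machine.

As I understand it, the misconfiguration is that we are running only one node. From what I have read, green status requires 3 master nodes.

How do I set up multiple nodes on a single machine, and do I need to add more data nodes?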
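
To illustrate why a single node reports yellow, here is a minimal sketch of the rule quoted by the `same_shard` decider above: a replica can never be placed on a node that already holds a copy of the same shard. The function name and the index layout (four indices with one replica each, one index with none) are assumptions chosen so that the numbers match the health output; the real layout is not shown in the post.

```python
def cluster_health(node_count, indices):
    """Estimate shard allocation for a cluster with node_count >= 1 nodes.

    indices is a list of (primaries, replicas_per_primary) tuples.
    Each shard can have at most one copy per node, so at most
    node_count copies of any shard can be assigned.
    """
    active = unassigned = 0
    for primaries, replicas in indices:
        copies_wanted = 1 + replicas
        copies_possible = min(copies_wanted, node_count)
        active += primaries * copies_possible
        unassigned += primaries * (copies_wanted - copies_possible)
    return {
        "status": "green" if unassigned == 0 else "yellow",
        "active_shards": active,
        "unassigned_shards": unassigned,
        "active_percent": 100.0 * active / (active + unassigned),
    }


# Assumed layout: 4 indices with 1 primary + 1 replica, 1 index with no replica.
layout = [(1, 1)] * 4 + [(1, 0)]
print(cluster_health(1, layout))            # one node: replicas stay unassigned
print(cluster_health(1, [(1, 0)] * 5))      # one node, replicas set to 0
print(cluster_health(2, layout))            # a second node can hold the replicas
```

Under this model the status depends only on whether every requested copy has a node of its own, so either removing the replicas or adding a node clears the unassigned shards.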

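My main suspicions:

  • Not enough master/data nodes
  • Issues with the newer garbage collector (G1GC is enabled; I was not sure how to tell from the config which one is currently active). Update: figured it out, G1 is in use.
  • No recovery settings in place for when a crash happens (gateway.expected_nodes, gateway.recover_after_time)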
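
On the question of telling from the config which collector is active: the version prefixes in jvm.options (`8:`, `8-13:`, `14-:`) select flags by JDK major version. Below is a minimal sketch of a reader for that prefix syntax. The helper names are made up for this example, and the sketch ignores anything set through jvm.options.d or environment variables, so treat its answer as a hint rather than as what the running JVM actually uses.

```python
import re

# Optional JDK version prefix ("8:", "14-:" or "8-13:") followed by the flag.
_LINE = re.compile(r"^(?:(\d+)(-)?(\d+)?:)?(-.*)$")


def flags_for_jdk(options_text, jdk_major):
    """Return the flags from jvm.options text that apply to jdk_major."""
    flags = []
    for raw in options_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        match = _LINE.match(line)
        if match is None:
            continue
        low, dash, high, flag = match.groups()
        if low is None:
            applies = True                              # no prefix: every JDK
        elif dash is None:
            applies = jdk_major == int(low)             # "8:" exact version
        elif high is None:
            applies = jdk_major >= int(low)             # "14-:" open-ended
        else:
            applies = int(low) <= jdk_major <= int(high)  # "8-13:" range
        if applies:
            flags.append(flag)
    return flags


def active_collector(options_text, jdk_major):
    """Guess the collector selected by the flags for this JDK version."""
    flags = flags_for_jdk(options_text, jdk_major)
    if "-XX:+UseG1GC" in flags:
        return "G1"
    if "-XX:+UseConcMarkSweepGC" in flags:
        return "CMS"
    return "JVM default"


# GC lines copied from the jvm.options file shown in the question.
GC_LINES = """
8-13:-XX:+UseConcMarkSweepGC
8-13:-XX:CMSInitiatingOccupancyFraction=75
8-13:-XX:+UseCMSInitiatingOccupancyOnly
14-:-XX:+UseG1GC
14-:-XX:G1ReservePercent=25
14-:-XX:InitiatingHeapOccupancyPercent=30
"""

for version in (11, 16):
    print(version, active_collector(GC_LINES, version))
```

With these lines, the CMS flags only take effect on JDK 8 to 13, and the G1 flags on JDK 14 or later, so the JDK version that Elasticsearch runs on decides the collector.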

UPDATE:

Here is the error log from elasticsearch.log:

https://privfile.com/download.php?fid=6141256d5e639-MTAwNTM=

Sorry, the log file does not fit into a Stack Overflow post :)

Pastebin:

Part 1: https://pastebin.com/86sLM9BD Part 2: https://pastebin.com/1VEn63TQ

UPDATE:

Output of _cluster/stats?pretty&human:

https://pastebin.com/EM8ZMVst

UPDATE:

Figured out how to limit the number of replicas.

This can be done with a template:

    PUT _template/all
    {
      "template": "*",
      "settings": {
        "number_of_replicas": 0
      }
    }
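
A template only affects indices created after it is stored, so indices that already exist need a separate settings update. The sketch below builds both requests without sending them. It is an illustration under assumptions: it uses the `index_patterns` field, which is the documented field for `_template` in the 7.x reference, instead of the `template` key shown above; the helper names are invented; and the host and credentials are the placeholders from the curl commands in this post.

```python
import base64
import json
import urllib.request

BASE_URL = "http://localhost:9200"  # host and port as used in the question


def es_request(method, path, body, user="username", password="password"):
    """Build (but do not send) an authenticated JSON request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return urllib.request.Request(
        url=BASE_URL + path,
        data=json.dumps(body).encode(),
        method=method,
        headers={
            "Content-Type": "application/json",
            "Authorization": "Basic " + token,
        },
    )


def replica_requests():
    """Return the two requests that set the replica count to 0."""
    # Template: applies to indices created in the future.
    template = es_request(
        "PUT",
        "/_template/all",
        {"index_patterns": ["*"], "settings": {"number_of_replicas": 0}},
    )
    # Settings update: applies to indices that already exist.
    existing = es_request(
        "PUT", "/_all/_settings", {"index": {"number_of_replicas": 0}}
    )
    return [template, existing]


for req in replica_requests():
    print(req.get_method(), req.full_url, req.data.decode())
    # To actually send it: urllib.request.urlopen(req)
```

Running without replicas means a lost disk or host loses the data, so this only makes sense when the indices can be rebuilt, as with a Magento catalog reindex.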

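I will test it tomorrow to see whether it has an effect and turns the status green.

I don't think it will make any difference performance-wise, but we will see.

I am also working through the other suggestions:

  • RAM usage (heap) is limited to 31GB
  • File descriptors are set to 65535
  • Max number of threads is set to 4096
  • The max size virtual memory check has been raised and configured
  • Max map count has been raised to 262144
  • G1GC is disabled (default)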
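
The numeric limits in the list above can be checked with a small script. This is a sketch only: it covers the three numeric thresholds from the list (not the memory lock or virtual memory size checks), the names are invented for this example, and `read_linux_limits` assumes a Linux host and should be run as the user that runs Elasticsearch.

```python
# Minimum values taken from the list above.
REQUIRED = {
    "max_file_descriptors": 65535,
    "max_threads": 4096,
    "vm.max_map_count": 262144,
}


def failed_checks(actual, required=REQUIRED):
    """Return {name: (actual, required)} for every limit that is too low."""
    failures = {}
    for name, minimum in required.items():
        value = actual.get(name, 0)  # a missing value counts as too low
        if value < minimum:
            failures[name] = (value, minimum)
    return failures


def read_linux_limits():
    """Read the current soft limits of this process on Linux."""
    import resource

    def soft(limit):
        value = resource.getrlimit(limit)[0]
        return float("inf") if value == resource.RLIM_INFINITY else value

    with open("/proc/sys/vm/max_map_count") as handle:
        max_map_count = int(handle.read())
    return {
        "max_file_descriptors": soft(resource.RLIMIT_NOFILE),
        "max_threads": soft(resource.RLIMIT_NPROC),
        "vm.max_map_count": max_map_count,
    }


# Hypothetical values for a host that has not been tuned yet.
untuned = {
    "max_file_descriptors": 4096,
    "max_threads": 4096,
    "vm.max_map_count": 65530,
}
print(failed_checks(untuned))
```

On the real server, `failed_checks(read_linux_limits())` should return an empty dict once the limits from the list are in place.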

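One thing I am experimenting with is lowering

    8-13:-XX:CMSInitiatingOccupancyFraction=75

to

    8-13:-XX:CMSInitiatingOccupancyFraction=70

I believe this will make garbage collection start earlier and prevent out-of-memory errors. If it helps, we will try adjusting it up/down to see what happens.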

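Switching to G1GC

I realize this is not really encouraged, but there are some articles discussing similar out-of-memory issues where switching to G1GC helped resolve the problem: https://medium.com/naukri-engineering/garbage-collection-in-elasticsearch-and-the-g1gc-16b79a447181

This will be the last thing I try.

UPDATE:

After all these changes, the indices finally turned green (the template fix worked).

There were also no issues overnight. It is not as snappy as it was with 50GB of RAM, but at least it is stable.

General advice for future Elasticsearch troubleshooters: go through the bootstrap checks. That will at least get you to a performance baseline.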

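UPDATE: Found an issue where the JVM was picking up settings from two locations and using them for different purposes.

It looks like the sysadmin had put a heap_size.options file into

    /etc/elasticsearch/jvm.options.d

There the JVM heap was set to 31GB, but the main jvm.options file said 8GB. This affected the GC threads, which were running with only 8GB of RAM (although all 31GB of RAM was still being taken up).

I removed that file and set 31GB in the jvm.options file.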
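
A conflict like this can be spotted by listing every heap flag across both locations. The sketch below is a minimal illustration: the function names are made up, it only looks at `jvm.options` and `*.options` files in `jvm.options.d`, and it makes no claim about which value the JVM ends up using. The demo recreates the situation described above in a temporary directory instead of touching /etc/elasticsearch.

```python
import re
import tempfile
from pathlib import Path

# -Xms / -Xmx, optionally behind a JDK version prefix such as "8-13:".
HEAP_FLAG = re.compile(r"^(?:\d+(?:-\d*)?:)?(-Xm[sx])(\S+)$")


def find_heap_flags(config_dir):
    """List (file name, flag, value) for every heap flag under config_dir."""
    config_dir = Path(config_dir)
    files = [config_dir / "jvm.options"]
    files += sorted((config_dir / "jvm.options.d").glob("*.options"))
    found = []
    for path in files:
        if not path.is_file():
            continue
        for line in path.read_text().splitlines():
            match = HEAP_FLAG.match(line.strip())
            if match:
                found.append((path.name, match.group(1), match.group(2)))
    return found


def conflicting(found):
    """Return the flags that are set to more than one distinct value."""
    values = {}
    for _, flag, value in found:
        values.setdefault(flag, set()).add(value)
    return {flag: sorted(vals) for flag, vals in values.items() if len(vals) > 1}


# Demo: 8GB in jvm.options, 31GB in jvm.options.d/heap_size.options.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "jvm.options.d").mkdir()
    (root / "jvm.options").write_text("-Xms8g\n-Xmx8g\n")
    (root / "jvm.options.d" / "heap_size.options").write_text("-Xms31g\n-Xmx31g\n")
    for entry in find_heap_flags(root):
        print(entry)
    print(conflicting(find_heap_flags(root)))
```

On a real installation the call would be `find_heap_flags("/etc/elasticsearch")`; an empty result from `conflicting` means each heap flag has a single value across the files.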

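This stabilized the situation somewhat, but GC was still collecting at a high rate.

As soon as I added any attribute to the list to be indexed, GC would run out of memory again.

The only way to recover was to delete the indices and reindex.

I am considering nuking the whole Elasticsearch installation and redoing it myself.

It shouldn't be that hard.

【Comments】:

  • A yellow cluster means that some replicas could not be allocated. Since you have only 1 node, you need to set the number of replicas to 0, which will turn the cluster green. As for why your cluster is crashing, you need to provide more information; can you share the logs from the time of a crash? Also try setting the heap to 30 GB instead of 48 GB (see the documentation).
  • @leandrojmp Thanks! I have updated the post with elasticsearch.log. I am currently stuck without a DEV server (it is being used for something else), so I need to work this out on the docker setup on my local machine.

Tags: elasticsearch magento2 elasticsearch-7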


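【Answer 1】:

A few things:

  • high CPU or memory use will not be caused by not setting those gateway settings, and on a single-node cluster they are somewhat irrelevant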
  • we recommend keeping the heap under the roughly 32GB limit mentioned in the question; see https://www.elastic.co/guide/en/elasticsearch/reference/current/advanced-configuration.html#set-jvm-heap-size
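  • you can never allocate a replica shard on the same node as its primary. So for a single-node cluster, you either need to remove the replicas (risky) or add another (ideally) 2 nodes to the cluster
  • setting up a multi-node cluster on the same host is somewhat pointless. Sure, your replicas will be allocated, but if you lose the host you lose all of the data anyway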

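I would suggest looking at https://www.elastic.co/guide/en/elasticsearch/reference/7.14/bootstrap-checks.html and applying the settings it talks about, because even if you are running a single node, these are what we call production-ready settings.

Beyond that, do you have monitoring enabled? What do your Elasticsearch logs show? What about hot threads? Or slow logs?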
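
As one way to start answering these questions without full monitoring, the node stats API (`GET _nodes/stats/jvm`) reports heap usage and GC counters, and `GET _nodes/hot_threads` returns a plain-text dump of the busiest threads. The sketch below summarises a node stats response. The function name and the sample numbers are hypothetical, and the sample is trimmed to the fields the function reads.

```python
def gc_summary(node_stats):
    """Summarise heap use and old-generation GC activity per node.

    node_stats is the parsed JSON body of GET _nodes/stats/jvm.
    """
    rows = []
    for node in node_stats["nodes"].values():
        jvm = node["jvm"]
        old = jvm["gc"]["collectors"]["old"]
        count = old["collection_count"]
        total_ms = old["collection_time_in_millis"]
        rows.append({
            "node": node["name"],
            "heap_used_percent": jvm["mem"]["heap_used_percent"],
            "old_gc_count": count,
            # Average pause per old-generation collection, in milliseconds.
            "old_gc_avg_ms": total_ms / count if count else 0.0,
        })
    return rows


# Hypothetical response fragment, not taken from the server in the question.
SAMPLE = {
    "nodes": {
        "example-node-id": {
            "name": "web1.example.com",
            "jvm": {
                "mem": {"heap_used_percent": 92},
                "gc": {
                    "collectors": {
                        "young": {"collection_count": 1200, "collection_time_in_millis": 24000},
                        "old": {"collection_count": 40, "collection_time_in_millis": 60000},
                    }
                },
            },
        }
    }
}

for row in gc_summary(SAMPLE):
    print(row)
```

A heap that stays near full while old-generation collections keep running is the pattern usually described as GC doing a lot of work without freeing much.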

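(By the way, it's Elasticsearch; the s is not camel-cased ;))

【Comments】:

  • Thanks! I have updated the post with elasticsearch.log and the camel case fix :)
  • Thanks. As an FYI for the future, gist/pastebin/etc. is better for logs; it is easier to share than uploading + downloading a file :)
  • Well, there is a lot of GC that is not doing much. What is the output of the _cluster/stats?pretty&human API (in a gist/pastebin/etc., please)?

【Answer 2】: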

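We have resolved this issue. The problem was a bad installation.

Something was not working properly (we still don't know exactly what the issue was).

Both ES and Java were reinstalled. I matched ES to the specific version that runs in my development environment.

You can see here that GC is finally working properly.

We also got ES directly from the source this time. The previous installation came from some random repository.

I threw in all the attributes the company needs and it didn't even notice: stable and fast.

Thanks to everyone who helped me through these steps, because I would not have nuked the ES installation without knowing I had done everything possible to stabilize it.

It also taught me a lesson in configuring ES :)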
