Why are shards getting initialized and relocated during bulk insert: answers

【Question title】: Why are shards getting initialized and relocated during bulk insert
【Posted】: 2015-10-30 11:13:02
【Question】:

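I am trying to bulk insert data into a 4-node Elasticsearch cluster that has 3 data nodes.

Data node specs: 16 CPU - 7GB RAM - 500GB SSD

The data is inserted through the non-data node into an index split into 5 shards with 1 replica. There is about 250GB of data to insert.

However, after about an hour of processing, with ~40GB inserted on each node and usage peaking at ~60% CPU and ~30% RAM over the whole period, some shards go into the initializing state: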

~$ curl -XGET 'http://localhost:9200/_cluster/health/osm?level=shards&pretty=true'
{
  "cluster_name" : "elastic_osm",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 4,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 5,
  "active_shards" : 9,
  "relocating_shards" : 1,
  "initializing_shards" : 1,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "indices" : {
    "osm" : {
      "status" : "yellow",
      "number_of_shards" : 5,
      "number_of_replicas" : 1,
      "active_primary_shards" : 5,
      "active_shards" : 9,
      "relocating_shards" : 1,
      "initializing_shards" : 1,
      "unassigned_shards" : 0,
      "shards" : {
        "0" : {
          "status" : "yellow",
          "primary_active" : true,
          "active_shards" : 1,
          "relocating_shards" : 0,
          "initializing_shards" : 1,
          "unassigned_shards" : 0
        },
        "1" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "2" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 1,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "3" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        },
        "4" : {
          "status" : "green",
          "primary_active" : true,
          "active_shards" : 2,
          "relocating_shards" : 0,
          "initializing_shards" : 0,
          "unassigned_shards" : 0
        }
      }
    }
  }
}
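
When this check has to be repeated during a long import, it can help to reduce the health response to just the shards that are not healthy. The following is a minimal sketch, assuming the response has already been fetched and parsed into a Python dict; the function name `shards_needing_attention` and the trimmed `sample` are illustrative, not part of Elasticsearch.

```python
def shards_needing_attention(health, index):
    """Return (shard_id, status, initializing, relocating) for every shard
    that is not green or that currently has a relocating copy."""
    shards = health["indices"][index]["shards"]
    flagged = []
    # Shard ids are string keys in the response; sort them numerically.
    for shard_id, info in sorted(shards.items(), key=lambda kv: int(kv[0])):
        if info["status"] != "green" or info["relocating_shards"] > 0:
            flagged.append((shard_id, info["status"],
                            info["initializing_shards"],
                            info["relocating_shards"]))
    return flagged


# Trimmed copy of the response above (shards 0 to 2, only the fields used here).
sample = {"indices": {"osm": {"shards": {
    "0": {"status": "yellow", "initializing_shards": 1, "relocating_shards": 0},
    "1": {"status": "green", "initializing_shards": 0, "relocating_shards": 0},
    "2": {"status": "green", "initializing_shards": 0, "relocating_shards": 1},
}}}}

print(shards_needing_attention(sample, "osm"))
```

For the trimmed sample this flags shard 0 (yellow, one initializing copy) and shard 2 (green, but with one relocating copy), which matches the output shown above.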

Digging a bit deeper, I found that one node has a heap space problem:

~$ curl -XGET 'localhost:9200/osm/_search_shards?pretty=true'
{
  "nodes" : {
    "1DpvDUf7SKywJrBgQqs9eg" : {
      "name" : "elastic-osm-node-1",
      "transport_address" : "inet[/xxx.xxx.x.x:xxxx]",
      "attributes" : {
        "master" : "true"
      }
    },
    "FiBYw-v_QfO3nJQfHflf_w" : {
      "name" : "elastic-osm-node-3",
      "transport_address" : "inet[/xxx.xxx.x.x:x]",
      "attributes" : {
        "master" : "true"
      }
    },
    "ibpt8lGiS6yDJf4e09RN9Q" : {
      "name" : "elastic-osm-node-2",
      "transport_address" : "inet[/xxx.xxx.x.x:xxxx]",
      "attributes" : {
        "master" : "true"
      }
    }
  },
  "shards" : [ [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "ibpt8lGiS6yDJf4e09RN9Q",
    "relocating_node" : null,
    "shard" : 0,
    "index" : "osm"
  }, {
    "state" : "INITIALIZING",
    "primary" : false,
    "node" : "FiBYw-v_QfO3nJQfHflf_w",
    "relocating_node" : null,
    "shard" : 0,
    "index" : "osm",
    "unassigned_info" : {
      "reason" : "ALLOCATION_FAILED",
      "at" : "2015-10-30T10:42:25.539Z",
      "details" : "shard failure [engine failure, reason [already closed by tragic event]][OutOfMemoryError[Java heap space]]"
    }
  } ], [ {
    "state" : "STARTED",
    "primary" : true,
    "node" : "FiBYw-v_QfO3nJQfHflf_w",
    "relocating_node" : null,
    "shard" : 1,
    "index" : "osm"
  }, {
    "state" : "STARTED",
    "primary" : false,
    "node" : "1DpvDUf7SKywJrBgQqs9eg",
    "relocating_node" : null,
    "shard" : 1,
    "index" : "osm"
  } ], [ {
    "state" : "RELOCATING",
    "primary" : false,
    "node" : "FiBYw-v_QfO3nJQfHflf_w",
    "relocating_node" : "1DpvDUf7SKywJrBgQqs9eg",
    "shard" : 2,
    "index" : "osm"
  }, {
    "state" : "STARTED",
    "primary" : true,
    "node" : "ibpt8lGiS6yDJf4e09RN9Q",
    "relocating_node" : null,
    "shard" : 2,
    "index" : "osm"
  }, {
    "state" : "INITIALIZING",
    "primary" : false,
    "node" : "1DpvDUf7SKywJrBgQqs9eg",
    "relocating_node" : "FiBYw-v_QfO3nJQfHflf_w",
    "shard" : 2,
    "index" : "osm"
  } ], [ {
    "state" : "STARTED",
    "primary" : false,
    "node" : "FiBYw-v_QfO3nJQfHflf_w",
    "relocating_node" : null,
    "shard" : 3,
    "index" : "osm"
  }, {
    "state" : "STARTED",
    "primary" : true,
    "node" : "1DpvDUf7SKywJrBgQqs9eg",
    "relocating_node" : null,
    "shard" : 3,
    "index" : "osm"
  } ], [ {
    "state" : "STARTED",
    "primary" : false,
    "node" : "ibpt8lGiS6yDJf4e09RN9Q",
    "relocating_node" : null,
    "shard" : 4,
    "index" : "osm"
  }, {
    "state" : "STARTED",
    "primary" : true,
    "node" : "FiBYw-v_QfO3nJQfHflf_w",
    "relocating_node" : null,
    "shard" : 4,
    "index" : "osm"
  } ] ]
}
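
The failure reason is easy to miss in a response of this size. As a rough sketch, assuming the `_search_shards` response has been parsed into a Python dict, the shard copies that carry an `unassigned_info` block can be pulled out as follows; `failed_copies` and the trimmed `sample` are hypothetical names used only for illustration.

```python
def failed_copies(search_shards):
    """List shard copies that carry an unassigned_info block,
    together with the node name and the recorded reason."""
    nodes = search_shards["nodes"]
    found = []
    # "shards" is a list of groups, one group per shard id.
    for group in search_shards["shards"]:
        for copy in group:
            info = copy.get("unassigned_info")
            if info is None:
                continue
            node_id = copy["node"]
            found.append({
                "shard": copy["shard"],
                "primary": copy["primary"],
                "state": copy["state"],
                # Fall back to the raw node id if the node is not listed.
                "node": nodes.get(node_id, {}).get("name", node_id),
                "reason": info["reason"],
                "out_of_memory": "OutOfMemoryError" in info.get("details", ""),
            })
    return found


# Trimmed copy of the response above (shard 0 only).
sample = {
    "nodes": {
        "FiBYw-v_QfO3nJQfHflf_w": {"name": "elastic-osm-node-3"},
        "ibpt8lGiS6yDJf4e09RN9Q": {"name": "elastic-osm-node-2"},
    },
    "shards": [[
        {"state": "STARTED", "primary": True,
         "node": "ibpt8lGiS6yDJf4e09RN9Q", "shard": 0, "index": "osm"},
        {"state": "INITIALIZING", "primary": False,
         "node": "FiBYw-v_QfO3nJQfHflf_w", "shard": 0, "index": "osm",
         "unassigned_info": {
             "reason": "ALLOCATION_FAILED",
             "details": "shard failure [engine failure, reason [already closed "
                        "by tragic event]][OutOfMemoryError[Java heap space]]"}},
    ]],
}

for entry in failed_copies(sample):
    print(entry)
```

On the trimmed sample this reports the replica of shard 0 on elastic-osm-node-3, with reason ALLOCATION_FAILED and the out-of-memory flag set.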

But ES_HEAP_SIZE on the servers is set to half of the memory:

~$ echo $ES_HEAP_SIZE
7233.0m

And only 5 GB is in use:

~$ free -g
             total       used
Mem:            14          5

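If I wait a little longer, the node leaves the cluster entirely and all the replicas go into the initializing state, which makes my insert fail and stop: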

{
    "state" : "INITIALIZING",
    "primary" : false,
    "node" : "ibpt8lGiS6yDJf4e09RN9Q",
    "relocating_node" : null,
    "shard" : 3,
    "index" : "osm",
    "unassigned_info" : {
      "reason" : "NODE_LEFT",
      "at" : "2015-10-30T10:53:32.044Z",
      "details" : "node_left[FiBYw-v_QfO3nJQfHflf_w]"
    }

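Conf: to speed up the insert, I used these parameters in the Elasticsearch configuration of the data nodes:

refresh_interval: -1, threadpool.bulk.size: 16, threadpool.bulk.queue_size: 1000

Why is this happening? How can I fix it so that my bulk insert succeeds? Does the maximum heap size need to be more than 50% of RAM?

EDIT: Since tuning Elasticsearch parameters is not good practice, I removed the thread pool parameters. It works, but very slowly. Elasticsearch is not designed to ingest too much data too fast.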

【Comments】:

Tags: elasticsearch

【Solution 1】:

Remove these settings:

threadpool.bulk.size: 16
threadpool.bulk.queue_size: 1000

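The default values of these settings should be enough to keep the cluster from being overloaded.

Also make sure you size your bulk indexing process correctly, as described here. Depending on the cluster and the data, bulk requests need to be of a certain size. Wanting to ingest as much as possible does not mean you can use any values you like. Every cluster has its limits, and you should test to find yours.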
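
One way to make the bulk size a testable knob is to build the request bodies with explicit caps and then try increasing values against the cluster. The following is a minimal sketch, not the method from the linked guide: `bulk_bodies`, its default caps, and the type name `node` are assumptions for illustration, and the `_type` field reflects the Elasticsearch versions current when this question was asked.

```python
import json


def bulk_bodies(docs, index, doc_type, max_docs=1000, max_bytes=5 * 1024 * 1024):
    """Yield newline-delimited bulk request bodies, each capped by
    document count and by encoded size in bytes."""
    # Action line shared by every document in this sketch.
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    lines, count, size = [], 0, 0
    for doc in docs:
        pair = action + "\n" + json.dumps(doc) + "\n"
        pair_size = len(pair.encode("utf-8"))
        # Flush the current batch if adding this document would exceed a cap.
        if count and (count + 1 > max_docs or size + pair_size > max_bytes):
            yield "".join(lines)
            lines, count, size = [], 0, 0
        lines.append(pair)
        count += 1
        size += pair_size
    if lines:
        yield "".join(lines)


# Made-up documents, only to show how the batches are cut.
docs = [{"id": i, "name": "n%d" % i} for i in range(2500)]
bodies = list(bulk_bodies(docs, "osm", "node", max_docs=1000))
print([body.count("\n") // 2 for body in bodies])  # documents per batch
```

With 2500 small documents and `max_docs=1000` this produces batches of 1000, 1000 and 500 documents. Starting with a small cap and raising it while watching indexing throughput and node stability is one way to find the limit of a given cluster.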

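【Discussion】:

The bulk requests have a size of 5000 and are issued by 30 workers. The cluster seemed able to ingest data at this rate. Without these parameters, and when I reduce the number of workers inserting data, the CPU on my data nodes barely reaches 10%. RAM usage also did not change much while running at full speed. Doesn't that mean the cluster can ingest data at that rate? Shouldn't the cluster be able to go up to at least 60% CPU?
It seems not, since nodes are being removed from the cluster and heap usage is not very high (I haven't seen proof of the heap problem yet, but you mentioned it). Search the node logs for [gc][old] or [gc][young]. Do you have entries like that?
Just checked, and there are no such entries on any of the nodes. I know tuning ES parameters is not recommended, but shouldn't ingestion capacity scale with the cluster?
Yes, that is true. It is strange that you don't have those entries. Since you mentioned the heap problem, I would have expected your bulk process to have affected Java's garbage collection (in a bad way).
Yes, that would be expected. When the node disconnects, I only get this warning hundreds of times and nothing else: [2015-10-30 [elastic-osm-node-2] failed to perform indices:data/write/bulk[s] on remote replica [elastic-osm-node-1][1DpvDUf7SKywJrBgQqs9eg][elastic-osm-node-1.l][inet[xxxxxx]]{master=true}[osm][2] org.elasticsearch.transport.NodeDisconnectedException: [elastic-osm-node-1][inet[/xxxxxxx]][indices:data/write/bulk[s][r]] disconnected ....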
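
The log check suggested in this discussion can be scripted so that it is easy to repeat on every node. This is a rough sketch that only counts lines containing the `[gc][old]` and `[gc][young]` markers named above; `count_gc_entries` is a hypothetical helper, and the sample lines are synthetic placeholders, not real Elasticsearch output.

```python
import re

# Matches the two markers mentioned in the discussion.
GC_MARKER = re.compile(r"\[gc\]\[(old|young)\]")


def count_gc_entries(lines):
    """Count log lines that contain a [gc][old] or [gc][young] marker."""
    counts = {"old": 0, "young": 0}
    for line in lines:
        match = GC_MARKER.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Synthetic placeholder lines, only to exercise the function.
sample_lines = [
    "placeholder line with [gc][old] marker",
    "placeholder line with [gc][young] marker",
    "placeholder line with [gc][young] marker again",
    "failed to perform indices:data/write/bulk[s] on remote replica",
]

print(count_gc_entries(sample_lines))
```

In practice the function would be fed the lines of each node's log file, for example with `open(path)` on a path chosen by the reader. A count of zero on every node corresponds to the "no entries" result reported above.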