The story of fixing UNASSIGNED shards caused by a "No space left on device" error in Elasticsearch

This article is the day 20 entry of the "Elastic Stack (Elasticsearch) Part 2 Advent Calendar 2019".

I wanted to analyze the data stored in Elasticsearch, but instead found myself wrestling with Elasticsearch's own troubles. I'm @ysd_marrrr.

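I look after an Elasticsearch server whose design, in hindsight, was poor: path.data lives on the system partition, so the accumulating data kept squeezing that partition. I received an alert that the system partition was running out of space and had to look into it.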

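After resolving the partition space problem (the precondition for everything below), I checked the Elasticsearch indices and found the status was RED.
This is a cluster, and the status can turn RED when one of its nodes goes down, so I set out to restart the dead node, only to find that every node in the cluster was alive.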

Looking at the shards of the RED index, I found that some of them were in the UNASSIGNED state.

$ curl "http://localhost:9200/_cat/shards/myindex1"
myindex1 1  p STARTED 4822406 33.5gb 10.127.110.1 elasticsearch-node1
myindex1 2  p STARTED 4818526 34.6gb 10.127.110.1 elasticsearch-node1
myindex1 3  p UNASSIGNED 4799590 33.3gb 10.127.110.2  elasticsearch-node2
myindex1 4  p STARTED 4824062 33.7gb 10.127.110.3  elasticsearch-node3
myindex1 5  p UNASSIGNED 4804203   34gb 10.127.110.2  elasticsearch-node2
myindex1 6  p STARTED 4824062 33.7gb 10.127.110.3  elasticsearch-node3
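
With more indices than this, filtering the _cat/shards output can be easier than reading it by eye. A minimal sketch, assuming the default column order shown above (index, shard, prirep, state, ...); the function name list_unassigned is made up for this example and is not part of Elasticsearch:

```shell
# Hypothetical helper: print "index shard prirep" for every shard whose
# state column (the 4th field of _cat/shards) is UNASSIGNED.
list_unassigned() {
  awk '$4 == "UNASSIGNED" { print $1, $2, $3 }'
}

# Usage against a live cluster (host and index name as in this article):
#   curl -s "http://localhost:9200/_cat/shards/myindex1" | list_unassigned
```

The filter only looks at the state column, so it works whether or not the remaining columns are filled in for an unassigned shard.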

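When a partition runs out of space, shards can drop out and turn the index RED. I recovered from that with a simple method that does not delete any data, and I am sharing it here.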

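Environment:

I verified this in the following environment.
* This is the Elasticsearch 5.x series, so for 6.x and later please check for yourself the conventions for sending JSON with curl.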

$ curl "http://localhost:9200/"
{
  "version" : {
    "number" : "5.6.8",
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}
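
On the 6.x note above: from Elasticsearch 6.0 on, a request that carries a body has to declare its Content-Type, so the `-d '{...}'` calls in this article need an extra header there. A rough sketch of a wrapper, where es_json and the ES_URL variable are names invented for this example:

```shell
# Hypothetical wrapper: send a JSON body with an explicit Content-Type header,
# which Elasticsearch 6.0 and later require for requests that carry a body.
# ES_URL is an assumed variable; it defaults to the host used in this article.
es_json() {
  # $1 = HTTP method, $2 = path starting with /, $3 = JSON body
  curl -sS -X "$1" "${ES_URL:-http://localhost:9200}$2" \
    -H 'Content-Type: application/json' \
    -d "$3"
}

# Usage:
#   es_json PUT /_cluster/settings '{"transient": {"cluster.routing.allocation.enable": "all"}}'
```

On 5.x the header is optional, so the same wrapper should also work against the cluster described here.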

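⚠ This assumes a situation where no replicas are configured, yet losing the data would be a considerable loss.
⚠ I do not think this is suitable for production environments where data must never be lost.
⚠ Make sure there is enough free storage before doing any of this.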

Searching for "Elasticsearch UNASSIGNED" mostly turns up one approach: sacrificing the shard

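When I looked into cases where an UNASSIGNED shard had been "resolved", the fix that stood out was: "the problem hit a shard that was safe to delete, so we deleted it and allocated an empty shard in its place".
That is fine if you have replicas or backups to fall back on, but deleting the shard would have hurt in my case, so I looked for another way.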

How to forcibly repair unassigned shards (status red) in Elasticsearch 5

Allocating Elasticsearch shards that have become unassigned - notebook

Check with _cluster/allocation/explain → the fix was actually written right there!!

When shards are UNASSIGNED, the first thing I do is check their allocation status with this API.
Sure enough, it reported No space left on device.

$ curl "http://localhost:9200/_cluster/allocation/explain?pretty"
{
  "index" : "myindex1",
  "shard" : 6,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2019-12-01T19:50:00.027Z",
    "failed_allocation_attempts" : 5,
    "details" : "failed to create shard, failure IOException[No space left on device]",
    "last_allocation_status" : "no"
  },
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because allocation is not permitted to any of the nodes that hold an in-sync shard copy",
  "node_allocation_decisions" : [
    {
      "node_id" : "-CLqY8ecTdSkufWq0ba28w",
      "node_name" : "elasticsearch-node3",
      "transport_address" : "10.127.100.3:9300",
      "node_decision" : "no",
      "store" : {
        "in_sync" : true,
        "allocation_id" : "pMm2kOjhRHqJwGba2A7U3Q"
      },
      "deciders" : [
        {
          "decider" : "max_retry",
          "decision" : "NO",
          "explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-12-01T19:50:00.027Z], failed_attempts[5], delayed=false, details[failed to create shard, failure IOException[No space left on device]], allocation_status[deciders_no]]]"
        }
      ]
    },
    {
      "node_id" : "7fhGQ7jQTTS0zTh5YI-GAg",
      "node_name" : "elasticsearch-node1",
      "transport_address" : "10.127.100.1:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    },
    {
      "node_id" : "s5J-96kgRraq1E56jHTv2Q",
      "node_name" : "elasticsearch-node2",
      "transport_address" : "10.127.100.2:9300",
      "node_decision" : "no",
      "store" : {
        "found" : false
      }
    }
  ]
}
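
Called without a body, as above, the API picks an unassigned shard on its own. It also accepts a body naming the index, the shard number and whether the primary is meant, which helps when several shards are unassigned. A small sketch; explain_body is a made-up helper name:

```shell
# Hypothetical helper: build the request body that asks
# _cluster/allocation/explain about one specific shard.
explain_body() {
  # $1 = index name, $2 = shard number, $3 = true for the primary, false for a replica
  printf '{"index": "%s", "shard": %s, "primary": %s}\n' "$1" "$2" "$3"
}

# Usage (5.x as in this article; add -H 'Content-Type: application/json' on 6.x+):
#   curl -s -XGET "http://localhost:9200/_cluster/allocation/explain?pretty" \
#     -d "$(explain_body myindex1 5 true)"
```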

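And in fact, the fix is casually written right there in this result.

"explanation" : "shard has exceeded the maximum number of retries [5] on failed allocation attempts - manually call [/_cluster/reroute?retry_failed=true] to retry, [unassigned_info[[reason=ALLOCATION_FAILED], at[2019-12-01T19:50:00.027Z], failed_attempts[5], delayed=false, details[failed to create shard, failure IOException[No space left on device]], allocation_status[deciders_no]]]"

So I sent a POST request to the API it names.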

$ curl -XPOST 'http://localhost:9200/_cluster/reroute?retry_failed=true&pretty'

Solved without a hitch ✌

$ curl "http://localhost:9200/_cat/shards/myindex1"
myindex1 1  p STARTED 4822406 33.5gb 10.127.110.1 elasticsearch-node1
myindex1 2  p STARTED 4818526 34.6gb 10.127.110.1 elasticsearch-node1
myindex1 3  p STARTED 4799590 33.3gb 10.127.110.2  elasticsearch-node2
myindex1 4  p STARTED 4824062 33.7gb 10.127.110.3  elasticsearch-node3
myindex1 5  p STARTED 4804203   34gb 10.127.110.2  elasticsearch-node2
myindex1 6  p STARTED 4824062 33.7gb 10.127.110.3  elasticsearch-node3
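
Large shards take a while to come back after the retry, so instead of re-running _cat/shards by hand, the cluster health API can be asked to wait. A sketch under assumptions: retry_failed_and_wait and ES_URL are names made up here, and yellow is used as the target because it means all primaries are assigned.

```shell
# Hypothetical helper: retry failed allocations, then block until the index
# reports at least yellow (all primaries assigned) or the timeout expires.
# Note: it sets a plain shell variable named base.
retry_failed_and_wait() {
  # $1 = index name, $2 = timeout in Elasticsearch time units, e.g. 60s
  base="${ES_URL:-http://localhost:9200}"
  curl -sS -XPOST "$base/_cluster/reroute?retry_failed=true" >/dev/null || return 1
  curl -sS "$base/_cluster/health/$1?wait_for_status=yellow&timeout=$2&pretty"
}

# Usage:
#   retry_failed_and_wait myindex1 60s
```

If the status is not reached in time, the health response should carry "timed_out" : true, in which case _cluster/allocation/explain is again the place to look.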

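What did not resolve the UNASSIGNED state this time

Was deleting the unassigned shards really my only option...? Before finding the fix above I tried hard to find alternatives, and none of the following gave me an answer.

The allocate command of /_cluster/reroute did not work.

Shards can end up UNASSIGNED when their allocation goes wrong, and reportedly this can be fixed by sending an allocate command to /_cluster/reroute.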

Allocating Elasticsearch shards that became unassigned on Google Compute - notebook

Elasticsearch - what to do with unassigned shards - Stack Overflow
https://stackoverflow.com/questions/23656458/elasticsearch-what-to-do-with-unassigned-shards/23816954#23816954

Let's try it.

$ curl -XPOST 'http://localhost:9200/_cluster/reroute?pretty' -d '{
  "commands": [{
    "allocate": {
      "index": "myindex1",
      "shard": 5,
      "node": "elasticsearch-node3",
      "allow_primary": true
    }
  }]
}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "unknown_named_object_exception",
        "reason" : "Unknown AllocationCommand [allocate]",
        "line" : 3,
        "col" : 17
      }
    ],
    "type" : "parsing_exception",
    "reason" : "[cluster_reroute] failed to parse field [commands]",
    "line" : 3,
    "col" : 17,
    "caused_by" : {
      "type" : "unknown_named_object_exception",
      "reason" : "Unknown AllocationCommand [allocate]",
      "line" : 3,
      "col" : 17
    }
  },
  "status" : 400
}

In effect it answered "allocate? Never heard of it!", which left me stumped.
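
The error has a plain cause: from Elasticsearch 5.0 the single allocate command was split into allocate_replica, allocate_stale_primary and allocate_empty_primary, and the two primary variants require an explicit "accept_data_loss": true. The sketch below only builds such a body. It was not run as part of this story, whether it would have helped here is untested, and the node named must actually hold a copy of the shard. stale_primary_body is a made-up helper name.

```shell
# Hypothetical helper: build a reroute body using a 5.x command name.
# allocate_stale_primary promotes an existing on-disk copy on the given node
# and can lose data, hence the mandatory accept_data_loss flag.
stale_primary_body() {
  # $1 = index name, $2 = shard number, $3 = node name holding a copy of the shard
  printf '{"commands": [{"allocate_stale_primary": {"index": "%s", "shard": %s, "node": "%s", "accept_data_loss": true}}]}\n' "$1" "$2" "$3"
}

# Not run in this article. Usage would look like:
#   curl -XPOST "http://localhost:9200/_cluster/reroute?pretty" \
#     -d "$(stale_primary_body myindex1 5 elasticsearch-node2)"
```

Given the warnings at the top of this article, retry_failed=true remains the safer first step, since it does not ask Elasticsearch to accept data loss.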

"index.routing.allocation.disable_allocation": false did not work

One answer, along the lines of "I solved this by asking Elasticsearch support!", sets "index.routing.allocation.disable_allocation" to false.

sharding - ElasticSearch: Unassigned Shards, how to fix? - Stack Overflow
https://stackoverflow.com/a/20010544

curl -XPUT 'localhost:9200/<index>/_settings' \
    -d '{"index.routing.allocation.disable_allocation": false}'

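This likewise failed, this time with "unknown setting". Replacing the <index> placeholder with the name of an index that really is allocated gave the same result.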

$ curl -XPUT 'http://localhost:9200/_settings?pretty' -d ' {"index.routing.allocation.disable_allocation": false}'
{
  "error" : {
    "root_cause" : [
      {
        "type" : "illegal_argument_exception",
        "reason" : "unknown setting [index.routing.allocation.disable_allocation] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
      }
    ],
    "type" : "illegal_argument_exception",
    "reason" : "unknown setting [index.routing.allocation.disable_allocation] please check that any required plugins are installed, or check the breaking changes documentation for removed settings"
  },
  "status" : 400
}

Reading that answer again, it clearly says "v0.90.x and earlier". I also tried "cluster.routing.allocation.enable": "all", but it had no effect.

# v0.90.x and earlier
curl -XPUT 'localhost:9200/_settings' -d '{
    "index.routing.allocation.disable_allocation": false
}'

# v1.0+
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
    "transient" : {
        "cluster.routing.allocation.enable" : "all"
    }
}'

About watermarks

Reportedly, shards may not be allocated when the disk watermark is set too low, so I tried changing the watermark settings.
https://www.elastic.co/guide/en/elasticsearch/reference/current/disk-allocator.html

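If no node has enough disk space, the master node may be unable to assign shards (it will not assign shards to nodes that have more than 85% of their disk in use).

Reason 5: Low disk space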
https://www.datadoghq.com/ja/blog/elasticsearch-unassigned-shards/

$ curl -XPUT 'http://localhost:9200/_cluster/settings' -d '{
    "transient": {  
          "cluster.routing.allocation.disk.watermark.low": "90%",
          "cluster.routing.allocation.disk.watermark.high": "95%"
    }
}'

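Right after changing the watermarks, a log message appeared saying that shards would be relocated because disk usage exceeded the watermark.

[2019-12-02T15:16:49,472][WARN ][o.e.c.r.a.DiskThresholdMonitor] [elasticsearch-node1] high disk watermark [90%] exceeded on [-CLqY8ecTdSkufWq0ba28w][elasticsearch-node1][/mnt/elasticsearch/data/nodes/0] free: 8.4gb[8.4%], shards will be relocated away from this node
[2019-12-02T15:16:49,472][INFO ][o.e.c.r.a.DiskThresholdMonitor] [elasticsearch-node1] rerouting shards: [high disk watermark exceeded on one or more nodes]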

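Reaching the point of "No space left on device" suggests that, for some reason, the watermark check did not do its job and shards were not moved away in time.
The watermarks had been at their defaults (low 85%, high 90%), and it is still unclear why the defaults did not work as intended.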
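
To see how close each node is to a watermark, the _cat/allocation API can report disk usage per node. A minimal sketch, assuming the columns are requested as h=node,disk.percent; nodes_over is a name invented for this example:

```shell
# Hypothetical helper: from "node disk.percent" lines, print the nodes whose
# disk usage is at or above the given percentage.
nodes_over() {
  # $1 = threshold in percent, e.g. 85 for the default low watermark
  awk -v limit="$1" '$2 + 0 >= limit + 0 { print $1, $2 }'
}

# Usage:
#   curl -s "http://localhost:9200/_cat/allocation?h=node,disk.percent" | nodes_over 85
```

Rows without a percentage, such as the UNASSIGNED row that _cat/allocation prints while shards are unassigned, count as 0 and are skipped.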

After that I again tried things like the allocate command of "/_cluster/reroute", but the UNASSIGNED state did not change.

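In closing

I am really glad _cluster/allocation/explain came to the rescue.
If you go hunting for commands in a situation where you cannot restore from a backup and are then told "Unknown AllocationCommand [allocate]", things get quite painful. So, to make recovery easy, do set up replicas and backups.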