1. Overview

In this tutorial, we’ll look at the problem with unallocated shards in Elasticsearch and how to fix it.

2. Elasticsearch Shard Allocation Process

Elasticsearch is a distributed search and analytics engine built on the open-source library Apache Lucene. To support large volumes of data through horizontal scaling, Elasticsearch stores the data of an index in one or more shards.

The Elasticsearch engine runs a shard allocation process that ensures the optimal placement of the shards in the cluster. Specifically, it monitors the cluster status and places the shards in different nodes in the cluster to achieve several objectives.

One of the objectives of the allocation process is to ensure cluster resources are optimally utilized. For example, the process ensures nodes newly added to the cluster will get assigned shards to distribute the load. Similarly, there’ll be reallocation when some nodes leave the system.

Besides that, the shard allocation process places replica shards in a way that preserves the cluster's fault tolerance. Specifically, a replica shard is never assigned to the same node as its primary shard.

Furthermore, we can apply custom allocation rules to control the shard allocation. For example, we can use the node filter to exclude certain nodes from allocation.
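
As a quick illustration, here's a sketch of such a filter using the cluster settings API. The node name node-3 and the cluster address are placeholders for this example:

$ curl -X PUT "my-elasticsearch-cluster:9200/_cluster/settings" -H 'Content-Type: application/json' -d '
{
  "persistent": {
    "cluster.routing.allocation.exclude._name": "node-3"
  }
}'

With this setting in place, the allocation process moves existing shards off the excluded node and avoids placing new shards on it.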

3. Detecting Unallocated Shards

Unallocated shards in an Elasticsearch cluster can significantly impact its health and functionality. Therefore, we must promptly detect any unallocated shards in our cluster.

We can identify unallocated cluster shards using the cat shards API. Specifically, we can send a GET request to the _cat/shards endpoint to get the details of our shard allocation:

$ curl -X GET "my-elasticsearch-cluster:9200/_cat/shards?v&h=index,shard,prirep,state,unassigned.reason"
index       shard prirep state      unassigned.reason
log-entries 0     p      UNASSIGNED INDEX_CREATED
log-entries 1     p      STARTED
log-entries 2     p      UNASSIGNED NODE_LEFT
log-entries 3     p      STARTED
log-entries 4     p      STARTED

The v query parameter includes the header columns in the output. Then, the h query parameter specifies the desired output columns.

The index and shard columns tell us the name of the index and the shard number respectively. Then, the prirep column indicates primary shards with p and replica shards with r. Next, the state column tells us the state of the shard. Finally, the unassigned.reason indicates why the shard is unassigned.

Importantly, the unassigned.reason only tells us why the shard originally became unassigned; it doesn't explain why the shard remains unassigned.
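
In a large cluster, the output can be long, so we may want to narrow it down to the unassigned shards only. As a minimal sketch, reusing the sample output above, we can filter the rows with grep:

$ curl -s "my-elasticsearch-cluster:9200/_cat/shards?h=index,shard,prirep,state,unassigned.reason" | grep UNASSIGNED
log-entries 0     p      UNASSIGNED INDEX_CREATED
log-entries 2     p      UNASSIGNED NODE_LEFT

Here, we drop the v parameter so the header line doesn't interfere with the filtering.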

4. Diagnosing the Root Cause of Unassigned Shards

The Cluster Allocation Explain API provides detailed information about shard allocations. Using this API, we can identify the root cause of prolonged unallocated shards.

To get the allocation explanation for a shard, we pass the index name and shard number in the request body:

$ curl -X POST "http://my-elasticsearch-cluster:9200/_cluster/allocation/explain" -H 'Content-Type: application/json' -d '
{
  "index": "log-entries",
  "shard": 2,
  "primary": true
}'

In the command above, we request the allocation explanation for primary shard number 2 of the log-entries index. The API then returns a response containing various details for diagnosing shard allocation issues:

{
  "index" : "log-entries",
  "shard" : 2,
  "primary" : true,
  "current_state" : "unassigned",
  "unassigned_info" : {
    "reason" : "ALLOCATION_FAILED",
    "at" : "2024-08-03T14:30:45.123Z",
    "last_allocation_status" : "no_attempt"
  },
  "can_allocate" : "yes/no",
  "allocate_explanation" : "...",
  "node_allocation_decisions" : [
    {
      "node_id" : "node1",
      "node_name" : "node-1",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "yes/no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "...",
          "decision" : "...",
          "explanation" : "..."
        }
      ]
    }
    ...
  ]
}

Let’s break down the response structure to understand the meaning of the fields.

Firstly, the unassigned_info contains the reason for the shard's unassigned state. It's important to note that not all unassigned shards indicate an issue. Some shards are unallocated for transient reasons, such as a relocation in progress. It's typically only a problem when unassigned_info.reason shows ALLOCATION_FAILED.

Then, the allocate_explanation indicates the reason why an allocation doesn’t happen. This field is the most helpful in determining the exact root cause of the allocation failure. The explanation text in this field varies according to the failure reason.

Finally, the node_allocation_decisions shows each individual node's decision when asked to allocate the shard. This information is more granular and can help if the allocate_explanation alone doesn't provide sufficient details.

4.1. Unallocated Shard Due to Disk Usage

The first common cause of unallocated shards is disk usage issues. Elasticsearch uses different watermark levels to track disk usage on the nodes. Even if a node still has free disk space, Elasticsearch can refuse to place shards on it once its usage exceeds a certain threshold. This ensures the node keeps enough disk space for critical maintenance work, which protects the stability of the cluster.

For unallocated shards due to insufficient disk space, the explanation API indicates the reason in the response’s allocate_explanation field:

{
  ...
  "can_allocate" : "no",
  "allocate_explanation" : "cannot allocate because all nodes' disk usage exceeds the watermark",
  "node_allocation_decisions" : [
    {
      "node_id" : "node1",
      "node_name" : "node-1",
      "transport_address" : "127.0.0.1:9300",
      "node_decision" : "no",
      "weight_ranking" : 1,
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=85%], using more disk space than the maximum allowed [85.0%], actual free: [10.8%]"
        }
      ]
    },
    {
      "node_id" : "node2",
      ...,
      "node_decision" : "no",
      "weight_ranking" : 2,
      "deciders" : [
        {
          "decider" : "disk_threshold",
          "decision" : "NO",
          "explanation" : "the node is above the high watermark cluster setting [cluster.routing.allocation.disk.watermark.high=85%], using more disk space than the maximum allowed [85.0%], actual free: [12.3%]"
        }
      ]
    }
  ]
}

To solve this problem, we need to add more disk capacity to the cluster. We can achieve that by adding more nodes or increasing the disk capacity of each node in the cluster.
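
Before scaling, it can help to see how much disk each node actually uses. As a sketch, assuming the same cluster address as before, the cat allocation API shows the per-node disk figures:

$ curl -X GET "my-elasticsearch-cluster:9200/_cat/allocation?v&h=node,shards,disk.used,disk.avail,disk.percent"

Freeing disk space, for example by deleting old indices, can also bring the nodes back below the cluster.routing.allocation.disk.watermark.high threshold and unblock allocation.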

4.2. Replica Placement

A replica placement issue arises when the engine cannot place a replica shard on a node because the node already holds a copy of that shard. This can happen when the total number of copies of a shard, the primary plus its replicas, exceeds the number of nodes in the cluster.

If a replica shard remains unassigned due to the replica placement rule, the allocate_explanation field will show relevant messages:

{
  "index": "log-entries",
  "shard": 2,
  "primary": false,
...
  "can_allocate": "no",
  "allocate_explanation": "cannot allocate because a primary shard is already allocated on the same node",
...
}

To solve the replica placement issue, we must bring the replica count in line with the number of nodes in our cluster. Specifically, we can either increase the number of nodes or reduce the number of replicas.
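
For example, if we decide to reduce the replica count, a sketch of the index settings update looks like this (the value 1 is just an illustration):

$ curl -X PUT "my-elasticsearch-cluster:9200/log-entries/_settings" -H 'Content-Type: application/json' -d '
{
  "index": {
    "number_of_replicas": 1
  }
}'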

4.3. Other Common Causes

Another common cause is incompatible allocation rules in the index's routing allocation configuration. For example:

{
  ...
  "allocate_explanation" : "cannot allocate because all nodes do not match index setting [index.routing.allocation.include] filters [datacenter:east]",
  ...
}

The message above indicates a rule requiring the shard to be placed on a node with the attribute datacenter:east. However, no node in the cluster matches that attribute. Therefore, it's important to regularly verify the index and cluster allocation routing rules to ensure they remain valid.
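
If the filter turns out to be stale, we can remove it by resetting the setting to null. Here's a sketch against our example index, assuming the filter was defined on a custom datacenter attribute as in the message above:

$ curl -X PUT "my-elasticsearch-cluster:9200/log-entries/_settings" -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.include.datacenter": null
}'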

In Elasticsearch, the index.routing.allocation.total_shards_per_node and cluster.routing.allocation.total_shards_per_node settings limit the maximum number of shards per node. When the limit is reached, the node won't accept any more shard allocations:

{
  ...
  "allocate_explanation" : "cannot allocate because allocation would exceed the total number of shards per node limit"
  ...
}
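
If the limit is intentionally strict, adding nodes is the way out; otherwise, we can raise it at the index level. Here's a sketch of such an update, where the value 5 is arbitrary and only for illustration:

$ curl -X PUT "my-elasticsearch-cluster:9200/log-entries/_settings" -H 'Content-Type: application/json' -d '
{
  "index.routing.allocation.total_shards_per_node": 5
}'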

5. Conclusion

In this tutorial, we’ve learned that Elasticsearch stores every index as one or more shards that can be distributed. Then, we’ve demonstrated how to get a list of unassigned shards using the cat shards API.

Subsequently, we’ve demonstrated the Cluster Allocation Explain API for diagnosing the root cause of unallocated shards. Finally, we’ve seen several common errors with shard allocations and how they manifest in the Cluster Allocation Explain API’s response.