1. Overview

Elasticsearch is a powerful, distributed search and analytics engine built on Apache Lucene. Its distributed design makes it performant and scalable even when handling vast numbers of documents.

Among its many features, Elasticsearch offers an optimistic concurrency control mechanism to ensure data integrity under highly concurrent access to documents.

In this tutorial, we’ll learn about optimistic concurrency control in Elasticsearch and see how the version_conflict_engine_exception error relates to this mechanism.

2. The Cause of version_conflict_engine_exception

In an Elasticsearch environment with high concurrent write traffic, we might find version_conflict_engine_exception errors in the logs:

{
  "error": {
    "root_cause": [
      {
        "type": "version_conflict_engine_exception",
        "reason": "[1]: version conflict, required seqNo [17], primary term [3]. current document has seqNo [18] and primary term [3]",
        "index_uuid": "QuhLhsE7Qh-Yc7XbgiPO1g",
        "shard": "0",
        "index": "mainstore"
      }
    ],
    "type": "version_conflict_engine_exception",
    "reason": "[1]: version conflict, required seqNo [17], primary term [3]. current document has seqNo [18] and primary term [3]",
    "index_uuid": "QuhLhsE7Qh-Yc7XbgiPO1g",
    "shard": "0",
    "index": "mainstore"
  },
  "status": 409
}

The log tells us that our operation has been rejected with a status code 409. Additionally, it provides an error message stating the cause of the exception. In this specific example, the engine refuses to update the document because the document’s current seqNo (18) is higher than the seqNo the request requires (17).

Essentially, the message tells us that the Elasticsearch engine rejects the request because its optimistic concurrency control mechanism detected a possible conflicting update. Let’s look at concurrency control in general.

3. Concurrency Control

In the context of data stores, concurrency control refers to the mechanisms for managing simultaneous operations without conflicts. This is particularly relevant to mutating operations like an update to an existing resource. The concurrency control mechanism helps prevent issues caused by conflicting updates, such as lost updates.

3.1. Lost Update

The lost update problem is a concurrency issue that occurs when two or more transactions read the same data and then update it concurrently in different ways. As a result, one update overwrites another, and the earlier change is lost. Let’s look at an example.

Consider that we’re keeping track of the amount of stock in a document in our data store, and let’s assume the initial stock count is 100. Then, two separate but concurrent processes wish to update the stock count:

(Image: lost update overview)

In our example, process A wants to add 10 units to the current stock count and process M wants to deduct 20 units from the current stock count. To do that, both processes will first retrieve the current value of the stock count, change the value, and persist the record. Without proper concurrency control, the final value is non-deterministic when both processes execute concurrently.

Here’s how the sequence of events takes place as time progresses:

(Image: conflicting update timeline)

If process A writes last, the final value will be 110; if process M writes last, it will be 80. Either way, the correct value of 90 (100 + 10 - 20) is never stored: one update is effectively lost because its changes are overwritten. Worse, neither process realizes anything went wrong, so the data is corrupted silently.
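
To make the race concrete, here’s a minimal bash sketch of such a non-atomic read-modify-write, assuming a hypothetical mainstore index with a document ID 1 that holds a stock_count field, and the jq utility for parsing JSON. Two copies of this snippet running concurrently can silently overwrite each other’s result:

# Read the current stock count (another process may write in between)
current=$(curl -s "http://localhost:9200/mainstore/_doc/1" | jq '._source.stock_count')

# Modify the value locally, e.g., add 10 units like process A
new=$((current + 10))

# Persist the new value without any version check
curl -s -X POST "http://localhost:9200/mainstore/_update/1" -H 'Content-Type: application/json' \
  -d "{\"doc\": {\"stock_count\": $new}}"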

The solution to the lost update problem involves enforcing concurrency control on the resources.

3.2. Types of Concurrency Control

Generally, concurrency control mechanisms fall into two distinct types: pessimistic concurrency control and optimistic concurrency control.

Pessimistic concurrency control typically requires a process to acquire an explicit lock before it can access a particular resource. When the lock is unavailable, the process waits until the current owner releases it.

This serializes access to the resource, so there are no lost updates: each process always mutates and persists the latest data. For example, in a relational database management system (RDBMS) such as PostgreSQL, we can lock a row using the SELECT ... FOR UPDATE syntax.

On the other hand, optimistic concurrency control uses a different model to ensure data correctness under concurrent access. As the name implies, it assumes an update is non-conflicting unless proven otherwise. The proof typically involves a piece of versioning information attached to the data that changes on every mutation.

Through the version value, a process can determine whether it’s updating stale data. The @Version annotation from JPA, commonly used with Spring Data JPA, is one example of an optimistic concurrency control mechanism.

The advantage of optimistic concurrency control is that it’s generally faster than the pessimistic concurrency control mechanism as there’s no overhead for acquiring and releasing a lock. Additionally, an optimistic concurrency control mechanism doesn’t run the risk of deadlock, unlike pessimistic concurrency control mechanisms.

4. Elasticsearch Optimistic Concurrency Control

Elasticsearch uses an optimistic concurrency control mechanism for controlling concurrent mutations to the documents in an index. Specifically, Elasticsearch uses the sequence number and primary term values to detect possible conflicting updates to the same document.

4.1. Sequence Number and Primary Term

The sequence number is a counter that the primary shard coordinating the change assigns to each operation performed on an index. Every mutating operation, such as adding a new document or updating an existing one, increments the sequence number.

The primary term is an integer value that monotonically increments whenever the primary shard is reassigned.

For each write request, the Elasticsearch engine returns the resulting values in the response. For example, Elasticsearch returns the sequence number and primary term in the response body when we update an existing document in an index:

$ curl -X POST "http://localhost:9200/mainstore/_update/1" -H 'Content-Type: application/json' -d"
  {
    \"doc\": {
    \"stock_count\": 100
    }
  }"
{
  "_index": "mainstore",
  "_id": "1",
  "_version": 3,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 2,
  "_primary_term": 1
}

In our example above, the sequence number, _seq_no, for the latest operation on document ID 1 in the mainstore index is 2, and the primary term, _primary_term, is 1. When we update the document again, Elasticsearch increments the _seq_no value to reflect the new operation:

$ curl -X POST "http://localhost:9200/mainstore/_update/1" -H 'Content-Type: application/json' -d"
  {
    \"doc\": {
    \"stock_count\": 100
    }
  }"
{
  "_index": "mainstore",
  "_id": "1",
  "_version": 4,
  "result": "updated",
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0
  },
  "_seq_no": 3,
  "_primary_term": 1
}
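
Besides inspecting write responses, we can also read a document’s current values at any time. For instance, the document GET API returns the _seq_no and _primary_term fields alongside the document source:

$ curl "http://localhost:9200/mainstore/_doc/1"

This is handy when we want to capture the values we’ll later use for a conditional write.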

4.2. The Mechanism

Elasticsearch’s optimistic concurrency control first checks whether the primary term value of a request matches the document’s current value. When the values differ, the engine refuses the update request because the sequence number the request carries might no longer be relevant after a primary shard reassignment.

If the primary term matches, the engine then compares the sequence number the request requires with the document’s current sequence number. To ensure the document wasn’t changed between the time it was retrieved and the time of the update, the engine expects these two values to match.

When the document’s current sequence number is already higher than the one the request requires, another operation has modified the document in the meantime, so the engine rejects the request because it might be operating on stale data.
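
Applications can also take part in this check explicitly. The index and update APIs accept the if_seq_no and if_primary_term parameters, and the engine only applies the operation if the document’s current values still match them. As a sketch, assuming document ID 1 currently has _seq_no 3 and _primary_term 1, as in the earlier response, a conditional write looks like this:

$ curl -X PUT "http://localhost:9200/mainstore/_doc/1?if_seq_no=3&if_primary_term=1" -H 'Content-Type: application/json' -d"
  {
    \"stock_count\": 110
  }"

If another process modifies the document first, its _seq_no advances past 3, and this request fails with version_conflict_engine_exception instead of silently overwriting the newer data.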

4.3. Optimistic Concurrency Control in Action

To see the optimistic concurrency control in action, we write a bash function that updates a document:

$ cat update-document-function.sh
update_document() {
  curl "http://localhost:9200/mainstore/_update/1" -H 'Content-Type: application/json' -d"
  {
    \"doc\": {
    \"stock_count\": 100
    }
  }"
}

The update_document function uses the Update API to update document ID 1 in the mainstore index. Specifically, it performs a partial update, merging the fields specified in the curl request body into the existing document.

Then, we invoke the update_document function twice to simulate a concurrent update:

$ source update-document-function.sh
$ (update_document) &
(update_document) &
wait
...
{"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[1]: version conflict, required seqNo [17], primary term [3]. current document has seqNo [18] and primary term [3]"..."status":409}
{"_index":"mainstore","_id":"1","_version":19,"result":"updated","_shards":{"total":2,"successful":2,"failed":0},"_seq_no":18,"_primary_term":3}

The snippet above uses the ampersand operator to put both invocations in the background. This runs them as two separate background processes that execute concurrently, so both update requests reach Elasticsearch at roughly the same time.

We can see from the truncated output that one request succeeded and another failed due to version_conflict_engine_exception.

4.4. Retry on Conflict

The Update API supports the retry_on_conflict query parameter to specify how many times the operation should be retried when a conflict occurs. We can update our code above to retry up to three times before reporting an error:

$ cat update-document-function-retry.sh
update_document_retry() {
  curl "http://localhost:9200/mainstore/_update/1?retry_on_conflict=3" -H 'Content-Type: application/json' -d"
  {
    \"doc\": {
        \"stock_count\": 100
    }
  }"
}
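
We can then repeat the earlier concurrent invocation with the retrying function. On a conflict, the update is retried internally against the latest version of the document, so with up to three retries both requests should normally complete without surfacing a 409 to the caller:

$ source update-document-function-retry.sh
$ (update_document_retry) &
(update_document_retry) &
wait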

5. Conclusion

In this tutorial, we first learned about concurrency control and the lost update problem. We saw that pessimistic concurrency control prevents lost updates by serializing access to a resource, while optimistic concurrency control detects conflicting writes through version information. Then, we looked at Elasticsearch’s optimistic concurrency control mechanism, which relies on the sequence number and primary term values.

Finally, we saw the optimistic concurrency control mechanism in action by reproducing the version_conflict_engine_exception error.