1. Overview

ElasticSearch, a powerful distributed search and analytics engine, excels at ingesting and querying vast amounts of data. However, there comes a time when data needs to be removed, whether for compliance, storage optimization, or data accuracy reasons.

In this tutorial, we explore various methods for removing data from ElasticSearch, ranging from deleting individual documents to managing large-scale deletions in production environments.

2. Deleting Individual Documents

To begin with, ElasticSearch provides several ways to remove individual documents from an index.

2.1. Using the Delete API

Perhaps the simplest way to remove a single document from ElasticSearch is the Delete API. This method is ideal when we know the exact document ID and index name:

$ curl -X DELETE "localhost:9200/customers/_doc/1"
{"_index":"customers","_id":"1","_version":3,"result":"deleted","_shards":{"total":2,"successful":1,"failed":0},"_seq_no":20,"_primary_term":1}

In this example, customers is the name of the index and 1 is the ID of the document we want to delete.

When we execute this command, ElasticSearch attempts to delete the document with ID 1 from the customers index. If the document exists and is successfully deleted, ElasticSearch returns a JSON response whose result field is set to deleted.
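
The result field is the part of the response worth checking in scripts. As a minimal sketch, the helper below classifies a parsed response body. The name interpret_delete_response is our own and not part of any client library, and the code assumes that ElasticSearch reports not_found in the result field when no document with that ID exists:

```python
import json


def interpret_delete_response(body: dict) -> str:
    """Classify a parsed Delete API response body (hypothetical helper)."""
    result = body.get("result")
    if result == "deleted":
        return f"deleted document {body['_id']} from {body['_index']}"
    if result == "not_found":
        return f"document {body['_id']} does not exist in {body['_index']}"
    return f"unexpected result: {result!r}"


# Response body copied from the curl example above
raw = ('{"_index":"customers","_id":"1","_version":3,"result":"deleted",'
       '"_shards":{"total":2,"successful":1,"failed":0},"_seq_no":20,"_primary_term":1}')
message = interpret_delete_response(json.loads(raw))
print(message)
```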

2.2. Deleting With a Query

On the other hand, when we need to delete multiple documents that match certain criteria, the Delete By Query API is more efficient. This method enables the removal of documents based on a query, similar to how we would search for documents:

$ curl -X POST "localhost:9200/customers/_delete_by_query" -H 'Content-Type: application/json' -d'
{
  "query": {
    "range": {
      "last_purchase_date": {
        "lt": "now-1y"
      }
    }
  }
}'
{"took":258,"timed_out":false,"total":4,"deleted":4,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}

Let’s break down this example:

  • we send a POST request to the _delete_by_query endpoint of the customers index
  • we use a range query to find all documents where the last_purchase_date field is earlier than one year before the current time
  • now-1y is an ElasticSearch date math expression meaning “one year ago from the current time”

This query deletes all customer documents where the last purchase was more than a year ago. It's an efficient way to remove outdated or irrelevant data based on specific criteria.
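
For scripted cleanups, the same request can be sent from Python. The following is a minimal sketch that uses only the standard library. The helper names build_stale_query and delete_by_query are our own, while the host, index, and field names are taken from the curl example above:

```python
import json
import urllib.request


def build_stale_query(field: str, older_than: str) -> dict:
    """Build a range query matching documents whose date field is older than the cutoff."""
    return {"query": {"range": {field: {"lt": f"now-{older_than}"}}}}


def delete_by_query(base_url: str, index: str, body: dict) -> dict:
    """POST the query to the _delete_by_query endpoint and return the parsed response."""
    request = urllib.request.Request(
        f"{base_url}/{index}/_delete_by_query",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


body = build_stale_query("last_purchase_date", "1y")
print(json.dumps(body))

# Sending the request needs a running cluster, so the call is left commented out:
# print(delete_by_query("http://localhost:9200", "customers", body)["deleted"])
```

Keeping the query construction separate from the network call makes it easy to inspect or log the body before anything is actually deleted.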

However, there are some things we should note when using Delete By Query:

  • the operation is not atomic: if it fails midway, some documents may have been deleted while others remain
  • the operation can be resource-intensive for large datasets: it’s usually best to run it during off-peak hours
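
Because the operation isn't atomic, it's worth inspecting the response rather than assuming every match was removed. Here's a rough sketch that relies on the response fields shown in the earlier output (timed_out, total, deleted, version_conflicts, failures). The helper name fully_deleted and the partial response are hypothetical:

```python
def fully_deleted(response: dict) -> bool:
    """Return True only if every matched document was deleted without problems."""
    return (
        not response.get("timed_out", False)
        and not response.get("failures")
        and response.get("version_conflicts", 0) == 0
        and response.get("deleted") == response.get("total")
    )


# Values taken from the Delete By Query response shown earlier
complete = {"timed_out": False, "total": 4, "deleted": 4,
            "version_conflicts": 0, "failures": []}
# Hypothetical partial run: one document changed while the operation was running
partial = {"timed_out": False, "total": 4, "deleted": 3,
           "version_conflicts": 1, "failures": []}

print(fully_deleted(complete), fully_deleted(partial))
```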

We can also add the max_docs parameter (named size in older releases) to limit the number of documents deleted in a single operation. This can further help manage the load on the cluster.
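
As a sketch of how such limits might be passed, the helper below builds the request URL with query-string parameters. It assumes a recent ElasticSearch version, where the cap is passed as max_docs and requests_per_second throttles the operation. The name build_delete_by_query_url and the chosen values are purely illustrative:

```python
from urllib.parse import urlencode


def build_delete_by_query_url(base_url: str, index: str, max_docs: int,
                              requests_per_second: int) -> str:
    """Build a _delete_by_query URL that caps and throttles the operation."""
    params = {
        "max_docs": max_docs,                        # stop after this many documents
        "requests_per_second": requests_per_second,  # throttle the delete rate
    }
    return f"{base_url}/{index}/_delete_by_query?{urlencode(params)}"


url = build_delete_by_query_url("http://localhost:9200", "customers", 1000, 500)
print(url)
```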

3. Bulk Deletion Operations

When we need to delete a large number of documents by ID, bulk operations can significantly improve performance. The Bulk API performs multiple delete operations in a single request, thus reducing network overhead and improving overall efficiency.

Let’s see an example of how to use the Bulk API for deletions using Python with the ElasticSearch client:

from elasticsearch import Elasticsearch, helpers

es = Elasticsearch(["http://localhost:9200"])

def generate_actions(inactive_customer_ids):
    for customer_id in inactive_customer_ids:
        yield {
            "_op_type": "delete",
            "_index": "customers",
            "_id": customer_id
        }

inactive_customer_ids = ["3", "5", "8"]

response = helpers.bulk(es, generate_actions(inactive_customer_ids))
print(f"Deleted {response[0]} documents")

First, we create an ElasticSearch client instance, connecting to the local ElasticSearch server. Then, we define a generator function generate_actions that yields delete actions for each customer ID. After that, we create a list of inactive customer IDs. In a real scenario, such a list might come from a database query or another data source.

Subsequently, we use the helpers.bulk() function to perform the bulk delete operation. Finally, we print the number of documents deleted.

Now, let’s run the script:

$ python3 bulk-removal.py 
Deleted 3 documents

The Bulk API is more efficient than sending individual delete requests for each document because it reduces the number of network round trips to the ElasticSearch cluster and lets the cluster process the deletes in batches instead of one request at a time.
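
On the wire, a bulk request body is newline-delimited JSON with one action line per delete. To illustrate the format, here's a small standard-library sketch that builds such bodies in fixed-size chunks. The name build_bulk_delete_bodies and the chunk size of 2 are our own illustrative choices, not necessarily what the client helper does internally:

```python
import json
from typing import Iterator, List


def build_bulk_delete_bodies(index: str, ids: List[str], chunk_size: int) -> Iterator[str]:
    """Yield newline-delimited JSON bodies, each holding at most chunk_size delete actions."""
    for start in range(0, len(ids), chunk_size):
        chunk = ids[start:start + chunk_size]
        lines = [json.dumps({"delete": {"_index": index, "_id": doc_id}}) for doc_id in chunk]
        # The Bulk API requires the body to end with a newline
        yield "\n".join(lines) + "\n"


bodies = list(build_bulk_delete_bodies("customers", ["3", "5", "8"], chunk_size=2))
for body in bodies:
    print(body, end="")
```

Each body would then be sent in its own POST request to the _bulk endpoint with the Content-Type header set to application/x-ndjson.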

4. Removing Data With Index Operations

In addition to document-level operations, sometimes we might need to remove larger chunks of data. In such cases, index-level operations can be more efficient.

4.1. Deleting an Entire Index

If we need to remove all data from an index, deleting the entire index is the fastest approach:

$ curl -X DELETE "localhost:9200/customers"
{"acknowledged":true}

This command deletes the customers index and all its data. Notably, it’s an extremely fast operation but it’s also irreversible.

This method is useful when we manage time-based indices and want to remove old data. For example, it's a common way to drop last month's log index.
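
With time-based indices, picking the ones to drop is mostly date arithmetic. The sketch below assumes a hypothetical monthly naming scheme of logs-YYYY.MM and a retention period expressed in months. Both the pattern and the helper name expired_indices are illustrative:

```python
from datetime import date
from typing import List

PREFIX = "logs-"  # hypothetical naming scheme: logs-YYYY.MM


def expired_indices(names: List[str], today: date, keep_months: int) -> List[str]:
    """Return monthly log indices that fall outside the retention window.

    The current month counts as the first of the keep_months months to keep.
    """
    # Count months on a single axis so year boundaries are handled correctly
    current = today.year * 12 + today.month
    expired = []
    for name in names:
        if not name.startswith(PREFIX):
            continue  # not a time-based log index, leave it alone
        year_text, month_text = name[len(PREFIX):].split(".")
        age_in_months = current - (int(year_text) * 12 + int(month_text))
        if age_in_months >= keep_months:
            expired.append(name)
    return expired


names = ["logs-2023.12", "logs-2024.03", "logs-2024.04", "logs-2024.06", "customers"]
to_delete = expired_indices(names, today=date(2024, 6, 15), keep_months=3)
print(to_delete)
```

Each returned name can then be passed to the DELETE request shown above.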

4.2. Using Aliases for Zero-Downtime Reindexing

Alternatively, for a more nuanced approach that enables the removal of data while maintaining availability, we can use index aliases. This method is particularly useful when we want to remove a subset of data from an index without any downtime.

To start, we create the alias for the existing index:

$ curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "add": { "index": "customers", "alias": "current_customers" }}
  ]
}'
{"acknowledged":true,"errors":false}

Then, we create a new index with the desired mappings:

$ curl -X PUT "localhost:9200/customers_v2" -H 'Content-Type: application/json' -d'
{
  "mappings": {
    "properties": {
      "email": { "type": "keyword" },
      "name": { "type": "text" }
    }
  }
}'
{"acknowledged":true,"shards_acknowledged":true,"index":"customers_v2"}

Next, we reindex the data, excluding inactive customers:

$ curl -X POST "localhost:9200/_reindex" -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "customers",
    "query": {
      "bool": {
        "must_not": {
          "term": { "status": "inactive" }
        }
      }
    }
  },
  "dest": {
    "index": "customers_v2"
  }
}'
{"took":251,"timed_out":false,"total":7,"updated":0,"created":7,"deleted":0,"batches":1,"version_conflicts":0,"noops":0,"retries":{"bulk":0,"search":0},"throttled_millis":0,"requests_per_second":-1.0,"throttled_until_millis":0,"failures":[]}

Finally, we switch the alias to the new index:

$ curl -X POST "localhost:9200/_aliases" -H 'Content-Type: application/json' -d'
{
  "actions": [
    { "remove": { "index": "customers", "alias": "current_customers" }},
    { "add":    { "index": "customers_v2", "alias": "current_customers" }}
  ]
}'
{"acknowledged":true,"errors":false}

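Using this method, applications can continue to query the current_customers alias throughout the process. However, documents written to the old index after the reindex has started aren't copied automatically, so it's best to pause writes or replay them before switching the alias. Once the reindexing is complete and the alias is switched, the old index can be deleted.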
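
Since the remove and add actions travel in one request, the switch is a single step from the application's point of view. As a sketch, the payload for that request can be generated for any pair of indices. The helper build_alias_swap is hypothetical, and the names come from the example above:

```python
import json


def build_alias_swap(alias: str, old_index: str, new_index: str) -> dict:
    """Build an _aliases payload that moves the alias from old_index to new_index."""
    return {
        "actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]
    }


payload = build_alias_swap("current_customers", "customers", "customers_v2")
print(json.dumps(payload, indent=2))
```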

5. Conclusion

In this article, we explored various methods for removing data from ElasticSearch, ranging from deleting individual documents to managing large-scale deletions in production environments. We covered the Delete API, the Delete By Query API, the Bulk API, and index-level operations.

With these techniques, we can effectively manage data in ElasticSearch clusters, ensuring optimal performance, compliance with data retention policies, and efficient use of storage resources.