Elasticsearch: Disaster Recovery Plan

This document outlines the disaster recovery plan for an Elasticsearch cluster with 3 nodes. It covers procedures to ensure high availability, data protection, and recovery from failure scenarios.

Before proceeding with the following steps, connect to the target machine, or run the commands against the machine’s external IP address if you have the necessary permissions.

Cluster Architecture Overview

  • Total Nodes: 3 master-eligible nodes.

  • Quorum Requirement: A majority of nodes (2 out of 3) must be available for the cluster to remain functional.

  • Snapshots:

    • Configured for regular backups to a remote location (DigitalOcean Spaces)

    • Configured for regular backups of the virtual machines (droplets)
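
A minimal sketch of how the Spaces repository can be registered, assuming the repository-s3 plugin/module is available on every node, the S3 client endpoint (s3.client.default.endpoint) already points at the Spaces region, and the access keys are stored in the Elasticsearch keystore; the repository name do-spaces-repo and the bucket es-backups are placeholders:

    # Register the S3-compatible snapshot repository (name and bucket are placeholders)
    curl -u "user:password" -X PUT "http://localhost:9200/_snapshot/do-spaces-repo" \
      -H "Content-Type: application/json" \
      -d '{"type": "s3", "settings": {"bucket": "es-backups"}}'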

Common Failure Scenarios and Recovery Steps

1. One Node Fails

Impact:

  • The cluster remains operational.

  • Quorum is maintained (2 out of 3 nodes).

Recovery Steps:

  1. Identify the failed node using monitoring tools (Kibana, Elastic Monitoring API).

    curl -u "user:password" -X GET "http://localhost:9200/_cat/nodes?v&pretty"
  2. Investigate the issue (droplet failure, network outage, or container crash).

  3. Restore the failed node:

    • Restore Droplet from the snapshot in DigitalOcean (if needed)

    • Restart the Elasticsearch container.

    • Verify the node rejoins the cluster using GET _cat/nodes.

  4. Monitor shard reallocation to ensure data is rebalanced across nodes.
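
To monitor the shard reallocation in step 4, the standard cat recovery and cluster health APIs can be used (the credentials are the same placeholders used above):

    curl -u "user:password" -X GET "http://localhost:9200/_cat/recovery?v&active_only=true"
    curl -u "user:password" -X GET "http://localhost:9200/_cluster/health?pretty"

Rebalancing is complete when no active recoveries remain, unassigned_shards is 0, and the cluster status is green.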

2. Two Nodes Fail

Impact:

  • The cluster becomes unavailable.

  • Quorum is lost, and no master node can be elected.

  • Writes and reads are not possible until quorum is restored.

Recovery Steps:

  1. Determine which nodes are offline.

  2. Check the logs for errors: docker logs <container>

  3. Restore at least one of the failed nodes:

    • Restore Droplet from the snapshot in DigitalOcean (if needed)

    • Restart the nodes using docker-compose up.

  4. Once at least 2 nodes are operational, verify the cluster state:

    • Use curl -X GET "http://127.0.0.1:9200/_cluster/health?pretty".

    • Ensure the cluster status moves from red to yellow or green.

  5. Begin the re-indexing process in Elasticsearch:

    • Connect to the API container in the Docker Swarm cluster (see the sketch after this list).

    • Execute the following commands:

      php artisan scout:import -c 500 Modules\\Profile\\Models\\Profile
      php artisan scout:import -c 500 Modules\\Vacancy\\Models\\Vacancy
      php artisan scout:import -c 500 Modules\\Publication\\Models\\Publication
  6. Investigate and fix the cause of failure to prevent recurrence.
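
The exact service name depends on the stack definition, so the following is only a sketch, assuming the Laravel API runs as a Swarm service named api; it locates the container and opens a shell in which the scout:import commands from step 5 can be executed:

    # Find which node runs the API service ("api" is an assumed service name)
    docker service ps api
    # On that node, open a shell inside the running container
    docker exec -it $(docker ps -q -f "name=api" | head -n 1) sh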

3. Data Loss or Corruption

Impact:

  • Data stored in the cluster is partially or fully lost.

Recovery Steps:

  1. Verify the availability of the latest snapshot:

    • Get the list of available snapshots

      curl -u "user:password" -X GET "http://localhost:9200/_cat/snapshots/?v&s=id&pretty"
  2. Restore the data from the most recent snapshot (see the sketch after this list).

  3. Monitor the restoration process:

    • Use GET _cat/recovery to track shard recovery progress.

  4. Begin the re-indexing process in Elasticsearch.
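
A sketch of the restore in step 2, assuming the Spaces repository is named do-spaces-repo, the snapshot name has been taken from the listing in step 1, and profiles stands in for an affected index (an open index must be closed before it can be restored over):

    # Close the affected index ("profiles" is a placeholder index name)
    curl -u "user:password" -X POST "http://localhost:9200/profiles/_close"
    # Restore the index from the chosen snapshot
    curl -u "user:password" -X POST "http://localhost:9200/_snapshot/do-spaces-repo/<snapshot-name>/_restore" \
      -H "Content-Type: application/json" \
      -d '{"indices": "profiles", "include_global_state": false}'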


Restoring Data During Different Scenarios

1. Restore Data from an Elasticsearch Snapshot

  • Only the Elasticsearch data is restored, without impacting the droplets themselves.

  • This approach is useful for recovering data from logical issues or accidental deletion.

2. Restore Droplet

  • Restores the entire virtual machine, including Elasticsearch configuration, data, and other services on the droplet.

  • Always restore Elasticsearch snapshots before starting the service on the restored droplet to avoid inconsistencies. However, if the cluster detects that the restored node’s data is outdated, it will re-synchronize the latest data from the other nodes to maintain consistency.

  • Suitable for recovering from full droplet corruption.

3. Create a New Droplet and Add it to the Existing Cluster

  • A new node can be added by copying the docker-compose.yml file from an existing node and updating the .env file.

  • The new node will sync data and settings automatically from the cluster.

  • Useful for scaling the cluster or replacing a permanently failed droplet.
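
The variable names below are illustrative and must match the existing docker-compose.yml; as a minimal sketch, the new node usually only needs a unique node name, the shared cluster name, and the addresses of the existing nodes (cluster.initial_master_nodes should not be set when joining an existing cluster):

    # Excerpt from the Elasticsearch service environment (names and IPs are placeholders)
    environment:
      - node.name=es-node-4
      - cluster.name=es-cluster
      - discovery.seed_hosts=10.0.0.1,10.0.0.2,10.0.0.3
      - network.publish_host=10.0.0.4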
