How we Migrated Elasticsearch from 2.3 to 6.8

Sat, May 8, 2021 6-minute read

Expectations

We set our expectations to avoid data loss, avoid bugs as far as possible, and migrate everything while our users keep working as usual.

  1. 0% Downtime
  2. 100% Rollback Safe
  3. 100% Data Corruption safe
  4. Bug Free

Zero Downtime

Or almost zero, right? This ES version is old, but it is still the core of our system. We don't want to stop the system or make our users wait while we run this migration, so we need zero downtime, or as close to it as we can get.

Rollback safe

We should be able to roll back everything, even after the migration process has finished.

Data Corruption safe

We MUST NOT lose any data, because our core system runs on this old version.

Bug free

As few bugs as possible. Everybody knows that any change brings bugs, so we MUST test this change as much as we can. If a bug occurs, we should be able to fix it, and if it is not fixable, we fall back on our “Rollback safe” goal, right?

What we need

Temporary ES 2.

This ES serves as an intermediary between the old ES 2.3 (currently running) and the new ES 6.8. We will restore our data onto this server and use it to “sync” the data into the new ES 6.8 without losing data or capacity, while keeping both Elasticsearch clusters running with ease.

Final ES 6.

The final data store, which we plan to keep using in production to store our project data.

Step-by-Step of migration

In short, the step-by-step would be:

  • Upgrade project libraries to use Ruflin/Elastica 6.*
  • Upgrade the project to support both the new and the old deployment process
  • Upgrade ES Mapping to be compatible with version 6.
  • Create script to migrate data from ES 2.3 to ES 6.
  • Make the ES 6.8 cluster available (production ready, including monitoring and logs)
  • Implement Double Writing to ES 2.3 & ES 6.
  • 2-day delay: wait for writes to accumulate in the new cluster.
  • Generate a snapshot of the production data for all indices on the old ES 2.3 cluster.
  • This can also be done by copying the data directly from the filesystem
  • Restore this snapshot on the temporary ES 2. cluster
  • At this point we can start the reindex process to copy all the data from ES 2.
  • Start the integrity check to verify every document once the reindex has completed
  • Change read access to use version 6.8 and keep writing to both
  • Any bugs? Yes: fix it. No: We are done.
  • Remove the Second Writing call from the project
  • Stop the old ES 2.3 and remove it from Supervisor as well

Below you will find more detailed information.

Updates needed in the code

We need to make some updates in the project code to support the new version of ES.

  • Upgrade project libraries to use Ruflin/Elastica 6.*
  • Upgrade the project to support both the new and the old deployment process
  • Upgrade ES Mapping to be compatible with version 6.

Preparation to move data

The most important part here is to have a production-ready ES 6.8 cluster (including monitoring and logs). After that we can proceed to create the scripts.

At this point we need to create a script that migrates our ES data to the newer version. This script MUST store the history of what happened and MUST be safe to run at any time. It keeps the document versions from the old ES 2.3; the new 6.8 cluster MUST NOT control versioning at this stage of the migration.

We will build this script on top of the Reindex API.

Remote Reindex call example:

Endpoint of reindex API

# This is a call to the Reindex API on the new cluster.
# Note: the remote host must be listed under reindex.remote.whitelist
# in elasticsearch.yml on the new cluster.
curl -X POST \
  http://NEW_ES_VERSION:9200/_reindex \
  -H 'Authorization: Basic YWRtaW46YWRtaW4=' \
  -H 'Content-Type: application/json' \
  -H 'cache-control: no-cache' \
  -d '{
  "conflicts": "proceed",
  "source": {
    "remote": {
      "host": "http://OLD_ES_VERSION:9500"
    },
    "index": "index-*"
  },
  "dest": {
    "index": "index-abc",
    "version_type": "external",
    "op_type": "create"
  }
}'
# "op_type": "create" = only create documents that are missing; never update existing ones.
# "conflicts": "proceed" = on a version conflict, count it and continue with the remaining documents.
# "version_type": "external" = keep the document version from the old ES instead of changing it.

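Example response (here every document already existed in the destination index, so all 10,000 were counted as version conflicts and none were created):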


{
    "took": 1842,
    "timed_out": false,
    "total": 10000,
    "updated": 0,
    "created": 0,
    "deleted": 0,
    "batches": 10,
    "version_conflicts": 10000,
    "noops": 0,
    "retries": {
        "bulk": 0,
        "search": 0
    },
    "throttled_millis": 0,
    "requests_per_second": -1,
    "throttled_until_millis": 0,
    "failures": []
}
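
The two requirements for the script (keep a history, be safe to run at any time) could be sketched roughly as follows. This is a minimal Python sketch under assumptions, not the script we actually ran: `build_reindex_body`, `record_run`, and the history file are hypothetical names, and the HTTP call itself is left out so the example stays self-contained.

```python
import json
import time


def build_reindex_body(remote_host, source_index, dest_index):
    """Build the same _reindex request body as the curl call above."""
    return {
        "conflicts": "proceed",
        "source": {"remote": {"host": remote_host}, "index": source_index},
        "dest": {
            "index": dest_index,
            "version_type": "external",
            "op_type": "create",
        },
    }


def record_run(history_path, body, response):
    """Append one JSON line per run so the history of what happened is kept."""
    entry = {
        "ran_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "source": body["source"]["index"],
        "dest": body["dest"]["index"],
        "created": response.get("created", 0),
        "version_conflicts": response.get("version_conflicts", 0),
        "failures": len(response.get("failures", [])),
    }
    with open(history_path, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

Because the body uses "op_type": "create" together with "conflicts": "proceed", running the script a second time only reports version conflicts for documents that already exist, which is what makes a re-run safe.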

Implement Dual Writing to ES 2.3 & ES 6.

At this point the ES 6.8 cluster is running and waiting for connections, and the current ES 2.3 is still running. We now implement double writing in our project to keep both clusters in sync for every document change from this point onward.

So that means we will have ES 2.3 serving reads and writes, and ES 6.8 receiving writes only.

We are going to keep this double writing until a few days AFTER the migration is complete. Double writing keeps our data in sync between both clusters, so we have a restore point whenever we need one.
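
The double writing described above could look roughly like this. It is a minimal Python sketch, not our project code (which uses Elastica): `DualWriter` and `InMemoryStore` are hypothetical names, and the choice to log a failed second write instead of failing the request is an assumption.

```python
import logging

logger = logging.getLogger("dual_write")


class DualWriter:
    """Write every change to the primary store, then mirror it to the secondary."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary
        self.failed_mirror_ids = []

    def index(self, doc_id, document):
        # The primary write must succeed; its errors propagate to the caller.
        self.primary.index(doc_id, document)
        try:
            self.secondary.index(doc_id, document)
        except Exception:
            # Assumption: a failed mirror write should not break the user request,
            # but it is recorded so the document can be re-synced later.
            logger.exception("mirror write failed for %s", doc_id)
            self.failed_mirror_ids.append(doc_id)


class InMemoryStore:
    """Stand-in for a real Elasticsearch client, used only for illustration."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, document):
        self.docs[doc_id] = document
```

With the old cluster as `primary` and the new one as `secondary`, every write lands in both, and the list of failed mirror writes tells us which documents need another sync.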

The day after double writing starts, we create a snapshot of the current ES cluster (this may take some time). As soon as the snapshot finishes we can move on to the next step: reindexing all the data.

Reindexing whole data

During this process we keep the current ES running in Read & Write mode while we restore the snapshot onto our Temporary ES 2. cluster (this may take some time).
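
For reference, the snapshot and restore steps map onto the Elasticsearch snapshot API roughly as below. This is only a sketch that lists the calls: the repository name, snapshot name, and location are hypothetical, and it assumes a shared filesystem ("fs") repository that both clusters can reach and that is allowed via path.repo in elasticsearch.yml.

```python
def snapshot_requests(repo, snapshot, location):
    """Return the (method, path, body) calls for snapshot and restore."""
    # Register a shared filesystem repository (needed on both clusters).
    register = (
        "PUT",
        f"/_snapshot/{repo}",
        {"type": "fs", "settings": {"location": location}},
    )
    # Take the snapshot on the old cluster and wait until it is done.
    create = ("PUT", f"/_snapshot/{repo}/{snapshot}?wait_for_completion=true", None)
    # Restore that snapshot on the temporary cluster.
    restore = ("POST", f"/_snapshot/{repo}/{snapshot}/_restore", None)
    return {
        "old_cluster": [register, create],
        "temporary_cluster": [register, restore],
    }
```

Calling `snapshot_requests("migration", "snapshot_1", "/mnt/backups")` returns the requests in the order they would be sent to each cluster.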

After the restoration is complete we reindex all the data from ES 2.3 (Temp) to ES 6.8 (New) using the script defined in “Preparation to move data” (this may take some time).

Check restoration results

The result check was designed to use the “Scroll API” from Elasticsearch to go through every document in the cluster, index by index, always searching in the temporary cluster. For each response from the Scroll API we compare the documents one by one against the new cluster.

We should check these results strictly, so that no document is skipped and no data goes missing.
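
The comparison itself could be sketched like this. It is a minimal Python sketch under assumptions: `integrity_check` and `fetch_from_new` are hypothetical names, the scroll pages from the temporary cluster are passed in as plain lists, and the lookup in ES 6.8 is an injected function so the example runs without a cluster.

```python
def integrity_check(pages, fetch_from_new):
    """Compare every document from the temporary cluster with the new cluster.

    pages: iterable of scroll pages, each a list of (doc_id, source) pairs
           read from the temporary ES 2 cluster.
    fetch_from_new: callable taking a list of ids and returning
           {doc_id: source} for the documents found in ES 6.8.
    """
    report = {"checked": 0, "missing": [], "different": []}
    for page in pages:
        found = fetch_from_new([doc_id for doc_id, _ in page])
        for doc_id, source in page:
            report["checked"] += 1
            if doc_id not in found:
                report["missing"].append(doc_id)
            elif found[doc_id] != source:
                report["different"].append(doc_id)
    return report
```

A document listed under "different" is not necessarily corrupted: because double writing is active, a document that changed after the snapshot was taken will legitimately differ from the snapshot copy, so those ids would need a second look rather than an automatic fix.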

Making ES 6.8 the current cluster

At this point the current ES 2.3 is running as READ & WRITE, ES 6.8 as WRITE, and we have an ES 2.3 on stand-by.

We now switch READ & WRITE from the current ES 2.3 to ES 6.8; from this moment on, ES 6.8 is the current cluster.
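
One way to picture the switch is a single setting that decides which cluster answers reads, while writes keep going to both (as in the step list above). This is a minimal sketch, not our actual implementation: `ClusterRouter` is a hypothetical name and plain dicts stand in for the two Elasticsearch clients.

```python
class ClusterRouter:
    """Route reads to one cluster while writes keep going to both."""

    def __init__(self, old, new, read_from="old"):
        self.clusters = {"old": old, "new": new}
        self.read_from = read_from

    def switch_reads(self, target):
        if target not in self.clusters:
            raise ValueError(f"unknown cluster: {target}")
        self.read_from = target

    def get(self, doc_id):
        return self.clusters[self.read_from].get(doc_id)

    def index(self, doc_id, document):
        # Writes still go to both clusters, so switching back stays possible.
        for cluster in self.clusters.values():
            cluster[doc_id] = document
```

Calling `switch_reads("new")` makes ES 6.8 the cluster that serves reads, and `switch_reads("old")` is the rollback, which only works because both clusters kept receiving every write.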

Checking the changes after the switch

Now is the time to keep watching the available monitoring and all relevant logs.

Any bugs? Yes — fix it. No — We are done.

Cleaning up servers

  1. Remove the Second Writing call from the project.
  2. Back up the data
  3. Stop ES 2.3 - Don’t kill it yet.