Managing Elasticsearch Cluster Restart Time

While building a fairly large index (8TB total for 500 million docs), I ran into some very long restart times for the cluster. That prompted me to start a discussion about it on the ES user group. There’s some good material in that thread, and I wanted to write a post summarizing what we are doing to deal with long restart times.

By “long restart times”, I don’t mean that Elasticsearch itself was slow to start up, but rather that it spent a very long time recovering shards. In my logs I would see messages such as:

recovered_files [399] with total_size of [42.2gb], took [12.4m], throttling_wait [0s]
         : reusing_files   [0] with total_size of [0b]

All of the data for a 42 GB shard was being recovered from one of the peer nodes rather than from the local disk.
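If you want to track how much of each shard is being copied rather than reused, a log line like the one above can be parsed with a few lines of Python. This is a minimal sketch: the function names are my own, it assumes the message format shown above, and it assumes the size suffixes are binary multiples (1gb = 1024^3 bytes). The pattern ignores whatever sits between the two groups, so it works whether your syslog keeps the newline or escapes it as `#012`.

```python
import re

# Matches "<label> [<count>] with total_size of [<size><unit>]" for both file groups.
_GROUP = re.compile(
    r"(recovered_files|reusing_files)\s+\[(\d+)\] with total_size of "
    r"\[([\d.]+)(b|kb|mb|gb|tb)\]"
)
# Assumption: sizes are reported in binary multiples.
_UNITS = {"b": 1, "kb": 1024, "mb": 1024**2, "gb": 1024**3, "tb": 1024**4}


def parse_recovery_line(line):
    """Return {label: (file_count, size_in_bytes)} for each group found."""
    result = {}
    for label, count, value, unit in _GROUP.findall(line):
        result[label] = (int(count), int(float(value) * _UNITS[unit]))
    return result


def copied_fraction(stats):
    """Fraction of the shard's bytes copied from a peer instead of reused locally."""
    recovered = stats["recovered_files"][1]
    reused = stats["reusing_files"][1]
    total = recovered + reused
    return recovered / total if total else 0.0


line = (
    "recovered_files [399] with total_size of [42.2gb], took [12.4m], "
    "throttling_wait [0s]\n         : reusing_files   [0] with total_size of [0b]"
)
stats = parse_recovery_line(line)
print(stats["recovered_files"][0], stats["reusing_files"][0], copied_fraction(stats))
```

A copied fraction close to 1.0 on most shards is the symptom described in this post: almost nothing on the local disk could be reused.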

In that ES user group thread, Zachary Tong has a good example and description of why Elasticsearch nodes can have such long restart times. The key point is:

The segment creation and merging process is not deterministic between nodes.

This means that as indexing occurs the segments for the same shard on different nodes will necessarily diverge from each other. When shard recovery occurs (such as during a restart) the segments that are different will need to be copied over the network rather than recovered from the disk.
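To make the consequence concrete, here is a deliberately simplified model of that decision. It is not Elasticsearch's actual recovery code: I am assuming a file can be reused only when the copy on the recovering node has the same name, length, and checksum as the one on the primary, and the file listings are made up for illustration.

```python
def plan_recovery(primary_files, replica_files):
    """Split the primary's files into those the replica can reuse and those it must copy.

    Both arguments map file name -> (length_in_bytes, checksum).
    """
    reuse, copy = [], []
    for name, meta in sorted(primary_files.items()):
        if replica_files.get(name) == meta:
            reuse.append(name)   # identical file already on local disk
        else:
            copy.append(name)    # missing or different: goes over the network
    return reuse, copy


# Hypothetical listings: one large old segment in sync, newer segments diverged.
primary = {
    "_a.cfs": (5_000_000_000, "c1"),
    "_k.cfs": (40_000_000, "c7"),
    "_m.cfs": (2_000_000, "d2"),
}
replica = {
    "_a.cfs": (5_000_000_000, "c1"),
    "_j.cfs": (41_000_000, "b9"),
}

reuse, copy = plan_recovery(primary, replica)
bytes_to_copy = sum(primary[name][0] for name in copy)
print(reuse, copy, bytes_to_copy)
```

In this toy case only the small, recently written segments need copying. If the large segment had diverged too, nearly the whole shard would have to cross the network, which is what the log line above shows.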

We’ve put a few practices in place to minimize the impact these slow restarts have on us.

First, some background on our system:

  • 500 million documents in 175 shards (about 8TB including replication)
  • 1.5 million new docs a day.
  • Heavy reindexing/updating of newer documents running 24/7, but 99% of older documents never change.
  • Bulk indexing is only ever performed when adding new fields to the index or adding new features.
  • With our recent launch of Related Posts we peak at about a million queries an hour. (More on scaling in a future post.)

Our current methods of minimizing cluster restart times:

1. After bulk indexing we perform the following:
  • Optimize all indices and set the max segments to 5.
  • Perform a rolling restart of the cluster (last one took 38 hours to complete)

By optimizing the index into a smaller number of segments we significantly decrease query time for the older documents. Also, since most of our data never changes we ensure that most of our data will be in large segments that should stay in sync across the cluster. They are less likely to be merged because they are already big.
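For reference, the optimize call can be sketched as below. The `_optimize` endpoint and its `max_num_segments` parameter are what Elasticsearch releases from around the time of this post expose; later releases renamed the endpoint to `_forcemerge`, so check the docs for your version. The host, the helper name, and the `posts` index name in the usage note are placeholders. The snippet only builds the request so it can be inspected before anything is sent.

```python
from urllib.parse import urlencode
from urllib.request import Request


def optimize_request(base_url="http://localhost:9200", index=None, max_num_segments=5):
    """Build (but do not send) a request asking ES to merge down to N segments per shard."""
    path = f"/{index}/_optimize" if index else "/_optimize"
    query = urlencode({"max_num_segments": max_num_segments})
    return Request(f"{base_url}{path}?{query}", method="POST")


req = optimize_request()
print(req.get_method(), req.full_url)
```

Passing the request to `urllib.request.urlopen(req)` would actually run it. Use `optimize_request(index="posts")` to limit it to a single index.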

The rolling restart of the cluster after bulk indexing ensures that all nodes have identical segments. By incurring the cost of restarting just after bulk indexing, we ensure that if a real issue comes up later that requires restarting then we will be able to restart more quickly.

Our current rolling restart only restarts a single node at a time. Because we are using shard allocation awareness we could increase the number of nodes we restart at once if we want to reduce the total time to restart the cluster, but that would also reduce our capacity for servicing incoming queries.

2. When doing a rolling restart, disable allocation
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation": true}}'

This ensures that shards will not thrash around the cluster as the nodes are restarted. Otherwise, when we shut down a node, the cluster will try to allocate the shards from that node onto other nodes. Once the node is back up, you can set this value back to false.
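Put together, the per-node sequence looks roughly like the sketch below. The setting name comes from the curl command above. The final "wait for green" step uses the cluster health API's `wait_for_status` parameter, and it is my assumption that you want to wait for green before moving on to the next node. The helper names, the node name, and the 10 minute timeout are placeholders, and the restart itself is left to whatever tooling you use.

```python
import json
from urllib.request import Request

BASE_URL = "http://localhost:9200"  # placeholder host


def allocation_request(disabled):
    """Build the cluster settings request that turns shard allocation off or on."""
    body = {"transient": {"cluster.routing.allocation.disable_allocation": disabled}}
    return Request(
        f"{BASE_URL}/_cluster/settings",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


def restart_plan(node):
    """Ordered (description, request) steps for one node; None means a manual step."""
    health = Request(f"{BASE_URL}/_cluster/health?wait_for_status=green&timeout=10m")
    return [
        ("disable allocation", allocation_request(True)),
        (f"restart Elasticsearch on {node}", None),
        ("enable allocation", allocation_request(False)),
        ("wait for green", health),
    ]


for description, request in restart_plan("node-1"):
    print(description, "->", request.full_url if request else "(manual step)")
```

Each non-manual step can be executed with `urllib.request.urlopen(request)`. Nothing is sent by the snippet as written.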

3. Use noop Linux scheduling on SSDs rather than CFQ for a significant speed up

When we tested this change, we saw the node restart time (from shutting the node down to all shards being recovered) drop from an average of 150 seconds/node to 96 seconds/node. This was for the case where there was very little difference between the shards on different nodes. When you are doing a rolling restart of 30 nodes, that’s a really big difference. Props to GitHub for investigating the performance impact of the scheduler.
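The arithmetic behind "a really big difference" is simple enough to write down. The sketch below uses the numbers from the paragraph above. The `nodes_at_once` parameter is my own addition for experimenting with restarting several nodes per batch, and it assumes (optimistically) that a batch takes as long as a single node.

```python
def rolling_restart_seconds(nodes, seconds_per_node, nodes_at_once=1):
    """Total wall-clock time for a rolling restart done in fixed-size batches."""
    batches = -(-nodes // nodes_at_once)  # ceiling division
    return batches * seconds_per_node


cfq = rolling_restart_seconds(30, 150)   # CFQ scheduler
noop = rolling_restart_seconds(30, 96)   # noop scheduler
print(cfq / 60, noop / 60, (cfq - noop) / 60)  # minutes: 75.0 48.0 27.0
```

So for 30 nodes restarted one at a time, the scheduler change takes the best-case rolling restart from 75 minutes down to 48.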

4. Increase the default recovery limits
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 15
indices.recovery.max_bytes_per_sec: 100mb
indices.recovery.concurrent_streams: 5

We’ve tried increasing max_bytes_per_sec above 100mb, but then the recovery traffic starts interfering with the query traffic. You will of course get different results depending on what hardware you are using. In general the ES defaults are set for Amazon EC2, so you can increase your limits a lot if you have your own hardware.
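A quick back-of-the-envelope check helps when picking a value for max_bytes_per_sec. The sketch below takes the 42.2gb shard from the log line earlier in the post and computes the fastest it could possibly be copied under a 100mb limit, along with the rate that the logged 12.4 minute recovery actually achieved. It assumes binary units and ignores concurrent recoveries and any other contention, so treat the results as rough bounds only.

```python
def min_copy_seconds(shard_gb, max_mb_per_sec):
    """Lower bound on the time to copy a whole shard at the throttle limit."""
    return shard_gb * 1024 / max_mb_per_sec


def observed_mb_per_sec(shard_gb, minutes):
    """Average transfer rate implied by a logged recovery."""
    return shard_gb * 1024 / (minutes * 60)


print(round(min_copy_seconds(42.2, 100)))      # seconds, about 7.2 minutes
print(round(observed_mb_per_sec(42.2, 12.4)))  # MB/s for the logged recovery
```

Even at the full 100mb/s, a shard that has completely diverged costs minutes rather than seconds, which is why keeping segments in sync matters more than the throttle setting.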

5. Periodic rolling restarts?

One thing I am considering is periodically doing a rolling restart of the cluster, every few months or so. The only real reason to do this is that it will help me recover faster if I really have to do a restart due to some cluster or hardware failure. Though with the rate at which new ES releases occur, we’ll probably have a reason to perform such a restart periodically anyway. Not to mention the possibility of bulk reindexing in order to add new features.

I am curious how our restart time will change over time. I would theorize that since most of our data doesn’t change, that data will slowly get accumulated in the older, larger segments while the newer posts will be in the smaller, newer segments. For the most part it will be these newer segments that need to get recovered from the primary shard.

11 thoughts on “Managing Elasticsearch Cluster Restart Time”

  1. Hi Greg

    Nice article to help people with the problems of ES restarts. We are trying to use ES in a setup where it mostly acts as a read-only source. Writes/new documents are added at specific times. So in our use case I wanted to understand whether there is a need to change the max_segments of an index after the writes are done. My guess is that a restart of our cluster wouldn’t cause large delays, because most of the segments should be the same (unless there is a hardware problem or connectivity issue during our writes).

    Appreciate your response.



    • It’s not strictly necessary to change max_segments. It can speed up the index, however. We’ve actually moved away from running optimize because we found that we had segments that were too large. In some cases we were hitting the maximum segment size, which increases the number of deleted documents in the index and negatively affects performance.


      • Greg,

        What is the maximum segment size? Is that a lucene level issue? And when you say that documents are deleted, you don’t mean that data is being lost do you?


      • Ya max segment size is a Lucene issue. It can be adjusted I think, but from what I understand, doing so is generally not a good idea due to the performance of very large files.

        By deleted documents I mean the ones that we have deleted from the index. Even updating a document will delete the old document. Deleted docs are only marked as deleted, and then later cleaned up when segments get merged.


  2. Greg,

    Another question I have is about the parameter discovery.zen.minimum_master_nodes and how restarting a node interacts with it. If this condition is not met during a node restart, what could the outcome be?



  3. Greg,

    Not directly related to the topic, but: how many nodes on average do you have in the cluster for “500 million documents in 175 shards (about 8TB including replication)”? How powerful are those machines?



    • This depends a lot on your query load and real time indexing load. I described a lot of how our system was constructed in this overview.


  4. Thanks for a wonderful set of articles. Do the restarts help defragment indices from all the deleted documents?


    • Unfortunately no. Deleted documents are only removed when segments within a shard are merged together. There is pretty much always some overhead of deleted documents in your index.

