Managing Elasticsearch Cluster Restart Time

While building a fairly large index (8TB total for 500 million docs), I ran into some very long restart times for the cluster. That prompted me to start a discussion about long restart times. There’s some good discussion in that thread, and I wanted to write a post to summarize what we are doing to deal with long restart times.

By “long restart times”, I don’t mean that Elasticsearch didn’t start up quickly, but rather it spent a very long time recovering shards. In my logs I would see messages such as:

recovered_files [399] with total_size of [42.2gb], took [12.4m], throttling_wait [0s]#012         : reusing_files   [0] with total_size of [0b]

All of the data for a 42 GB shard was being recovered from one of the peer nodes rather than from the local disk.

In that ES user group thread, Zachary Tong has a good example and description of why Elasticsearch nodes can have such long restart times. The key point is:

The segment creation and merging process is not deterministic between nodes.

This means that as indexing occurs the segments for the same shard on different nodes will necessarily diverge from each other. When shard recovery occurs (such as during a restart) the segments that are different will need to be copied over the network rather than recovered from the disk.

We’ve put in place a couple of practices to try and minimize how much impact these slow restarts have on us.

First, some background on our system:

  • 500 million documents in 175 shards (about 8TB including replication)
  • 1.5 million new docs a day.
  • Heavy reindexing/updating of newer documents running 24/7, but 99% of older documents never change.
  • Bulk indexing is only ever performed when adding new fields to the index or adding new features.
  • With our recent launch of Related Posts we peak at about a million queries an hour. (More on scaling in a future post.)

Our current methods of minimizing cluster restart times:

1. After bulk indexing we perform the following:
  • Optimize all indices and set the max segments to 5.
  • Perform a rolling restart of the cluster (last one took 38 hours to complete)

By optimizing the index into a smaller number of segments we significantly decrease query time for the older documents. Also, since most of our data never changes we ensure that most of our data will be in large segments that should stay in sync across the cluster. They are less likely to be merged because they are already big.

The rolling restart of the cluster after bulk indexing ensures that all nodes have identical segments. By incurring the cost of restarting just after bulk indexing, we ensure that if a real issue comes up later that requires restarting then we will be able to restart more quickly.

Our current rolling restart only restarts a single node at a time. Because we are using shard allocation awareness we could increase the number of nodes we restart at once if we want to reduce the total time to restart the cluster, but that would also reduce our capacity for servicing incoming queries.

2. When doing a rolling restart, disable allocation
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation": true}}'

This ensures that there will not be a lot of thrashing of shards around the cluster as the nodes are restarted. Otherwise when we shutdown a node the cluster will try and try and allocate the shards on that node onto other nodes. Once the node is back up, you can then set this value to false.

3. Use noop Linux scheduling  on SSDs rather then CFQ for a significant speed up

When we tested making this update we saw the node restart time (from shutting the node down to all shards being recovered) drop from an average of 150 seconds/node to 96 seconds/node. This was for the case where there was very little difference between the shards on different nodes. When you are doing a rolling restart of 30 nodes, that’s a really big difference. Props to Github for investigating the performance impact of the scheduler.

4. Increase the default recovery limits
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 15
indices.recovery.max_bytes_per_sec: 100mb
indices.recovery.concurrent_streams: 5

We’ve tried increasing the max_bytes_per_sec above 100mb, but that runs us into cases where the network traffic starts interfering with the query traffic. You will of course get different results depending on what hardware you are using. In general the ES defaults are set for Amazon EC2, so you can increase your limits a lot if you have your own hardware.

5. Periodic rolling restarts?

One thing I am considering is periodically doing a rolling restart of the cluster. Every few months or so. The only real reason to do this is that it will help me recover faster if I really have to do a restart due to some cluster or hardware failure. Though with the rate that new ES releases occur we’ll probably have a reason to perform such a restart periodically anyways. Not to mention the possibility of bulk reindexing in order to add new features.

I am curious how our restart time will change over time. I would theorize that since most of our data doesn’t change, that data will slowly get accumulated in the older, larger segments while the newer posts will be in the smaller, newer segments. For the most part it will be these newer segments that need to get recovered from the primary shard.

Building Word Clouds with Faceted Search

Elasticsearch’s faceted results are a great way to analyze the contents of a set of documents. For over a year now, Polldaddy has used Elasticsearch to create reports for the most popular answers and words given to free text survey responses. For more details take a look at the feature announcement.

However, running faceted search on such a wide array of user data can be difficult. Faceted Search in Elasticsearch can consume a lot of memory which leads to the suggestion in the ES documentation to “make sure the number of unique tokens a field can have is not large”. To make sure that we can accept any arbitrary user input we use a couple of tricks.

First let’s take a look at the mapping we use for documents in the polldaddy-survey index:

  "polldaddy-survey" : {
    "freetext" : {
      "_routing" : {
        "required" : true,
        "path" : "survey_id"
      "properties" : {
        "resp_id" : {
          "type" : "long",
        "survey_id" : {
          "type" : "long",
        "text" : {
          "type" : "string",
          "analyzer" : "analyzer_polldaddy",
        "text_string" : {
          "type" : "string",
          "index" : "not_analyzed",

Each user’s response is its own document where the response is stored in its analyzed form in the text field and unanalyzed in the text_string field. (I should have come up with better names.) By doing a terms facet query on these two fields we can get the most popular words and most popular answers respectively.

However, blindly doing this across the many millions of responses in Polldaddy would run into some serious memory problems due to the overall size of the vocabularies in those fields. For that reason we are using _routing to make sure that all documents related to a single survey go to the same shard. We then allocate a very large number of shards (100 in our case) to limit the number of unique terms in each shard. By routing our query only to one shard the amount of memory that needs to be allocated is greatly reduced, and we can even handle surveys with decent-sized vocabularies.

So just how important is routing to a single shard, well a bug snuck into our code at one point and disabled the routing. Here’s what happened to the cache memory consumption from when it was broken to when it got fixed.

Boom! Fixed a bug.

A pretty dramatic change. Without the routing to a single shard the cache would occasionally try to load a very large vocabulary and allocate 10+ GB. This of course would slow down all queries on the server.

But it could have been worse. Commenting on a previous post on this site Bruno asked me why I suggested setting index.cache.field.type: soft given that it reduces the caching performance. This memory consumption activity is why. Before setting the field cache to soft (and before adding the routing) these queries would sometimes consume so much memory that we would run out of the 24GB we had allocated on the servers. In fact it seemed like no matter how much memory we gave the servers, they would use it and cause OutOfMemory errors that would bring the cluster to a painful halt. Setting the field cache to soft is really the only way I can ensure that we won’t hit those conditions regardless of what data gets entered into a poll.

I will relish the day when there’s a good bug fix in ES for the term facet memory consumption (Issue #1531). There’s so many great applications that can be built on top of faceted searches, I’d just love to not have to worry about running out of memory because of a stray query.

A big thanks to Shay Bannon (lead ES developer) who originally suggested that I use routing and a large number of shards. And of course many thanks to my colleagues on the Polldaddy team.