Scaling Elasticsearch Part 2: Indexing

In part 1 I gave an overview of our cluster configuration. In this part we’ll dig into:

  • How our data is partitioned into indices to scale over time
  • Optimizing bulk indexing
  • Scaling real time indexing
  • How we manage indexing failures and downtime

The details of our document mappings are mostly irrelevant for our indexing scaling discussion, so we’ll skip them until part 3.

Data Partitioning

Since data is constantly growing we need an indexing structure that can grow over time as well. A well-known limitation of ES is that once an index is created you cannot change the number of shards. The common solution to this problem is to recognize that searching across an index with 10 shards is identical to searching across 10 indices with 1 shard each, and indices can be created at will.

In our case we create one index for every 10 million blogs, with 25 shards per index. We use index templates so that when our system tries to index into a non-existent index, it is created dynamically.
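As a sketch, the client-side index selection might look like the following. The `posts-10m-20m` naming and the bucket size are illustrative assumptions, not our exact implementation:

```python
def index_for_blog(blog_id: int, blogs_per_index: int = 10_000_000) -> str:
    """Map a blog ID to its bucket index, e.g. blog 12,345,678 -> 'posts-10m-20m'.

    Because an index template matches these names, the index is created
    automatically the first time anything is written to it.
    """
    bucket = blog_id // blogs_per_index
    low, high = bucket * 10, (bucket + 1) * 10
    return f"posts-{low}m-{high}m"
```

The client only needs this one function; no coordination step is required to provision a new index ahead of time.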

There are a few factors that led to our index and sharding sizes:

  1. Uniform shard sizes: Shards should be of similar sizes so that you get mostly uniform response times, since larger shards take longer to query. We tried one index per 1 million blogs and found too much variation. For instance, when we migrated Microsoft’s Live Spaces blogs over, a million or so blogs were created in our DBs in a row, and they have remained pretty active. This variation drove us to put many blogs into each index. We rely on the hashing algorithm to spread the blogs across all the shards in the index.
  2. Limit the number of shards per index: New shards are not instantly created when a new index is created. Technically, I guess, the cluster state is red for a very short period. At one point we tested 200 shards per index. In those cases we sometimes saw a few document indexing failures in our real time indexing because the primary shards were still being allocated across the cluster. There are probably other ways around this, but it’s something to look out for.
  3. Upper limit of shard size: Early on we tried indexing 10 million blogs per index with only 5 shards per index. Our initial testing went well, but then we found that the indices with the larger shards (the older blogs) were experiencing much longer query latencies. This was us starting to hit the upper limits of our shard sizes. The upper limit on shard size varies by what kind of data you have and is difficult to predict so it’s not surprising that we hit it.
  4. Minimize total number of shards: We’ll discuss this further in our next post on global queries, but as the number of shards increases the efficiency of the search decreases, so reducing the number of shards helps make global queries faster.

Like all fun engineering problems, there is no easy or obvious answer and the solution comes by guessing, testing, and eventually deciding that things are good enough. In our case we figured out that our maximum shard size was around 30 GB. We then created shards that were fairly large but which we don’t think would be able to grow to that maximum for many years.

As I’m writing this, and after a few months in production, we’re actually wondering if our shards are still too large. We didn’t take into account that deleted documents would also negatively affect shard size, and every time we reindex or update a document we effectively delete the old version. Investigation into this is still ongoing, so I’m not going to try to go into the details. The number of deleted documents in your shards is related to how much real time indexing you are doing, and the merge policy settings.
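To illustrate the merge-policy knob mentioned above, merge behavior is controlled by per-index settings. This is a hypothetical sketch; the names follow the TieredMergePolicy settings of ES versions from this era, so check the docs for the version you run:

```python
# Hypothetical per-index settings sketch: segments above the cap stop being
# merge candidates, which is how delete-heavy shards end up carrying deleted
# documents around; a higher reclaim weight biases merges toward reclaiming them.
merge_settings = {
    "index.merge.policy.max_merged_segment": "5gb",    # default cap on merged segment size
    "index.merge.policy.reclaim_deletes_weight": 2.0,  # favor merging segments with many deletes
}
```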

Bulk Indexing Practicalities

Bulk indexing speed is a major limit in how quickly we can iterate during development, and indexing will probably be one of our limiting factors in launching new features in the future. Faster bulk indexing means faster iteration time, more testing of different shard/index configurations, and more testing of query scaling.

For these reasons we ended up putting a lot of effort into speeding up our bulk indexing. We went from bulk indexing taking about two months (estimated; we never actually ran it) to taking less than a week. Improving bulk indexing speed is very iterative and application specific. There are some obvious things to pay attention to, like using the bulk indexing API and testing different numbers of docs in each bulk API request. There were also a few things that surprised me:

  • Infrastructure Load: ES Indexing puts a heavy load on certain parts of the infrastructure because it pulls data from so many places. In the end, our indexing bottleneck is not ES itself, but actually other pieces of our infrastructure. I suppose we could throw more infrastructure at the problem, but that’s a trade off between how often you are going to bulk reindex vs how much that infrastructure will cost.
  • Extreme Corner Cases: For instance, some blogs have millions of followers and likers, and lots of commenters. Building a list of these (and keeping it up to date with real time indexing) can be very costly – like, “oh $%#@ are we really trying to index a list of 2.3 million ids” costly (and then update it every few seconds).
  • Selective Bulk Indexing: Adding fields to the index requires updating the mappings and bulk reindexing all of our data. Having a way to selectively bulk index (find all blogs of a particular type) can speed up bulk indexing a lot.
  • Cluster Restarts: After bulk indexing we need to do a full rolling restart of the cluster.
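The bulk API batching mentioned above can be sketched as a generator that packs documents into newline-delimited request bodies. The chunk size and index name here are placeholders to tune per cluster, and older ES versions also expect a `_type` field in each action line:

```python
import json

def bulk_payloads(docs, index, chunk_size=500):
    """Yield newline-delimited bulk API bodies, chunk_size docs per request.

    `docs` is an iterable of (doc_id, source_dict) pairs. Each doc contributes
    two lines: an action line and the document source itself.
    """
    actions = []
    for doc_id, source in docs:
        actions.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        actions.append(json.dumps(source))
        if len(actions) >= 2 * chunk_size:
            yield "\n".join(actions) + "\n"
            actions = []
    if actions:  # flush the final partial batch
        yield "\n".join(actions) + "\n"
```

Each yielded string is one HTTP request body for `POST /_bulk`; testing different `chunk_size` values against your own documents is where most of the tuning time goes.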

I wish we had spent more time finding and implementing bulk indexing optimizations earlier in the project.

Scaling Real Time Indexing

During normal operation, our rate of indexing (20m+ document changes a day) has never really been a problem for our Elasticsearch cluster. Our real time indexing problems have mostly stemmed from combining so many pieces of information into each document that gathering the data can be a high load on our database tables.

Creating the correct triggers to detect when a user has changed a field is often non-trivial to implement in a way that won’t over-index documents. There are posts that get new comments or likes every second; rebuilding the entire document in those cases is a complete waste of time. We make heavy use of the Update API, mostly to reduce the load of recreating ES documents from scratch.
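A minimal sketch of such a partial update, assuming a hypothetical `comment_count` field, just builds an Update API body that touches the one field instead of regenerating the document:

```python
def partial_update_body(field, value):
    """Build an Update API request body that touches a single field.

    "detect_noop" tells ES to skip the write entirely when the value hasn't
    changed, which matters for posts getting likes or comments every second.
    """
    return {"doc": {field: value}, "detect_noop": True}
```

The body goes to `POST /<index>/<type>/<id>/_update`; the expensive part we avoid is re-gathering every other field from our DBs.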

The other times when real time indexing became a problem was when something went wrong with the cluster. A couple of examples:

  • Occasionally when shards are getting relocated or initialized on a number of nodes the network can get swamped which backs up the indexing jobs or can cause a high proportion of them to start failing.
  • Query load can become too high if a non-performant query is released into production.
  • We make mistakes (shocking!) and break a few servers.
  • Occasionally we find ES bugs in production. In particular, these have been around deleteByQuery, probably because we run a lot of them.
  • A router or server fails.

Real time indexing couples other portions of our infrastructure to our indexing. If indexing gets slowed down for some reason we can create a heavy load on our DBs, or on our jobs system that runs the ES indexing (and a lot of other, more important things).

In my opinion, scaling real time indexing comes down to two pieces:

  1. How do we manage downtime and performance problems in Elasticsearch and decouple it from our other systems?
  2. When indexing fails (which it will), how do we recover and avoid bulk indexing the whole data set?

Managing Downtime

We mentioned in the first post of this series that we mark ES nodes as down if we receive particular errors from them (such as a connection error). Naturally, if a node is down, then the system has less capacity for handling indexing operations. The more nodes that are down the less indexing we can handle.

We implemented some simple heuristics to reduce the indexing load when we start to detect that a server has been down for more than a few minutes. Once triggered, we queue certain indexing jobs for later processing by just storing the blog IDs in a DB table. The longer any nodes are down, the fewer types of jobs we allow. As soon as any problems are found we disable bulk indexing of entire blogs. If problems persist for more than 5 minutes we start to disable reindexing of entire posts, and eventually we also turn off any updating of documents or deletions.
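The heuristics boil down to a function like this sketch. The job names and the final threshold are illustrative; only the immediate bulk-indexing cutoff and the 5-minute post-reindexing cutoff are the ones described above:

```python
def allowed_jobs(downtime_minutes):
    """Progressively shed indexing load the longer any ES node has been down.

    Jobs that are not allowed get their blog IDs queued in a DB table and are
    replayed once the cluster recovers.
    """
    jobs = {"bulk_blog_index", "post_reindex", "doc_update", "doc_delete"}
    if downtime_minutes > 0:
        jobs.discard("bulk_blog_index")       # shed bulk work as soon as trouble appears
    if downtime_minutes > 5:
        jobs.discard("post_reindex")          # then stop reindexing entire posts
    if downtime_minutes > 15:                 # illustrative threshold
        jobs -= {"doc_update", "doc_delete"}  # finally stop updates and deletes too
    return jobs
```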

Before implementing this indexing delay mechanism we had some cases where the real time indexing overwhelmed our system. Since implementing it we haven’t seen any, and we actually smoothly weathered a failure of one of the ES network routers while maintaining our full query load.

We of course also have some switches we can throw if we want to completely disable indexing and push all blog ids that need to be reindexed into our indexing queue.

Managing Indexing Failures

Managing failures means you need to define your goals for indexing:

  • Eventually ES will have the same data as the canonical data.
  • Minimize needing to bulk re-index everything.
  • Under normal operation the index is up to date within a minute.

There are a couple of different ways our indexing can fail:

  1. An individual indexing job crashes (eg an out of memory error when we try to index 2.3 million ids 🙂 ).
  2. An individual indexing job gets an error trying to index to ES.
  3. Our heuristics delay indexing by adding it to a queue.
  4. We introduce a bug to our indexing that affects a small subset of posts.
  5. We turn off real time indexing.

We have a few mechanisms that deal with these different problems.

  • Problems 1, 3, and 5: The previously mentioned indexing queue. This lets us pick up the pieces after any failure and prevent bulk reindexing everything.
  • Problem 2: When indexing jobs fail they are retried using an exponential back off mechanism (5 min, 10 min, 20 min, 40 min, …).
  • Problem 4: We run scrolling queries against the index to find the set of blogs that would have been affected by a bug, and bulk index only those blogs.
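The back-off schedule for problem 2 is simple to compute. A sketch, with the retry count capped arbitrarily here:

```python
def retry_delays(base_minutes=5, max_retries=5):
    """Exponential back-off schedule for failed indexing jobs, in minutes.

    Doubling from a 5-minute base gives 5, 10, 20, 40, ... so transient
    cluster problems get room to clear before the job is retried.
    """
    return [base_minutes * 2 ** attempt for attempt in range(max_retries)]
```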

It’s All About the Failures and Iteration

Looking back on what I wish we had done better, I think a recognition that indexing is all about handling the error conditions would have been the best advice I could have gotten. Getting systems for handling failures and for faster bulk indexing in place sooner would have helped a lot.

In the next part of the series I’ll talk about query performance and balancing global and local queries.

7 thoughts on “Scaling Elasticsearch Part 2: Indexing”

  1. This post is amazing, thank you!

    Can you tell me more about how and why you did this: “In our case we create one index for every 10 million blogs, with 25 shards per index.”


    • Scalability.

      As we add more blogs over time we need to also add more shards so that the shards won’t get too large. We ended up at 10 million blogs per index in order to ensure that blogs were spread evenly across shards and to minimize the chance of hotspots. For instance, when we imported a million-plus sites after Microsoft shut down Live Spaces, we got a million blogs that were more active than average. They continue to stay more active, so grouping 10 million together helps balance that variation out.

      To implement it we use index templates to auto create the indices when a client indexes to them. So the client just has to be smart enough to select what index name to use (eg posts-10m-20m, posts-20m-30m, etc), and the index gets created on the fly.


  2. Hi Greg, thanks for your post. When you say 10M blogs/index, document sizes will vary. In my case, the documents are logs ranging from 500-1000 bytes. Should the number of indices be decided by doc count or by size? What I mean is, 10M logs is relatively much smaller than 10M blogs. So is there any harm if I have 500M logs/index, at 500 bytes/log?


    • It is hard to give advice based on document size because how the documents are analyzed and stored can greatly affect how much disk space is used. Also, the total number of terms in analyzed fields can have a large effect on the index size.

      In our case we indexed a subset of our data (say 5 million blogs) and then looked at total bytes per million blogs to get a sense of how much total data we had. You need to be careful to stay below the maximum shard size. We found that above 20GB our shards started getting slow, and 40GB shards started producing real problems. I think the only way to really understand this limit is to test it with your real data, but maybe you can save yourself some time and try to keep your shards down in the 5-10GB range. I imagine what hardware you are using would also have an effect.

      One detail I only briefly touched on was deleted documents. As the shards get larger, the individual segments within each shard get close to the max of 5GB. Reaching that maximum can cause an explosion of deleted documents (up to 50% in a segment), which means your shards will actually be larger than you expect. Our deleted documents (and hence shard size) increased steadily for a couple of months before leveling off.

      In my next post I plan to cover this case and a few others in more depth.


        • Thanks very much for taking the time to reply, Greg. Yesterday I indexed ~600M docs (logs), 171GB, 10 partitions, 0 replicas, in 22 hours on a single node (Linux, 64GB memory of which 32GB is allocated to the Elasticsearch JVM heap, 24 processors) in my ELK setup. Obviously the throughput reduced as the index grew. In my case there is no document deletion scenario. Based on your suggestion I will create a new index for every 10GB. My documents will contain 10-15 fields, all indexed and stored by default.

        Looking forward more blogs from you on Elasticsearch. Thanks.

