Most of this blog’s 40k visitors a year are looking at the epic Elasticsearch posts that I wrote years ago. For the most part they seem to still be relevant to people even if they are somewhat outdated. Here are my top posts with some commentary about each of them.
79% of my traffic comes from search engines, and almost 50% of all traffic goes to this one post. It’s actually kinda crazy that such a simple post gets so much of my traffic. I blame the clickbait headline. I have a bunch of long-winded epic posts, and what I should probably be writing is these small tidbits as they come up.
This is my all-time favorite post. After 2.x and the removal of the ability to specify an analyzer in a query it has become a bit outdated, but the overall concepts are still good. I love all the comments this post has generated, and I’ve learned so much from the discussions. We’ve accomplished a lot over the past year to adjust our multi-lingual indexing (deployed edge ngrams into an A/B test yesterday) and I’m hoping to write up my latest thinking soon.
3 and 4: Scaling Elasticsearch Series
The first two parts of this three-part series are my third and fourth most popular posts. The indexing post is almost twice as popular as the intro and querying posts. Although these posts are almost three years old now, they still describe pretty well how we scale most of our queries. The main reason these posts haven’t been updated is that the methods they describe have worked really well for us.
The original post talks about having 600 million posts in the index and 23m queries a day. We now have 4.3 billion posts and do about 45m queries a day. That’s some good scaling.
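To put those numbers in perspective, here is a quick back-of-the-envelope calculation. It uses only the figures quoted above; the average queries-per-second value assumes traffic is spread evenly across the day, which real traffic almost certainly is not.

```python
# Figures quoted in the post.
POSTS_THEN = 600_000_000
POSTS_NOW = 4_300_000_000
QUERIES_PER_DAY_THEN = 23_000_000
QUERIES_PER_DAY_NOW = 45_000_000

SECONDS_PER_DAY = 24 * 60 * 60


def growth_factor(before, after):
    """How many times larger `after` is than `before`."""
    return after / before


index_growth = growth_factor(POSTS_THEN, POSTS_NOW)
query_growth = growth_factor(QUERIES_PER_DAY_THEN, QUERIES_PER_DAY_NOW)

# Assumption: queries are evenly distributed over the day (peaks will be higher).
avg_qps_now = QUERIES_PER_DAY_NOW / SECONDS_PER_DAY

print(f"index grew {index_growth:.1f}x, queries grew {query_growth:.2f}x")
print(f"average load now: about {avg_qps_now:.0f} queries/second")
```

So the index is roughly seven times larger while query volume has about doubled.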
Only over the past year have we started to see some problems slowly develop with our global cluster scaling. Currently the cluster runs fine for about a month or so, and then heap usage creeps upwards until it starts to cause problems. The solution is just to do a rolling restart of the cluster. Not pretty, but it works. Here’s what our average heap usage looks like broken down by data center for the past 30 days.
We think a lot of these problems are just memory management bugs in the old Elasticsearch version we have been running for years, and we are hopeful that as we transition to 2.x many of them will be resolved. The other option is just to add more servers, which we haven’t done in a few years. Our typical load is not very high, though, until we reach the point of running out of heap, so I haven’t felt very justified in ordering more servers for this cluster yet.
One high point of this cluster is that it taught us how to run a multi-data-center cluster. Every cluster we deploy now is multi-data-center, and we have successfully survived cases where an entire data center went down. Currently we are in three data centers spread across the US. It’s likely that in 2017 we will start trying to run intercontinental Elasticsearch clusters (Europe and the US). Should be exciting.
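For anyone wanting to watch for the same kind of heap creep, here is a minimal sketch of how the check might look. It is not our actual tooling. It assumes a dict shaped like the response from the `_nodes/stats/jvm` API, where each node reports `jvm.mem.heap_used_percent`. The `dc` node attribute and the 85% threshold are made up for the example, so substitute whatever your own nodes are tagged with.

```python
from collections import defaultdict

# Hypothetical threshold; pick one that fits your own cluster.
HEAP_THRESHOLD_PCT = 85


def heap_by_data_center(node_stats, dc_attr="dc"):
    """Average heap_used_percent per data center.

    `dc_attr` is an assumed custom node attribute naming the data center.
    """
    samples = defaultdict(list)
    for node in node_stats["nodes"].values():
        dc = node.get("attributes", {}).get(dc_attr, "unknown")
        samples[dc].append(node["jvm"]["mem"]["heap_used_percent"])
    return {dc: sum(vals) / len(vals) for dc, vals in samples.items()}


def nodes_over_threshold(node_stats, threshold=HEAP_THRESHOLD_PCT):
    """Names of nodes whose heap usage is at or above the threshold."""
    return sorted(
        node["name"]
        for node in node_stats["nodes"].values()
        if node["jvm"]["mem"]["heap_used_percent"] >= threshold
    )


# Made-up sample in the assumed response shape.
sample_stats = {
    "nodes": {
        "id1": {"name": "es-a1", "attributes": {"dc": "east"},
                "jvm": {"mem": {"heap_used_percent": 62}}},
        "id2": {"name": "es-a2", "attributes": {"dc": "east"},
                "jvm": {"mem": {"heap_used_percent": 90}}},
        "id3": {"name": "es-b1", "attributes": {"dc": "west"},
                "jvm": {"mem": {"heap_used_percent": 71}}},
    }
}

averages = heap_by_data_center(sample_stats)
restart_candidates = nodes_over_threshold(sample_stats)
print(averages)
print(restart_candidates)
```

Nodes that show up in `restart_candidates` would be the first ones to cycle in a rolling restart.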
This post describes how we manage long restart times. 2.x is a bit faster in this regard, but still takes a while to synchronize, so this is still relevant to managing a production ES cluster.