Elasticsearch: The Broken Bits

Some friendly caveats before I dive in:

  • I wrote the first draft of this a bit after midnight, in the middle of a week where we weathered three ES outages on two different clusters, none of which were the fault of ES, but all of which could have been much less painful.
  • We’re currently running ES 1.3.4. We’ve been using ES since 0.17, and it has gotten immensely better along the way. We couldn’t build many features without it.
  • I know many of these problems are being actively worked on; I’m publishing this in the hope of providing more context for the series of GitHub issues I’ve submitted.

With that out of the way, I think ES has some significant problems when it comes to operating and iterating on a large cluster with terabytes of data. I think these problems contribute to misunderstandings about ES and end up giving developers and sys admins a bad experience. I think many of the fixes (though not all) require rethinking default configurations rather than implementing complex algorithms.

Shard Recovery When a Node Fails

This sort of complaint about ES seems to come up regularly.

Reading the original issue, the confusion is about how the “elastic” part of ES moves shards around. I’ll quote:

  1. Cluster has 3 data nodes, A, B, and C. The index has a replica count of 1…
  2. A crashes. B takes over as master, and then starts transferring data to C as a new replica.
  3. B crashes. C is now master with an impartial dataset.
  4. There is a write to the index.
  5. A and B finally reboot, and they are told that they are now stale. Both A and B delete their local data.
  6. all the data A and B had which C never got is lost forever.

Two of three nodes failed back to back, and there were only two copies of the data. The ES devs correctly pointed out that in that situation you should expect to lose data. However, the user lost almost all of the existing data because the shard was moved from A to C. Almost all of the data still existed on A’s disk, but it was deleted.

Elasticsearch is being too elastic. It’s great to reshuffle data when nodes get added to a cluster. When a node fails, though, reshuffling data is almost always a terrible idea. Let’s go through the timeline we see when an ES node fails in production:

  1. Cluster is happily processing millions of queries and index ops an hour…
  2. A node fails and drops out of the cluster
  3. The master allocates all the shards that were on the failed node to other nodes and starts recovery on them.
  4. Meanwhile: alerts are fired to ops, pagers go off, people start fixing things.
  5. The shards being recovered start pulling terabytes of data over 1Gbit connections. Recovering all shards will take hours.
  6. Ops fixes whatever minor hardware problem occurred and the down node rejoins the cluster.
  7. All of the data on the down node gets tossed away. That node has nothing to do.
  8. Eventually all of the data is recovered.
  9. Slowly, shards are relocated back to the rejoined node.
  10. About 10-16 hours later the cluster has finally recovered.

I’ve seen this happen at least a half dozen times. A single node going down triggers a half day event. Many nodes going down can take a day or two to recover.
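
One partial mitigation (not a fix): throttle recoveries so they don’t saturate the network while everything else is on fire. These settings exist in ES 1.x; the values below are illustrative assumptions, not recommendations.

```
# Hedged sketch: cap recovery bandwidth and concurrency so a node failure
# doesn't flood 1Gbit links. Values are assumptions; tune for your hardware.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "indices.recovery.max_bytes_per_sec": "40mb",
    "cluster.routing.allocation.node_concurrent_recoveries": 2
  }
}'
```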

There are at least a couple of ways things can go very wrong because of the above flow:

  • Because it takes days to recover all the shards of your data, another hardware failure (or human error) during that window leaves you without enough redundancy. As it currently stands, sooner or later we’ll hit a triple failure within a single week and lose data.
  • Moving that much data across the network puts a huge strain on your infrastructure. You can easily cause other problems when there is already something going wrong. A router going down and bringing down part of your ES cluster is exactly the wrong time to move around lots of data and further increase your network load.
  • When a node fails, its shards get allocated elsewhere. Very often this means that if enough nodes fail at once, you can end up running out of disk space. When I’ve got 99 shards and very little disk space on a node, why the %&#^ are you trying to put 40 more on it? I’ve watched in slow-motion horror as my disk space shrinks while I try to delete shards to force them onto other nodes, with very little success. (See the disk-allocation sketch after this list.)
  • Sometimes you won’t have enough replicas – oops, you made a mistake – but losing data does not need to be a binary event. Losing 5% of your data is not the same as losing 95%. Yes, you need to reindex in either case, but for many search applications you can work off a slightly stale index and users will never notice while you rebuild a new one.
  • Because a down node triggers moving shards around, you can’t just do a restart of your cluster; you need a procedure to restart your cluster. In our case that’s evolved into a 441-line bash script!
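
For the disk-space problem in particular, ES 1.x ships a disk-based allocation decider that can refuse to put shards on nearly-full nodes. A minimal sketch; setting names are from the 1.x docs, and the thresholds are illustrative, not recommendations:

```
# Hedged sketch: stop ES from allocating shards to nodes that are nearly
# out of disk. Thresholds are assumptions; defaults may differ by version.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": {
    "cluster.routing.allocation.disk.threshold_enabled": true,
    "cluster.routing.allocation.disk.watermark.low": "85%",
    "cluster.routing.allocation.disk.watermark.high": "90%"
  }
}'
```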

I could complain about this problem for a while, but you get the idea.

Why don’t we just recover the local data? By default ES should not move shards around when a node goes down. If something is wrong with the cluster, why take actions that will make the situation worse? I’ve slowly come to the conclusion that moving data around is the wrong approach.
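
There is a manual workaround, and it’s the usual trick for planned restarts: turn allocation off before a node goes down, and turn it back on once it rejoins. A minimal sketch using the 1.x cluster settings API:

```
# Hedged sketch: keep the master from reallocating shards while a node
# is down, so the rejoining node can reuse its local data.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "none" }
}'

# ... restart or repair the node, wait for it to rejoin ...

curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "transient": { "cluster.routing.allocation.enable": "all" }
}'
```

The catch is that this only helps when you know the outage is coming; an unplanned failure still triggers the full reshuffle described above.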

Cluster Restart Times

Actual IRC transcript between me and our head of systems:

gibrown – restarting ES cluster
gibrown – will take three days, cluster will be yellow during that time
bazza – maybe we should make `deploy wpcom` take 3 days 🙂
bazza – then we could call ourselves Elastic WordPress!

Ouch… that stung. At WordPress.com we do about 100-200 deploys a day, each updating thousands of servers in less than 20 seconds. We optimize for iteration speed and fixing problems quickly because performance problems in particular are hard to predict. Developing against Elasticsearch, in comparison, has been excruciatingly slow.

We recently did an upgrade from 1.2.1 to 1.3.4. It took 6 or 7 days to complete because ES replicas do not stay in sync with each other. As real-time indexing continues, each replica does its own indexing, so when a node gets restarted its shard recovery requires moving all of the modified segments from another node. After bulk indexing, this means moving your entire data set across the network once for each replica you have.
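
While one of these restarts crawls along, the _cat APIs (available in 1.x) at least make the progress visible. A sketch of the kind of thing worth watching:

```
# Hedged sketch: watch cluster color and per-shard recovery progress.
curl 'localhost:9200/_cat/health?v'      # cluster status, unassigned shard counts
curl 'localhost:9200/_cat/recovery?v'    # per-shard recovery source and progress
```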

Recently, we’ve been expanding our indexing from the 800 million docs we had to about 5 billion. I am not exaggerating when I say that we’ve spent at least two months out of the last twelve waiting for ES to restart after upgrading versions, changing our indexing, or trying to improve our configuration.

Resyncing across the network completely kills our productivity and how fast we can iterate. I’d go so far as to say it is the fundamental limit on our using ES in more applications. It feels like this syncing should at least partially happen in the background whenever the cluster is green.

Field Data and Heap

So you want to sort 1000 of your 5 billion documents by date? Sure we can do that, just hold on a second while I load all 5 billion dates into memory…

I didn’t fully appreciate this limitation until just recently. Field data has been a thorn in ES developers’ sides from the beginning. It is the single easiest way to crash your cluster, and I bet everyone has done it. The new-ish breakers help a lot, but they still don’t let me do what I’m trying to do. I don’t need 5 billion dates to sort 1000 docs, I only need 1000.
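
The breaker at least turns a heap-exploding query into a failed query. A sketch of setting the limit on our 1.3 cluster; the setting name is per the 1.3 docs (it was renamed to indices.breaker.fielddata.limit in later releases), and the percentage shown is just the documented default:

```
# Hedged sketch: cap how much heap field data may use before the breaker
# trips and rejects the query instead of OOMing the node.
curl -XPUT 'localhost:9200/_cluster/settings' -d '{
  "persistent": { "indices.fielddata.breaker.limit": "60%" }
}'
```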

I shouldn’t complain too loudly about this, because there is actually a solution! doc_values will store your fields on disk so that you don’t need to load all values into memory, only those that you need. They are a little slower, but slower is far better than non-functional. I think they should be the default.
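
A minimal sketch of what enabling doc_values looks like in a 1.x mapping, with hypothetical index and field names (in this era doc values were expressed as a fielddata format):

```
# Hedged sketch: store the date field's values on disk as doc values
# instead of loading every date into heap just to sort a page of results.
curl -XPUT 'localhost:9200/posts' -d '{
  "mappings": {
    "post": {
      "properties": {
        "date": {
          "type": "date",
          "fielddata": { "format": "doc_values" }
        }
      }
    }
  }
}'
```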

Onward

Elasticsearch has evolved a lot over the past few years, and I think it’s time to rethink a number of default settings to improve the configuration for dealing with large amounts of data. A lot of this comes down to thinking about the user experience of Elasticsearch. A lot of thought has clearly been put into the experience of a new user just starting out; that’s what makes ES so addictive. I think some more thought could be put into the experience of the sys admins and developers who iterate on large data sets.

Presentation: Elasticsearch at Automattic

I gave a presentation at the Elasticsearch Denver meetup last night covering some of the bigger changes we had to make to scale a cluster to handle all WordPress.com posts. This is a cherry-picked list from my more comprehensive ongoing series of posts chronicling our experience scaling Elasticsearch.

Presentation slides don’t always translate well to the web since there is a lot said in person that is not on the slides. Hopefully they’ll still be useful for the broader points I covered.


2013 in review

My personal blogging goal for this past year was to actually have an Annual Report that I would not be ashamed to look at. Goal Achieved!

A pretty large part of my traffic comes from search engines, for some phrase with the word “Elasticsearch” in it, but I also surpassed 200 followers recently. I only managed to publish nine times this year, though. Will have to work on that.

You can check out the full report below.

Here’s an excerpt:

The concert hall at the Sydney Opera House holds 2,700 people. This blog was viewed about 24,000 times in 2013. If it were a concert at Sydney Opera House, it would take about 9 sold-out performances for that many people to see it.

Click here to see the complete report.

Now live: WordPress.com VIP Search

Really love the search interface that Alley Interactive built for the new KFF site. All powered by Elasticsearch (and WordPress of course) behind the scenes.


WordPress’s standard search features are capable and easy to use, but when you’re developing search-driven web applications with WordPress, you need a tool ready-made for that purpose. That’s why today we’re introducing our new WordPress.com VIP Search add-on, and are excited to debut it as part of the relaunch of the Kaiser Family Foundation here on WordPress.com VIP.

WordPress.com VIP Search is a new premium service for our Cloud Hosting customers that delivers the features and flexibility of the powerful Elasticsearch software—all hosted, managed, and supported by the WordPress.com VIP team.

With VIP Search enabled, your search results will be more relevant and timely out of the box—but the real benefit is that developers can leverage this new functionality to deeply customize your search results, including support for faceted search. With faceted search features, your users can filter search results on your sites however you’d like—by type of content, category…


In addition to a notifications system that spans WordPress.com and WordPress sites with Jetpack installed, the new Jetpack also has a great JSON API.
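
For a taste of that API, here’s a hypothetical request against the public WordPress.com REST v1 endpoint (the site name is a placeholder):

```
# Hedged sketch: fetch recent posts from a site through the WordPress.com
# REST API that Jetpack exposes. Swap in a real site domain.
curl 'https://public-api.wordpress.com/rest/v1/sites/example.wordpress.com/posts/?number=2'
```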


Jetpack 1.9 is here. That’s right, it’s time for another big helping of Jetpack awesomeness. This release brings you Toolbar Notifications, Mobile Push Notifications, Custom CSS for mobile themes, a JSON API, and improvements to the Contact Form.

Notifications adds a menu to your toolbar that lets you read, moderate, and reply to comments from any page on your blog. Plus, if you find yourself on TechCrunch, GigaOm, or any of the millions of other sites running on WordPress.com, you’ll be able to view and moderate comments on your own site from the toolbar there, too.

Mobile Push Notifications for iOS: Users who link their accounts to WordPress.com and use WordPress for iOS 3.2 can now get push notifications of comments.

Custom CSS can now be applied to mobile themes.

The WordPress.com REST API is now available in Jetpack. That means developers can build cool applications that interact with…
