Top Five Posts from 2016

Most of this blog’s 40k visitors a year are looking at the epic Elasticsearch posts that I wrote years ago. For the most part they seem to still be relevant to people even if they are somewhat outdated. Here are my top posts with some commentary about each of them.

2014-emailteaser

1: Elasticsearch: Five Things I was Doing Wrong

79% of my traffic comes from search engines, and almost 50% of all traffic goes to this one post. It’s actually kinda crazy that such a simple post gets so much of my traffic. I blame the clickbait headline. I have a bunch of long winded epic posts and what I should probably be writing is these small tidbits as they come up.

2: Three Principles for Multilingal Indexing in Elasticsearch

This is my all time favorite post. After 2.x and the removal of being able to specify an analyzer in a query it has become a bit outdated, but the overall concepts are still good. I love all the comments this post has generated. I’ve learned so much from this post and the discussions that it generated. We’ve accomplished a lot the past year to adjust our multi-lingual indexing (deployed edgengrams into an A/B test yesterday) and I’m hoping to write up what my latest thinking is soon.

3 and 4: Scaling Elasticsearch Series

The first two parts of this three part series are my third and fourth most popular posts. The indexing post is almost twice as popular as the intro and querying posts. Although these posts are almost three years old now they still describe pretty well how we scale most of our queries. Most of the reason why these posts haven’t been updated is because the methods they describe have worked really well for us.

The original post talks about having 600 million posts in the index and 23m queries a day. We now have 4.3 billion posts and do about 45m queries a day. That’s some good scaling.

Only over the past year have we started to see some problems slowly develop with our global cluster scaling. Currently the cluster runs fine for about a month or so and then heap usage creeps upwards until it starts to cause problems. The solution is just to do rolling restart of the cluster. Not pretty, but it works. Here’s what our average heap usage looks like broken down by data center for the past 30 days.

Screen Shot 2017-01-06 at 9.28.48 AM.png

We think a lot of these are just memory management bugs in the old Elasticsearch version we have been running for years and are hopeful that as we transition to 2.x many of them will be resolved. The other option is just to add more servers which we haven’t done in a few years. Our typical load is not very high though until we reach the point of running out of heap so I haven’t felt very justified in ordering more servers for this cluster yet.

One high point of this cluster is it taught us how to run a multi data center cluster. Every cluster we deploy now is multi-data center and we have successfully survived cases where an entire data center goes down. Currently we are in three data centers spread across the US. It’s likely that in 2017 we will start trying to run intercontinental Elasticsearch clusters (Europe and the US). Should be exciting.

5: Managing Elasticsearch Cluster Restart Time

This post describes how we manage long restart times. 2.x is a bit faster in this regard, but still takes a while to synchronize, so this is still relevant to managing a production ES cluster.

 

 

Notes from The Printing Revolution

I’ve been reading The Printing Revolution in Early Modern Europe by Elizabeth Eisenstein on the transition of Europe from a scribal culture to a printing culture. In referencing Michael Clapham:

A man born in 1453, the year of the fall of Constantinople, could look back from his fiftieth year on a lifetime in which about eight million books had been printed, more perhaps than all the scribes of Europe had produced since Constantine founded his city in A.D. 330.

The printing press had an immense and hard to correlate impact on the past 600 years. How will we build the tools and discover the norms that will shape the next 600 years? Here are some quotes that jumped out at me from the third chapter on the “features of print culture”.

“Increased output and altered intake”

To consult different books it was no longer so essential to be a wandering scholar. Successive generations of sedentary scholars were less apt to be engrossed by a single text and expend their energies in elaborating on it. The era of the glossator and commentator came to an end, and a new “era of intense cross referencing between one book and another” began.

Merely by making more scrambled data available, by increasing the output of Aristotelian, Alexandrian, and Arabic texts, printers encouraged efforts to unscramble these data. Some medieval coastal maps had long been more accurate than many ancient ones, but few eyes had seen either.

Contradictions became more visible, divergent traditions more difficult to reconcile.

Printing encouraged forms of combinatory activity which were social as well as intellectual. It changed relationships between men of learning as well as between systems of ideas.

The new wide-angled, unfocused scholarship went together with a new single-minded, narrowly focused piety. At the same time, practical guidebooks and manuals also became more abundant, making it easier to lay plans for getting ahead in this world – possibly diverting attention from uncertain futures in the next one.

“Considering some effects produced by standardization”

The very act of publishing errata demonstrated a new capacity to locate textual errors with precision and to transmit this information simultaneously to scattered readers.

Sixteenth-century publications not only spread identical fashions but also encouraged the collection of diverse ones.

Concepts pertaining to uniformity and to diversity – to the typical and to the unique – are interdependent. They represent two sides of the same coin. In this regard one might consider the emergence of a new sense of individualism as a by-product of the new forms of standardization.

It’s interesting to think of printing and the uniformity it spread as the birth of individualism. Does the internet, by connecting highly dispersed but like minded people into tight niches bring about a reduction of individualism? Bubbles of conformity within your tribe?

no precedent existed for addressing a large crowd of people who were not gathered together in one place but were scattered in separate dwellings and who, as solitary individuals with divergent interests, were more receptive to intimate interchanges than to broad-gauged rhetorical effects.

There is simply no equivalent in scribal culture for the “avalanche” of “how-to” books which poured off the new presses

“Reorganizing texts and reference guides: rationalizing, codifying, and cataloguing data”

printers with regard to layout and presentation probably helped to reorganize the thinking of readers.

Basic changes in book format might well lead to changes in thought patterns… For example, printed reference works encouraged a repeated recourse to alphabetical order.

Alphabetical ordering. The simplest of sorting algorithms. Today it seems that has been taken by reverse chronological sorting. Only the most recent thing is important.

“From the corrupted copy to the improved edition”

A sequence of printed herbals beginning in the 1480s and going to 1526 reveals a “steady increase in the amount of distortion,” with the final product – an English herbal of 1526 – providing a “remarkably sad example of what happens to visual information as it passed from copyist to copyist.” … data tended to get garbled at an ever more rapid pace. But under the guidance of technically proficient masters, the new technology also provided a way of transcending the limits which scribal procedures had imposed upon technically proficient masters in the past.

fresh observations could at long last be duplicated without being blurred or blotted out over the course of time.

“Considering the preservative powers of print: fixity and cumulative change”

Of all the new features introduced by the duplicative powers of print, preservation is possibly the most important.

as edicts became more visible, they also became more irrevocable. Magna Carta, for example, was ostensibly “published”

Copying, memorizing, and transmitting absorbed fewer energies.

“Amplification and reinforcement: the persistence of stereotypes and of sociolinguistic divisions”

Both “stereotype” and “cliché” are terms deriving from typographical processes developed three and a half centuries after Gutenberg.

an unwitting collaboration between countless authors of new books and articles. For five hundred years, authors have jointly transmitted certain old messages with augmented frequency even while separately reporting on new events or spinning out new ideas.

 

Eisenstein, Elizabeth L. (2012-03-29). The Printing Revolution in Early Modern Europe (Canto Classics). Cambridge University Press. Kindle Edition.

Is This Post True?

There’s a lot of discussion about what constitutes fake news, what impact it has, and whether blocking it is the bigger threat.

I’d like to instead talk about the perception of truth. How do we change the norms of publishing so that a mistake is detected rather than amplified? How do we broaden the context that an article lives within and dynamically update it as time passes and others expand on the original article? In what context are writers presenting their reporting and opinions?

I started reading Neil Postman’s “Amusing Ourselves to Death” the other day. It was written in 1985, but feels very relevant today. I’m not that far into it, but the second chapter delves into the interplay between truth, and how the medium we are communicating in shapes that truth.

As a culture moves from orality to writing to printing to televising, its ideas of truth move with it. … Truth, like time itself, is a product of a conversation man has with himself about and through the techniques of communication he has invented.

that there is a content called “the news of the day”— was entirely created by the telegraph (and since amplified by newer media)… The news of the day is a figment of our technological imagination. … Cultures without speed-of-light media -— let us say, cultures in which smoke signals are the most efficient space-conquering tool available -— do not have news of the day.

It is not really a deep insight to say that the news of the minute, the “trending” news is something we have created with the systems we have built. Lots has (somewhat rightly) focused on Facebook and Google, but the Open Web is much larger than that.

We’ve lowered the barrier to publishing. We’ve changed the medium through which we express truth, but we haven’t really changed the norms or means by which we enable readers to judge truth.

Let’s compare how truth is perceived in some different mediums:
1. Newspaper journalistic standards in the 1980s (for instance) relied on “balance” and “unbiased” language and had separate sections for opinion. All of this got published by some “reputable” source. Some publisher with a long history that was known to the reader. And typically there was not really a lot of competition within any particular geographic region (though this hasn’t always been true) so it was pretty easy to know the range of writers and publishers.
2. Scientific journals rest on citations and peer review. Ostensibly the data/methods are all available so someone could reproduce the experiments. But doing so is often not easy. The perceived truth is determined both by the reputation of the journal (and by proxy, who reviewed it) as well as explicitly referencing other papers that may disagree. A paper that ignores existing literature is unlikely to get past reviewers.
3. For an article on the Web we borrow a lot of our norms from (1) and (2). We have also added a social validation of truth by displaying a count of likes, the number of comments, or the number of people who have shared an article. Rarely is anything shown about the people providing the social validation other than a simple count. Sometimes it is also where the content ranks on in a search, or perhaps what the post is linking to. But the text of a link and the entire content is entirely within the writers control.

Obviously, for all of these cases I am ignoring lots of background on all the work that people do in research, investigation, validation, and writing. The point here is not about what went into publishing it, but rather what does a new reader see? Why does a reader trust it? Of course the reader can judge the written evidence for themselves, but we know of a huge list of cognitive biases that the reader must contend with. Additionally, it takes a lot of time to research the truth of an article.

Both (1) and (2) do nothing to handle the case where the author has simply made a mistake. They have mechanisms for referring to articles published before the current one, but the web is capable of dynamically updating to refer to things published afterwards. An articles does not need to be static.

How do we take the best of (1) and (2) and adapt them into a world where the reader can better judge truth in (3)?

I don’t really have an answer. I’m not sure these are even the right questions, but let’s try an idea, right here and now since this post is currently asking you to determine its validity. Don’t just judge just my words. Judge my words against the words of others in the world. You’ve read this far, let’s add some related content to this article and see how it affects your personal search for the truth. Then let’s reconvene below and discuss some more…

 

Onward…

How has your perspective of this post changed? Did you open any of the links? Did you get lost in them? Were you overwhelmed by them? Do you feel that you can more accurately judge truth? Do you find my ideas more valid or less valid? Getting back to my click-bait headline: Is this post in accordance with fact or reality (true)?

I haven’t actually read all of those posts. I looked at a few. The Stanford Web Credibility Project from 2002 was very interesting and relevant. Its top recommendation is to:

1. Make it easy to verify the accuracy of the information on your site.

What if in order for a writer’s opinion or reporting of news to be considered a part of humanity’s search for truth that writer is expected to publish it alongside others’ content? What if publishers are expected to give others traffic in order for your words to have any weight? Why should I care what you have to say if you’re afraid to algorithmically link me to other people who are providing other opinions/insights? What if readers learn to instantly dismiss any article that is not willing to automatically link to others?

Displaying related content is a fundamental part of the web now, but so far we have mostly only used it to keep users on our own sites, or to make money with advertisements. Maybe there is another use case? Yes, the user could go to a search engine, but maybe we can improve the truth seeker’s user experience beyond that.

This related content would not need to be static. As posts link to yours, they may get weighted more heavily and show up on your post. So your post does not only link backward in time, it is a living document providing a link to how others have built on your work.

Some systems already try to do this. WordPress has pingbacks where a link from some other site generates a comment on the site it is linking to. It is an attempt to keep an old post connected to the conversation, but there is no weighting of it relative to others. It doesn’t really scale for a very popular post. And a post that is two hops back in the chain is not necessarily considered.

Of course a related content system still has potential problems of bias due to who controls the algorithms. Open sourcing the algorithms would help, as would having a standard mechanism where multiple providers can provide the service in a compatible way. Building trust here would be hard. Getting publishers to trust content from competing publishers to be inserted into their page would be especially difficult.

But maybe by changing the norms through which we judge truth we can get back to seeking truth together. Or maybe there are other ideas for how we should be presenting our articles to help humanity find truth.

 

No Handlebars

Was listening to the Flobots earlier on a run. Feels very appropriate to be thinking about the impact of technology on society and how to do better.

Uniform priors now seem much less appropriate to me than they used to. Time for this blog to be less neutral also. 

Great Links to Lucene/Solr Revolution 2015

A great list of links to slides from Mani Siva’s blog. A lot of things I’m currently thinking about: overlaying graphs on Elasticsearch to do content re-ranking, improving search relevancy, what is “fairness” in ranking and how you display content.

 

One of the annual conferences that I always look forward to is Lucene/Solr Revolution. The reason is not only because of highly technical nature of the conference, but also you can get a glimpse of how future of search is evolving in the open and how Solr is being pushed to its limits to handle […]

via Lucene/Solr Revolution 2015 — Mani Siva’s blog

The Walsh Standard v Automattic Creed

I’m reading Bill Walsh’s book The Score Takes Care of Itself on his methodology for getting the San Francisco 49ers to perform at a high level in the 1980s (and win 3 Super Bowls in the process), and I found it interesting how closely his Standard of Performance matches up against the Automattic Creed. I thought I would compare the two: Walsh’s clause in black, relavant Automattic line(s) in red, and some notes from me.

Exhibit a ferocious and intelligently applied work ethic directed at continual improvement;

I will never stop learning.

…the only way to get there is by putting one foot in front of another every day.

In both cases the first line is about learning.

demonstrate respect for each person in the organization and the work he or she does;

I will never pass up an opportunity to help out a colleague, and I’ll remember the days before I knew everything.

Not really a perfect match, but to help someone with humility implies respecting them and who they are.

be deeply committed to learning and teaching, which means increasing my own expertise;

I will never stop learning.

I will never pass up an opportunity to help out a colleague, and I’ll remember the days before I knew everything.

Learning and teaching go hand in hand.

be fair;

I will build our business sustainably through passionate and loyal customers.

I think its a stretch to say that the Creed mentions fairness so openly. But this line captures the fairness we strive to apply to our customers.

demonstrate character;

Really the whole Automattic Creed is about character. Being written in the first person means it is defining the character that we are all agreeing to try and match.

honor the direct connection between details and improvement, and relentlessly seek the latter;

I will never stop learning.

I am in a marathon, not a sprint, and no matter how far away the goal is, the only way to get there is by putting one foot in front of another every day

Systematic learning. I like Walsh’s focus on the details mattering though.

show self-control, especially where it counts most— under pressure;

I am more motivated by impact than money, and I know that Open Source is one of the most powerful ideas of our generation.

Choosing what to work on and how to work on it is all about self control.

demonstrate and prize loyalty;

We don’t really go for loyalty oaths. We have a great retention rate though and really quite a lot of loyalty and friendship. Getting together with Automatticians feels like getting together with family.

use positive language and have a positive attitude;

I’ll remember the days before I knew everything

This is really the closest thing we have to mentioning a positive attitude in the Creed.

take pride in my effort as an entity separate from the result of that effort;

Open Source is one of the most powerful ideas of our generation

I guess that kinda works… Again not strong overlap. “Pride” is actually the first line in the Automattic Designer’s Creed.

be willing to go the extra distance for the organization;

I am in a marathon, not a sprint

I think there is a pretty important difference in philosophy here. Walsh is pretty strongly into personal sacrifice in what I would call an unsustainable way — “If you’re up at 3 A.M. every night talking into a tape recorder and writing notes on scraps of paper, have a knot in your stomach and a rash on your skin, are losing sleep and losing touch with your wife and kids, have no appetite or sense of humor, and feel that everything might turn out wrong, then you’re probably doing the job.”

To me you are not doing the job at all. You’re going to write terrible code, poorly think it through, and have a really hard time sustaining this pace. Everyone has crunchtimes, but this is not a sustainable way to live. Nor is it a sustainable way to build products. I have similar problems with the Agile methodology and its focus on “sprints”.

Ironically though, I am writing this post when I should be asleep.

deal appropriately with victory and defeat, adulation and humiliation (don’t get crazy with victory nor dysfunctional with loss);

Given time, there is no problem that’s insurmountable.

Keep working at it, don’t give up. This can be really hard when you spend months fighting to fix one scaling problem after another and it is impossible to know where the end is.

promote internal communication that is both open and substantive (especially under stress);

I will communicate as much as possible, because it’s the oxygen of a distributed company

Open communication is the type of communication that really matters.

seek poise in myself and those I lead;

No real direct comparison. To me poise is keeping your cool when you realize that some code you launched is causing a performance problem about to bring down WordPress.com and how do you diligently and swiftly solve the problem.

put the team’s welfare and priorities ahead of my own;

I don’t really want a similar line in the Creed. Yes, I sacrifice some of the things I would love to be working on in order to pursue Automattic’s goals, but my welfare and priorities are pretty well aligned with the company’s.

 

maintain an ongoing level of concentration and focus that is abnormally high;

I am in a marathon, not a sprint, and no matter how far away the goal is, the only way to get there is by putting one foot in front of another every day

To me the level of concentration needed is to be able to focus on what is most important, say ‘no’ to the sorta important things, and get up the next day and make that correct decision yet again. I don’t make the correct decision every day, but hopefully I will quickly recognize when I make the wrong one.

and make sacrifice and commitment the organization’s trademark.

Given time, there is no problem that’s insurmountable

Again Walsh focuses a lot on sacrifice. I think that’s the wrong thing to focus on. There is certainly some involved in any endeavor, but the way he discusses it in his Standard of Performance, and elsewhere in the book doesn’t feel very sustainable.


 

There’s some interesting differences between the two, but the focus on teaching, openness, and being systematic in making progress stand out pretty strongly. For reference, here is the entire Automattic Creed:

I will never stop learning. I won’t just work on things that are assigned to me. I know there’s no such thing as a status quo. I will build our business sustainably through passionate and loyal customers. I will never pass up an opportunity to help out a colleague, and I’ll remember the days before I knew everything. I am more motivated by impact than money, and I know that Open Source is one of the most powerful ideas of our generation. I will communicate as much as possible, because it’s the oxygen of a distributed company. I am in a marathon, not a sprint, and no matter how far away the goal is, the only way to get there is by putting one foot in front of another every day. Given time, there is no problem that’s insurmountable.

 

Learning About Modern Neural Networks

I’ve been meaning to learn about modern neural networks and “deep learning” for a while now. Lots to read out there, but I finally found this great survey paper by Lipton and Berkowitz: A Critical Review of Recurrent Neural Networks for Sequence Learning[pdf]. It may be a little hard to follow if you haven’t previously learned basic feed forward neural networks and backpropagation, but for me it had just the right combination of both describing the math while also summarizing how the state of the art is applicable to various tasks (machine translation for instance).

Screen Shot 2015-12-28 at 7.41.13 AM

A two cell Long-Short-Term-Memory network unrolled across two time steps. Image from Lipton and Berkowitz.

I’ve been pretty fascinated with RNNs since reading “The Unreasonable Effectiveness of RNNs” earlier this year. A character based model that can be good enough to generate psuedo code where the parenthesis are correctly closed was really impressive. It was a very inspiring read, but still left me unable to really grok what is really different about the state of the art NNs. I finally feel like I’m starting to understand and I’ve gotten a few of the Tensorflow examples running and started to play with modifying them.

Deep learning seemed to jump onto the scene just after I finished my NLP Masters degree about 5 years ago. I hadn’t really found the time to fully understand it since then, but it feels like I’ve avoided really learning it for too long now. Given the huge investments Google, Facebook, and others are putting into building large scalable software systems or customizing hardware for processing NNs at scale, it no longer just seems like hype with clever naming.

If you’re interested in more reading, Jiwon Kim has a great list.

 

Six Use Cases of Elasticsearch in WordPress

I spoke at WordCamp US two weeks ago about six different use cases I’ve seen for Elasticsearch within the WordPress community.

I also mentioned two projects for WordPress.org that are planned for 2016 and involve Elasticsearch if anyone is looking for opportunities to learn to use Elasticsearch and contribute to WordPress:

The video is here:

 

And here are my slides:

2014 In Review

Another year at Automattic, another 3 million emails sent to users to point them at their annual reports. This is my fourth year helping to send the annual reports, and as usual it makes for an exhausting December for everyone involved. But its pretty rewarding seeing how much people like them on Twitter. This year I’m particularly excited to see tweets from non-English users because we’ve expanded our translation of the reports from 6 to 16 languages.

Our designer wrote up a summary of how he thinks of our report:

A billboard shines upon the magical city of Anyville, USA and tells the tale of one remarkable blogger’s dedication and their achievements throughout the year that was. As 2015 edges closer the citizens of Anyville stop and look up in awe at the greatness that is you.

Its so incredible working with clever designers who know how to make data speak to users.

Here’s what I took out of the 100,000 views from 58,000 visitors my blog had in 2014:

  • Multi-lingual indexing is a pain point for a lot of Elasticsearch users (ourselves included). I’ve learned so much from the comments on that post.
  • Everyone really likes that two year old post about things I did wrong with ES. I should write another one. I’ve done lots of other things wrong since then. 🙂
  • I am very inconsistent about when I post. I should try writing shorter posts. Writing a series of posts was a good idea, but a bit overwhelming.
  • About half of my traffic is from search engines.
  • I get a lot of traffic from folks posting links to me on Stack Overflow. Makes me really happy to see my posts helping answer so many questions.

Click here to see the complete report.

Elasticsearch: The Broken Bits

A post like this carries a heavy risk of this:

So, some friendly caveats:

  • I wrote the first draft of this a bit after midnight in the middle of a week where we weathered 3 ES outages on two different clusters, none of which were the fault of ES, but all of which could have been much less painful.
  • We’re currently running ES 1.3.4. We’ve been using ES since 0.17, and it has gotten immensely better along the way. We couldn’t build many features without it.
  • I know many of these problems are being actively worked on, I’m publishing this in the hopes of providing more context to the series of github issues I’ve submitted.

With that out of the way, I think ES has some significant problems when it comes to operating and iterating on a large cluster with terabytes of data. I think these problems contribute to misunderstandings about ES and end up giving developers and sys admins a bad experience. I think many of the fixes (though not all) require rethinking default configurations rather than implementing complex algorithms.

Shard Recovery When a Node Fails

This sort of complaint about ES seems to come up regularly:

Reading the original issue, the confusion is around how the “elastic” part of ES moves shards around. I’ll quote:

  1. Cluster has 3 data nodes, A, B, and C. The index has a replica count of 1…
  2. A crashes. B takes over as master, and then starts transferring data to C as a new replica.
  3. B crashes. C is now master with an impartial dataset.
  4. There is a write to the index.
  5. A and B finally reboot, and they are told that they are now stale. Both A and B delete their local data.
  6. all the data A and B had which C never got is lost forever.

Two of three nodes failed back to back, and there were only 2 replicas of the data. ES devs correctly pointed out that you should expect to lose data. However, the user lost almost all of the existing data because the shard was moved from A to C. Almost all of the data existed on the disk of A, but it was deleted.

Elasticsearch is being too elastic. Its great to reshuffle data when nodes get added to a cluster. When a node fails though, reshuffling data is almost always a terrible idea. Let’s go through a timeline we see when an ES node fails in production:

  1. Cluster is happily processing millions of queries and index ops an hour…
  2. A node fails and drops out of the cluster
  3. The master allocates all the shards that were on that node to another node and starts recovery on those nodes.
  4. Meanwhile: alerts are fired to ops, pagers go off, people start fixing things.
  5. The shards being recovered start pulling terabytes of data over 1Gbit connections. Recovering all shards will take hours.
  6. Ops fixes whatever minor hardware problem occurred and the down node rejoins the cluster.
  7. All of the data on the down node gets tossed away. That node has nothing to do.
  8. Eventually all of the data is recovered.
  9. Slowly, shards are relocated to the down node
  10. About 10-16 hours later the cluster has finally recovered.

I’ve seen this happen at least a half dozen times. A single node going down triggers a half day event. Many nodes going down can take a day or two to recover.

There’s at least a couple of ways things can go very wrong because of the above flow:

  • Because it takes days to recover all the shards of your data, if there is another hardware failure (or human error) during this time then you are at risk of not having enough redundancy of your data. As it currently stands eventually we’ll catch the triple event within a week that will cause us to lose data.
  • Moving that much data across the network puts a huge strain on your infrastructure. You can easily cause other problems when there is already something going wrong. A router going down and bringing down part of your ES cluster is exactly the wrong time to move around lots of data and further increase your network load.
  • When a node fails its shards get allocated elsewhere. Very often this means that if enough nodes fail at once, then you can end up running out of disk space. When I’ve got 99 shards and very little disk space on a node why the %&#^ are you trying to put 40 more on that node? I’ve watched in slow motion horror as my disk space shrinks as I try and delete shards to force them onto other nodes with very little success.
  • Sometimes you won’t have enough replicas – oops, you made a mistake – but losing data does not need to be a binary event. Losing 5% of your data is not the same as losing 95%. Yes, you need to reindex in either case, but for many search applications you can work off a slightly stale index and users will never notice while you rebuild a new one.
  • Because a down node triggers moving shards around, you can’t just do a restart of your cluster, you need a procedure to restart your cluster. In our case that’s evolved into a 441 line bash script!

I could complain about this problem for a while, but you get the idea.

Why don’t we just recover the local data? By default ES should not move shards around when a node goes down. If something is wrong with the cluster, why take actions that will make the situation worse? I’ve slowly come to the conclusion that moving data around is the wrong approach.

Cluster Restart Times

Actual IRC transcript between me and our head of systems:

gibrown – restarting ES cluster
gibrown – will take three days, cluster will be yellow during that time
bazza – maybe we should make `deploy wpcom` take 3 days 🙂
bazza – then we could call ourselves Elastic WordPress!

Ouch… that stung. At WordPress.com we do about 100-200 deploys a day and each updates thousands of servers in less than 20 seconds. We optimize for iteration speed and fixing problems quickly because performance problems in particular are hard to predict. Elasticsearch development in comparison has been excruciatingly slow.

We recently did an upgrade from 1.2.1 to 1.3.4. It took 6 or 7 days to complete because ES replicas do not stay in sync with each other. As real time indexing continues the replicas each do their own indexing so when a node get restarted its shard recovery requires moving all of the modified segments from another node. After bulk indexing, this means moving your entire data set across the network once for each replica you have.

Recently, we’ve been expanding our indexing from the 800 mil docs we had to about 5 billion. I am not exaggerating when I say that we’ve spent at least two months out of the last twelve waiting for ES to restart after upgrading versions, changing our indexing, or trying to improve our configuration.

Resyncing across the network completely kills our productivity and how fast we can iterate. I’d go so far as to say it is our fundamental limitation to using ES in more applications. It feels like this syncing should at least partially happen in the backgorund whenever the cluster is green.

Field Data and Heap

So you want to sort 1000 of your 5 billion documents by date? Sure we can do that, just hold on a second while I load all 5 billion dates into memory…

I didn’t fully appreciate this limitation until just recently. Field data has been a thorn in ES developers sides from the beginning. It is the single easiest way to crash your cluster, and I bet everyone has done it. The new-ish breakers help a lot, but they still don’t let me do what I’m trying to do. I don’t need 5 billion dates to sort 1000 docs, I only need 1000.

I shouldn’t complain too loudly about this, because there is actually a solution! doc_values will store your fields on disk so that you don’t need to load all values into memory, only those that you need. They are a little slower, but slower is far better than non-functional. I think they should be the default.

Onward

Elasticsearch has evolved a lot over the past years, I think its time to rethink a number of default settings to improve the configuration when dealing with large amounts of data. A lot of this comes down to thinking about the user experience of Elasticsearch. A lot of thought has clearly been put into the user experience for a new user just starting out. That’s what makes ES so addictive. I think some more thought could be put into the experience of sys admins and developers who are iterating on large data sets.