No Handlebars

I was listening to the Flobots earlier on a run. It feels very appropriate to be thinking about the impact of technology on society and how to do better.

Uniform priors now seem much less appropriate to me than they used to. Time for this blog to be less neutral also. 

Great Links to Lucene/Solr Revolution 2015

A great list of links to slides, via Mani Siva’s blog. It touches on a lot of things I’m currently thinking about: overlaying graphs on Elasticsearch to do content re-ranking, improving search relevancy, and what “fairness” means in ranking and in how you display content.

 

One of the annual conferences that I always look forward to is Lucene/Solr Revolution. The reason is not only because of highly technical nature of the conference, but also you can get a glimpse of how future of search is evolving in the open and how Solr is being pushed to its limits to handle […]

via Lucene/Solr Revolution 2015 — Mani Siva’s blog

The Walsh Standard v Automattic Creed

I’m reading Bill Walsh’s book The Score Takes Care of Itself on his methodology for getting the San Francisco 49ers to perform at a high level in the 1980s (and win 3 Super Bowls in the process), and I found it interesting how closely his Standard of Performance matches up against the Automattic Creed. I thought I would compare the two: Walsh’s clause in black, relevant Automattic line(s) in red, and some notes from me.

Exhibit a ferocious and intelligently applied work ethic directed at continual improvement;

I will never stop learning.

…the only way to get there is by putting one foot in front of another every day.

In both cases the first line is about learning.

demonstrate respect for each person in the organization and the work he or she does;

I will never pass up an opportunity to help out a colleague, and I’ll remember the days before I knew everything.

Not really a perfect match, but to help someone with humility implies respecting them and who they are.

be deeply committed to learning and teaching, which means increasing my own expertise;

I will never stop learning.

I will never pass up an opportunity to help out a colleague, and I’ll remember the days before I knew everything.

Learning and teaching go hand in hand.

be fair;

I will build our business sustainably through passionate and loyal customers.

I think it’s a stretch to say that the Creed mentions fairness so openly. But this line captures the fairness we strive to apply to our customers.

demonstrate character;

Really the whole Automattic Creed is about character. Being written in the first person means it defines the character that we are all agreeing to try to match.

honor the direct connection between details and improvement, and relentlessly seek the latter;

I will never stop learning.

I am in a marathon, not a sprint, and no matter how far away the goal is, the only way to get there is by putting one foot in front of another every day

Systematic learning. I like Walsh’s focus on the details mattering though.

show self-control, especially where it counts most— under pressure;

I am more motivated by impact than money, and I know that Open Source is one of the most powerful ideas of our generation.

Choosing what to work on and how to work on it is all about self-control.

demonstrate and prize loyalty;

We don’t really go for loyalty oaths. We have a great retention rate though and really quite a lot of loyalty and friendship. Getting together with Automatticians feels like getting together with family.

use positive language and have a positive attitude;

I’ll remember the days before I knew everything

This is really the closest thing we have to mentioning a positive attitude in the Creed.

take pride in my effort as an entity separate from the result of that effort;

Open Source is one of the most powerful ideas of our generation

I guess that kinda works… Again, not a strong overlap. “Pride” is actually the first line in the Automattic Designer’s Creed.

be willing to go the extra distance for the organization;

I am in a marathon, not a sprint

I think there is a pretty important difference in philosophy here. Walsh is pretty strongly into personal sacrifice in what I would call an unsustainable way — “If you’re up at 3 A.M. every night talking into a tape recorder and writing notes on scraps of paper, have a knot in your stomach and a rash on your skin, are losing sleep and losing touch with your wife and kids, have no appetite or sense of humor, and feel that everything might turn out wrong, then you’re probably doing the job.”

To me, you are not doing the job at all. You’re going to write terrible code, think it through poorly, and have a really hard time sustaining this pace. Everyone has crunch times, but this is not a sustainable way to live. Nor is it a sustainable way to build products. I have similar problems with the Agile methodology and its focus on “sprints”.

Ironically though, I am writing this post when I should be asleep.

deal appropriately with victory and defeat, adulation and humiliation (don’t get crazy with victory nor dysfunctional with loss);

Given time, there is no problem that’s insurmountable.

Keep working at it, don’t give up. This can be really hard when you spend months fighting to fix one scaling problem after another and it is impossible to know where the end is.

promote internal communication that is both open and substantive (especially under stress);

I will communicate as much as possible, because it’s the oxygen of a distributed company

Open communication is the type of communication that really matters.

seek poise in myself and those I lead;

No real direct comparison. To me, poise is keeping your cool when you realize that some code you launched is causing a performance problem that is about to bring down WordPress.com, and then diligently and swiftly solving the problem.

put the team’s welfare and priorities ahead of my own;

I don’t really want a similar line in the Creed. Yes, I sacrifice some of the things I would love to be working on in order to pursue Automattic’s goals, but my welfare and priorities are pretty well aligned with the company’s.

 

maintain an ongoing level of concentration and focus that is abnormally high;

I am in a marathon, not a sprint, and no matter how far away the goal is, the only way to get there is by putting one foot in front of another every day

To me the level of concentration needed is to be able to focus on what is most important, say ‘no’ to the sorta important things, and get up the next day and make that correct decision yet again. I don’t make the correct decision every day, but hopefully I will quickly recognize when I make the wrong one.

and make sacrifice and commitment the organization’s trademark.

Given time, there is no problem that’s insurmountable

Again Walsh focuses a lot on sacrifice. I think that’s the wrong thing to focus on. There is certainly some sacrifice involved in any endeavor, but the way he discusses it in his Standard of Performance and elsewhere in the book doesn’t feel very sustainable.


 

There are some interesting differences between the two, but the focus on teaching, openness, and being systematic about making progress stands out pretty strongly. For reference, here is the entire Automattic Creed:

I will never stop learning. I won’t just work on things that are assigned to me. I know there’s no such thing as a status quo. I will build our business sustainably through passionate and loyal customers. I will never pass up an opportunity to help out a colleague, and I’ll remember the days before I knew everything. I am more motivated by impact than money, and I know that Open Source is one of the most powerful ideas of our generation. I will communicate as much as possible, because it’s the oxygen of a distributed company. I am in a marathon, not a sprint, and no matter how far away the goal is, the only way to get there is by putting one foot in front of another every day. Given time, there is no problem that’s insurmountable.

 

Learning About Modern Neural Networks

I’ve been meaning to learn about modern neural networks and “deep learning” for a while now. Lots to read out there, but I finally found this great survey paper by Lipton and Berkowitz: A Critical Review of Recurrent Neural Networks for Sequence Learning [pdf]. It may be a little hard to follow if you haven’t previously learned basic feed-forward neural networks and backpropagation, but for me it had just the right combination of describing the math and summarizing how the state of the art applies to various tasks (machine translation, for instance).

A two-cell Long Short-Term Memory network unrolled across two time steps. Image from Lipton and Berkowitz.

I’ve been pretty fascinated with RNNs since reading “The Unreasonable Effectiveness of RNNs” earlier this year. A character-based model that is good enough to generate pseudo-code with correctly closed parentheses was really impressive. It was a very inspiring read, but it still left me unable to really grok what is different about state-of-the-art NNs. I finally feel like I’m starting to understand, and I’ve gotten a few of the Tensorflow examples running and started to play with modifying them.
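
For anyone who wants the nuts and bolts, here is a minimal numpy sketch of a single LSTM step along the lines of what Lipton and Berkowitz describe, unrolled across two time steps like the figure above. The gate equations are the standard ones; the weight shapes, sizes, and names are just mine for illustration.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM time step: x is the input vector, (h_prev, c_prev) the previous
    hidden and cell states, W and b the stacked gate weights and biases."""
    z = np.dot(W, np.concatenate([x, h_prev])) + b
    i, f, o, g = np.split(z, 4)          # input, forget, output gates + candidate
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g               # cell state carries the long-range memory
    h = o * np.tanh(c)                   # hidden state is what the next layer sees
    return h, c

# Unroll across two time steps, as in the figure (illustrative sizes).
hidden, inputs = 2, 3
W = np.random.randn(4 * hidden, inputs + hidden) * 0.1
b = np.zeros(4 * hidden)
h, c = np.zeros(hidden), np.zeros(hidden)
for x in [np.random.randn(inputs), np.random.randn(inputs)]:
    h, c = lstm_step(x, h, c, W, b)
```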

Deep learning seemed to jump onto the scene just after I finished my NLP master’s degree about 5 years ago. I hadn’t really found the time to fully understand it since then, but it feels like I’ve avoided really learning it for too long now. Given the huge investments Google, Facebook, and others are putting into building large, scalable software systems and customizing hardware for processing NNs at scale, it no longer seems like just hype with clever naming.

If you’re interested in more reading, Jiwon Kim has a great list.

 

Six Use Cases of Elasticsearch in WordPress

I spoke at WordCamp US two weeks ago about six different use cases I’ve seen for Elasticsearch within the WordPress community.

I also mentioned two projects for WordPress.org that are planned for 2016 and involve Elasticsearch, in case anyone is looking for an opportunity to learn Elasticsearch and contribute to WordPress:

The video is here:

 

And here are my slides:

2014 In Review

Another year at Automattic, another 3 million emails sent to users to point them at their annual reports. This is my fourth year helping to send the annual reports, and as usual it makes for an exhausting December for everyone involved. But it’s pretty rewarding seeing how much people like them on Twitter. This year I’m particularly excited to see tweets from non-English users, because we’ve expanded our translation of the reports from 6 to 16 languages.

Our designer wrote up a summary of how he thinks of our report:

A billboard shines upon the magical city of Anyville, USA and tells the tale of one remarkable blogger’s dedication and their achievements throughout the year that was. As 2015 edges closer the citizens of Anyville stop and look up in awe at the greatness that is you.

It’s so incredible working with clever designers who know how to make data speak to users.

Here’s what I took out of the 100,000 views from 58,000 visitors my blog had in 2014:

  • Multi-lingual indexing is a pain point for a lot of Elasticsearch users (ourselves included). I’ve learned so much from the comments on that post.
  • Everyone really likes that two-year-old post about things I did wrong with ES. I should write another one. I’ve done lots of other things wrong since then. 🙂
  • I am very inconsistent about when I post. I should try writing shorter posts. Writing a series of posts was a good idea, but a bit overwhelming.
  • About half of my traffic is from search engines.
  • I get a lot of traffic from folks posting links to me on Stack Overflow. Makes me really happy to see my posts helping answer so many questions.

Click here to see the complete report.

Elasticsearch: The Broken Bits

A post like this carries a heavy risk of this:

So, some friendly caveats:

  • I wrote the first draft of this a bit after midnight in the middle of a week where we weathered 3 ES outages on two different clusters, none of which were the fault of ES, but all of which could have been much less painful.
  • We’re currently running ES 1.3.4. We’ve been using ES since 0.17, and it has gotten immensely better along the way. We couldn’t build many features without it.
  • I know many of these problems are being actively worked on; I’m publishing this in the hope of providing more context for the series of GitHub issues I’ve submitted.

With that out of the way, I think ES has some significant problems when it comes to operating and iterating on a large cluster with terabytes of data. I think these problems contribute to misunderstandings about ES and end up giving developers and sys admins a bad experience. I think many of the fixes (though not all) require rethinking default configurations rather than implementing complex algorithms.

Shard Recovery When a Node Fails

This sort of complaint about ES seems to come up regularly:

Reading the original issue, the confusion is around how the “elastic” part of ES moves shards around. I’ll quote:

  1. Cluster has 3 data nodes, A, B, and C. The index has a replica count of 1…
  2. A crashes. B takes over as master, and then starts transferring data to C as a new replica.
  3. B crashes. C is now master with an impartial dataset.
  4. There is a write to the index.
  5. A and B finally reboot, and they are told that they are now stale. Both A and B delete their local data.
  6. all the data A and B had which C never got is lost forever.

Two of three nodes failed back to back, and there were only 2 replicas of the data. ES devs correctly pointed out that you should expect to lose data. However, the user lost almost all of the existing data because the shard was moved from A to C. Almost all of the data existed on the disk of A, but it was deleted.

Elasticsearch is being too elastic. It’s great to reshuffle data when nodes get added to a cluster. When a node fails though, reshuffling data is almost always a terrible idea. Let’s go through a timeline we see when an ES node fails in production:

  1. Cluster is happily processing millions of queries and index ops an hour…
  2. A node fails and drops out of the cluster
  3. The master allocates all the shards that were on that node to other nodes and starts recovery on them.
  4. Meanwhile: alerts are fired to ops, pagers go off, people start fixing things.
  5. The shards being recovered start pulling terabytes of data over 1Gbit connections. Recovering all shards will take hours.
  6. Ops fixes whatever minor hardware problem occurred and the down node rejoins the cluster.
  7. All of the data on the down node gets tossed away. That node has nothing to do.
  8. Eventually all of the data is recovered.
  9. Slowly, shards are relocated back to the previously down node.
  10. About 10-16 hours later the cluster has finally recovered.

I’ve seen this happen at least a half dozen times. A single node going down triggers a half day event. Many nodes going down can take a day or two to recover.

There are at least a couple of ways things can go very wrong because of the above flow:

  • Because it takes days to recover all the shards of your data, if there is another hardware failure (or human error) during this time then you are at risk of not having enough redundancy for your data. As it currently stands, we’ll eventually catch the triple failure within a week that will cause us to lose data.
  • Moving that much data across the network puts a huge strain on your infrastructure. You can easily cause other problems when there is already something going wrong. A router going down and bringing down part of your ES cluster is exactly the wrong time to move around lots of data and further increase your network load.
  • When a node fails, its shards get allocated elsewhere. Very often this means that if enough nodes fail at once, you can end up running out of disk space. When I’ve got 99 shards and very little disk space on a node, why the %&#^ are you trying to put 40 more on that node? I’ve watched in slow-motion horror as my disk space shrinks while I try to delete shards to force them onto other nodes, with very little success. (There’s a sketch of the disk-threshold settings after this list.)
  • Sometimes you won’t have enough replicas – oops, you made a mistake – but losing data does not need to be a binary event. Losing 5% of your data is not the same as losing 95%. Yes, you need to reindex in either case, but for many search applications you can work off a slightly stale index and users will never notice while you rebuild a new one.
  • Because a down node triggers moving shards around, you can’t just do a restart of your cluster; you need a procedure to restart your cluster. In our case that’s evolved into a 441-line bash script!
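
On the disk-space point above: the knob that is supposed to help here is the disk-threshold allocation decider. Here is a minimal sketch of setting it, assuming a cluster reachable at localhost:9200; the watermark percentages are illustrative, and you should check the docs for your ES version since the defaults and availability have shifted across the 1.x line.

```python
import requests

ES = "http://localhost:9200"

# Illustrative disk-based allocation thresholds: stop sending new shards to
# nodes that are nearly full, and start moving shards off nodes that are fuller.
settings = {
    "transient": {
        "cluster.routing.allocation.disk.threshold_enabled": True,
        "cluster.routing.allocation.disk.watermark.low": "80%",
        "cluster.routing.allocation.disk.watermark.high": "90%",
    }
}
requests.put(ES + "/_cluster/settings", json=settings).raise_for_status()
```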

I could complain about this problem for a while, but you get the idea.

Why don’t we just recover the local data? By default ES should not move shards around when a node goes down. If something is wrong with the cluster, why take actions that will make the situation worse? I’ve slowly come to the conclusion that moving data around is the wrong approach.
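
One partial mitigation, at least for planned maintenance, is to tell the cluster not to reallocate anything while a node is deliberately down. A minimal sketch using the cluster settings API, assuming a node reachable at localhost:9200:

```python
import requests

ES = "http://localhost:9200"

def set_allocation(mode):
    # mode: "none" while a node is deliberately down, "all" once it is back.
    body = {"transient": {"cluster.routing.allocation.enable": mode}}
    requests.put(ES + "/_cluster/settings", json=body).raise_for_status()

set_allocation("none")   # before taking the node down
# ... restart or repair the node, wait for it to rejoin ...
set_allocation("all")    # re-enable allocation so shards can be assigned again
```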

Cluster Restart Times

Actual IRC transcript between me and our head of systems:

gibrown – restarting ES cluster
gibrown – will take three days, cluster will be yellow during that time
bazza – maybe we should make `deploy wpcom` take 3 days 🙂
bazza – then we could call ourselves Elastic WordPress!

Ouch… that stung. At WordPress.com we do about 100-200 deploys a day, and each one updates thousands of servers in less than 20 seconds. We optimize for iteration speed and fixing problems quickly, because performance problems in particular are hard to predict. Elasticsearch development, in comparison, has been excruciatingly slow.

We recently did an upgrade from 1.2.1 to 1.3.4. It took 6 or 7 days to complete because ES replicas do not stay in sync with each other. As real-time indexing continues, the replicas each do their own indexing, so when a node gets restarted its shard recovery requires moving all of the modified segments from another node. After bulk indexing, this means moving your entire data set across the network once for each replica you have.

Recently, we’ve been expanding our indexing from the 800 million docs we had to about 5 billion. I am not exaggerating when I say that we’ve spent at least two months out of the last twelve waiting for ES to restart after upgrading versions, changing our indexing, or trying to improve our configuration.

Resyncing across the network completely kills our productivity and how fast we can iterate. I’d go so far as to say it is the fundamental limitation on using ES in more applications. It feels like this syncing should at least partially happen in the background whenever the cluster is green.
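
While waiting out one of these recoveries, about the only visibility available is the cluster health endpoint. Here is a rough sketch of the kind of polling loop one might use to watch progress; the host and interval are illustrative.

```python
import time
import requests

ES = "http://localhost:9200"

# Poll cluster health until the cluster is green, printing recovery progress.
while True:
    health = requests.get(ES + "/_cluster/health").json()
    print("{status}: {initializing_shards} initializing, "
          "{relocating_shards} relocating, "
          "{unassigned_shards} unassigned".format(**health))
    if health["status"] == "green":
        break
    time.sleep(60)
```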

Field Data and Heap

So you want to sort 1000 of your 5 billion documents by date? Sure we can do that, just hold on a second while I load all 5 billion dates into memory…

I didn’t fully appreciate this limitation until just recently. Field data has been a thorn in ES developers’ sides from the beginning. It is the single easiest way to crash your cluster, and I bet everyone has done it. The new-ish circuit breakers help a lot, but they still don’t let me do what I’m trying to do. I don’t need 5 billion dates to sort 1000 docs; I only need 1000.

I shouldn’t complain too loudly about this, because there is actually a solution! doc_values will store your fields on disk so that you don’t need to load all values into memory, only those that you need. They are a little slower, but slower is far better than non-functional. I think they should be the default.
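
For what it’s worth, here is a minimal sketch of what opting a field into doc_values looks like when creating an index. The index and type names are made up, and on some 1.x releases the equivalent was expressed through the fielddata format setting, so treat this as illustrative rather than exact.

```python
import requests

ES = "http://localhost:9200"

# Hypothetical index where the date field is backed by doc_values, so sorting
# and aggregating read from disk instead of loading every value into heap.
index = {
    "mappings": {
        "post": {
            "properties": {
                "date": {"type": "date", "doc_values": True}
            }
        }
    }
}
requests.put(ES + "/posts-doc-values-demo", json=index).raise_for_status()
```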

Onward

Elasticsearch has evolved a lot over the past few years, and I think it’s time to rethink a number of default settings to improve the configuration when dealing with large amounts of data. A lot of this comes down to thinking about the user experience of Elasticsearch. A lot of thought has clearly been put into the user experience for a new user just starting out; that’s what makes ES so addictive. I think some more thought could be put into the experience of sys admins and developers who are iterating on large data sets.