Elasticsearch’s faceted results are a great way to analyze the contents of a set of documents. For over a year now, Polldaddy has used Elasticsearch to create reports of the most popular answers and words given in free-text survey responses. For more details, take a look at the feature announcement.
However, running faceted search on such a wide array of user data can be difficult. Faceted search in Elasticsearch can consume a lot of memory, which is why the ES documentation suggests you “make sure the number of unique tokens a field can have is not large”. To make sure we can accept arbitrary user input, we use a couple of tricks.
First, let’s take a look at the mapping we use for documents in the polldaddy-survey index:
{ "polldaddy-survey" : { "freetext" : { "_routing" : { "required" : true, "path" : "survey_id" }, "properties" : { "resp_id" : { "type" : "long", }, "survey_id" : { "type" : "long", }, "text" : { "type" : "string", "analyzer" : "analyzer_polldaddy", }, "text_string" : { "type" : "string", "index" : "not_analyzed", } } } } }
Each user’s response is its own document: the response is stored in its analyzed form in the text field and unanalyzed in the text_string field. (I should have come up with better names.) Running a terms facet query on these two fields gives us the most popular words and the most popular answers, respectively.
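As an illustration, here is a sketch of what such a terms facet request could look like. The survey ID (12345) and the facet names are hypothetical, but the fields match the mapping above:

curl -XGET 'http://localhost:9200/polldaddy-survey/freetext/_search' -d '{
    "query" : {
        "term" : { "survey_id" : 12345 }
    },
    "facets" : {
        "popular_words" : {
            "terms" : { "field" : "text", "size" : 10 }
        },
        "popular_answers" : {
            "terms" : { "field" : "text_string", "size" : 10 }
        }
    }
}'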
However, blindly doing this across the many millions of responses in Polldaddy would run into serious memory problems due to the overall size of the vocabularies in those fields. For that reason we use _routing to make sure that all documents belonging to a single survey land on the same shard. We then allocate a very large number of shards (100 in our case) so that each shard holds only a small slice of the unique terms. By routing a query to just one shard, the amount of memory that needs to be allocated is greatly reduced, and we can even handle surveys with decent-sized vocabularies.
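To make that concrete, here is a hedged sketch of both pieces: creating the index with a large shard count, and routing the search by survey ID (which matches the _routing path in the mapping). The URLs and the survey ID 12345 are again just examples:

# Allocate a large number of shards when creating the index:
curl -XPUT 'http://localhost:9200/polldaddy-survey' -d '{
    "settings" : {
        "number_of_shards" : 100
    }
}'

# Route the search so it hits only the one shard holding survey 12345
# (same request body as the facet query above):
curl -XGET 'http://localhost:9200/polldaddy-survey/freetext/_search?routing=12345' -d '{ ... }'

Because the mapping declares "_routing" : { "path" : "survey_id" }, documents are routed automatically at index time; the only thing we have to remember is to pass the same routing value on every search.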
So just how important is routing to a single shard? Well, at one point a bug snuck into our code and disabled the routing. Here’s what happened to the cache memory consumption from when it was broken to when it got fixed.

Boom! Fixed a bug.
A pretty dramatic change. Without the routing to a single shard, the cache would occasionally try to load a very large vocabulary and allocate 10+ GB, which of course slowed down all queries on the server.
But it could have been worse. Commenting on a previous post on this site, Bruno asked why I suggested setting index.cache.field.type: soft given that it reduces caching performance. This memory consumption behavior is why. Before setting the field cache to soft (and before adding the routing), these queries would sometimes consume so much memory that we would run out of the 24 GB we had allocated on the servers. In fact, it seemed like no matter how much memory we gave the servers, they would use all of it and hit OutOfMemory errors that brought the cluster to a painful halt. Setting the field cache to soft is really the only way I can ensure we won’t hit those conditions, regardless of what data gets entered into a poll.
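For reference, this is the relevant one-line setting; in the ES versions we were running it lived in elasticsearch.yml (these knobs move around between versions, so check the docs for yours):

# elasticsearch.yml: use soft references for the field cache so the JVM
# can reclaim cache memory under pressure instead of throwing OutOfMemory
index.cache.field.type: soft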
I will relish the day when there’s a good fix in ES for terms facet memory consumption (Issue #1531). There are so many great applications that can be built on top of faceted search; I’d just love not to have to worry about running out of memory because of a stray query.
A big thanks to Shay Banon (lead ES developer), who originally suggested that I use routing and a large number of shards. And of course many thanks to my colleagues on the Polldaddy team.