Combining Search Scores: Winning and Failing

Trey Jones at the Wikimedia Foundation published some very interesting notes about how to think about combining scores for search ranking (particularly in Elasticsearch). I like this insight a lot:

addition is looking for ways to win, multiplication is looking for ways to fail

This is pretty interesting to me when thinking about how I chose to implement the ranking for the WordPress.org plugin search. Applying this insight to the way I combined signals in that ranking function yields a couple of interesting observations:

  • The text matching features (phrases, title matches, etc) are looking for ways to win and boost the score. This was a pretty explicit goal of mine, but also partly driven by decoupling the matching of text from boosting on text.
  • All of the other signals are looking for ways to fail: not updating the plugin, not testing it against the latest version of WordPress, not resolving support threads, etc. There is some boosting too, but we do a lot to lower scores, which may be related to some of the exact matching problems I am still looking at (especially after result number 10).

I’m not sure whether this is good or bad; it’s just an interesting model for thinking about the problem and something I need to think about some more. It roughly matches the intuition that led me to separate matching text from boosting text with individual features.
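To make the distinction concrete, here is a hedged sketch (not the actual WordPress.org ranking function; the index, field names, and factors are made up, using older 0.90/1.x-era function_score syntax). With score_mode set to sum, every matching function can only add to the score, so each signal is another way to win. Switch score_mode to multiply and the 0.5 factor halves the final score no matter what else matched, so each signal becomes another way to fail.

POST plugins/plugin/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "title": "contact form" } },
      "functions": [
        { "filter": { "term": { "tested_on_latest": true } }, "boost_factor": 1.5 },
        { "filter": { "term": { "support_resolved": false } }, "boost_factor": 0.5 }
      ],
      "score_mode": "sum",
      "boost_mode": "multiply"
    }
  }
}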

I also need to think more about whether I am using the right operations for weighting the different scores. There are a lot of great thoughts in these notes, and Trey has a bunch of other notes that look interesting as well.

Also it reminds me how great it is to have notes published for others to look at.

 

Scaling Elasticsearch Part 3: Queries

See part 1 and part 2 for an overview of our system and how we scale our indexing. Originally I was planning separate posts for global queries and related posts queries, but they were hard to split apart, which contributed to me taking forever to write them.

Two types of queries run on the WordPress.com Elasticsearch index: global queries across all posts and local queries that search posts within a single blog. Over 90% of our queries (23 million/day) are local, and the remainder are global (2 million/day).

In this post we’ll show some performance data for global and local queries, and discuss the tradeoffs we made to improve their performance.

An Aside about Query Testing

We used JMeter for most of our query testing. Since we are comparing global and local queries, it was important to get a variety of different queries to run and to run them on mostly full indices (though sometimes ongoing development made that impractical).

Generally we ran real user queries randomly sampled from our search logs. For local queries we ran related-posts-style queries, with the posts pseudo-randomly selected from the index. A fair bit of hand curation of the queries was needed due to errors in our sampling techniques and the occasional bad query type that slipped through. Generating a good mix of queries is a combination of art and deciding that things are good enough.

A Few Mapping Details

I’ve already written a post on how we map WordPress posts to ES documents (and wpes-lib is on github), and there’s a description of how we handle multi-lingual data. Mostly those details are irrelevant for scaling our queries, but there are two important details:

  1. All post documents are child documents of blog documents. The blog documents record meta information about the blog.
  2. Child documents are always stored on the same shard as their parent document because they share the same routing value. This gives us a huge optimization: a local search can pass that routing value with the query so it executes on a single node and touches only a single shard.
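As a hedged illustration of how this fits together (ids, field names, and the index name are made up, and it assumes the post type's mapping declares blog as its _parent), indexing and then searching with routing looks roughly like:

# create the parent blog document
PUT global-0m-10m/blog/12345
{ "public": true, "spam": false, "mature": false }

# index a post as a child of blog 12345; with a _parent mapping in place the
# parent value is also used as the routing value, so the post lands on the blog's shard
PUT global-0m-10m/post/6789?parent=12345
{ "blog_id": 12345, "post_id": 6789, "title": "Hello world" }

# a local search passes the same value as routing and therefore hits only one shard
POST global-0m-10m/post/_search?routing=12345
{ "query": { "match": { "title": "hello" } } }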

The Global Query

The original plan for our global queries was to use the parent blog documents to track meta information about each blog that determines whether its posts are globally searchable (whether they are publicly accessible, have mature content, are spam, etc). By updating the blog document we'd be able to change a single document rather than reindexing all the posts on the blog. Then we'd apply a has_parent filter to all global queries.

Unfortunately we found that parent/child filters did not scale well for our purposes. All parent document ids have to be loaded into memory (about 5.3 GB of data per node for our 60 million blog documents). We had plenty of RAM available for ES (30 GB) and the id cache seemed to hold steady, but we still saw much slower queries and higher server load when using the has_parent filter. Here's the data from the final parent/child performance test we ran (note that this was on a 16 node cluster):

Using has_parent | JMeter threads | Reqs/sec | Median latency (ms) | Mean latency (ms) | Max latency (ms) | CPU % | Load
yes              | 2              | 5.3      | 538                 | 365               | 2400             | 45    | 12
no               | 2              | 10       | 338                 | 190               | 1500             | 30    | 8
yes              | 4              | 7.6      | 503                 | 514               | 2500             | 75    | 22
no               | 4              | 17.5     | 207                 | 222               | 2100             | 50    | 15

Avoiding the has_parent filter saved us a lot of servers and gave us a lot more headroom, even though it means we have to bulk reindex posts more often.

In the end our global query looks something like:

POST global-*/post/_search
{
   "query": {
      "filtered": {
         "query": {
            "multi_match": {
              "fields": ["content","title"],
              "query": "Can I haz query?"
            }
         },
         "filter": {
            "and": [
               {
                  "term": {
                     "lang": "en"
                  }
               },
               {
                  "and": [
                    {
                       "term": {
                          "spam": false
                       }
                    },
                    {
                       "term": {
                          "private": false
                       }
                    },
                    {
                       "term": {
                          "mature": false
                       }
                    }
                  ]
               }
            ]
         }
      }
   }
}

One last side note on ES global queries, because it cannot be said strongly enough: FILTERS ARE EPIC. If you don’t understand why I am shouting this at you, then you need to go read about and understand bitsets and how filters are cached.

Sidenote: after re-reading that post I realized we might be able to improve our performance a bit by switching from the and filters shown above to a bool filter. This is why blogging is good. That change ended up cutting our median query time in half.
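For reference, here is a sketch of what the same filter looks like rewritten as a bool filter (same fields as the query above; our production query has more clauses than this). As the bitsets post explains, a bool filter can combine the cached bitsets of its term filters directly, while and filters evaluate document by document.

POST global-*/post/_search
{
   "query": {
      "filtered": {
         "query": {
            "multi_match": {
              "fields": ["content","title"],
              "query": "Can I haz query?"
            }
         },
         "filter": {
            "bool": {
               "must": [
                  { "term": { "lang": "en" } },
                  { "term": { "spam": false } },
                  { "term": { "private": false } },
                  { "term": { "mature": false } }
               ]
            }
         }
      }
   }
}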

Global Query Performance With Increasing Numbers of Shards

Global queries require gathering results from all shards and then combining them to produce the final result. A search for the top 10 results across 10 shards runs the query on each shard and then merges those 100 results down to the final 10. More shards means more processing across the cluster, but also more results that need to be combined. This gets even more interesting when you start paging: to get results 90-100 of a search across ten shards, each shard has to return its top 100 results, so 1,000 results have to be combined to satisfy the search.

We did some testing of our query performance across a few million blogs as we varied the number of shards using a fixed number of JMeter threads.

Shards/Index | Requests/sec | Median (ms) | Mean (ms)
5            | 320          | 191         | 328
10           | 320          | 216         | 291
25           | 289          | 345         | 332
50           | 183          | 467         | 531

There’s a pretty clear relationship between query latency and number of shards. This result pushed us to try and minimize the total number of shards in our cluster. We can probably also improve query performance by increasing our replication.

The Local Query

Most of our local queries are used for finding related posts. We run a combination of mlt_query and multi_match queries that send the text of the current post and find posts that are similar. For a post with the title “The Best” and content “This is the best post in the world”, the query looks like:

POST global-0m-10m/post/_search?routing=12345
{
 "query": {
   "filtered": {
     "query": {
       "multi_match": {
         "query": "The Best This is the best post in the world.",
         "fields": ["mlt_content"]
       }
     },
     "filter": {
       "and": [
         {
           "term": {
             "blog_id": 12345
           }
         },
         {
           "not": {
             "term": {
               "post_id": 3
             }
           }
         }
      ]
    }
  }
 }
}

It looks simple, but there are a lot of interesting optimizations to discuss.

Routing

All of our local queries use the search routing parameter to limit the search to a single shard. Organizing our indices and shards so that an entire blog is always in a single shard is one of the most important optimizations in our system. Without it we would not be able to scale to millions of queries, because we would waste a lot of cycles searching shards that contain few or no documents related to the search.

Query Type

In the example above we use the multi_match query. If the content were longer (more than 100 words) we would use mlt_query instead. We originally started out with all queries using mlt_query to keep the queries fast and the relevancy good. However, mlt_query does not guarantee that the resulting query will actually contain any terms; whether a term is kept depends on its frequency in the post and its frequency in the index. Switching to multi_match for short content gave us a big improvement in the relevancy of our results (as measured by click-through rate).

MLT API

We started building related posts by using the MLT API to run the query. In that case we would send only the document id to ES and trust ES to fetch the post, analyze it, and construct the appropriate mlt_query for it. This worked pretty well, but did not scale as well as building the query ourselves on the client. Switching to mlt_query gave us a 10x improvement in the number of requests we could handle and reduced the query response time.

Operation | Requests/sec | Median (ms) | Mean (ms)
MLT API   | 150          | 270         | 306
mlt_query | 1200         | 77          | 167

From what I could tell, the big bottleneck was fetching the original document. Of course this change moved a lot of processing off of the ES cluster and onto the web hosts, but web hosts are easier to scale than ES nodes (or at least that is a problem our systems team is very good at).
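To make the difference concrete, here is a hedged sketch of the two approaches using pre-2.0 Elasticsearch syntax (ids, index name, and parameter values are illustrative, not our production settings). The first asks ES to fetch and analyze the document itself; the second builds a more_like_this query on the client from text we already have.

# MLT API: Elasticsearch fetches post 6789, analyzes it, and builds the query itself
GET global-0m-10m/post/6789/_mlt?mlt_fields=mlt_content&min_term_freq=2&min_doc_freq=5

# Client-built query: we already have the post text, so we send it directly
POST global-0m-10m/post/_search?routing=12345
{
  "query": {
    "more_like_this": {
      "fields": ["mlt_content"],
      "like_text": "The Best This is the best post in the world.",
      "min_term_freq": 2,
      "min_doc_freq": 5
    }
  }
}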

mlt_content Field

The query goes to a single field called mlt_content rather than to separate title and content fields. Searching fewer fields gives us a significant performance boost, and it helps us match words that occur in different fields in different posts. The fairly new multi_match cross_fields option could probably help here, but I assume it would not be as performant as a single field.

For a while we were also storing the mlt_content field, but recent work has determined that storing the field did not speed up the mlt_query queries.

The history of how we ended up using mlt_content is also instructive. We started using the mlt_content field, and storing it, while we were still using the MLT API. Originally we used the post title and content fields, which were extracted from the document's _source. Switching to a stored mlt_content field reduced the average time to fetch a document before building the query from about 500 ms to 100 ms. In the end this turned out not to be enough of a performance boost for our application, but it is worth looking into for anyone using the MLT API.

Improving Relevancy with Rescoring

We’ve run a couple of tests to improve the relevancy of our related posts. Our strategy has mostly been to use the mlt_query/multi_match queries as the base query and then use rescoring to adjust the results. For instance, we built a query that slightly reranks our top 50 results based on the overlap between the users who liked the current post and the users who liked each candidate post. Running this with the rescore option had almost no impact on query performance and yet gave a nice bump to the click-through rate of our related posts.
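A hedged sketch of what that kind of rescore looks like (the liker_ids field and the user ids are made up for illustration, and the real query and weights differ):

POST global-0m-10m/post/_search?routing=12345
{
  "query": {
    "more_like_this": {
      "fields": ["mlt_content"],
      "like_text": "The Best This is the best post in the world."
    }
  },
  "rescore": {
    "window_size": 50,
    "query": {
      "rescore_query": {
        "terms": { "liker_ids": [101, 202, 303] }
      },
      "query_weight": 1.0,
      "rescore_query_weight": 0.5
    }
  }
}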

Shard Size and Local Query performance

While global queries perform best with a small number of larger shards, local queries are fastest when the shard is smaller. Here's some data we collected comparing the number of shards per index to routed query speed when running mlt_query searches:

Shards/Index | Requests/sec | Median (ms) | Mean (ms) | Max (ms)
10           | 1200         | 77          | 167       | 5900
30           | 1600         | 67          | 112       | 1100

Having less data in each shard (more total shards) significantly reduces both the number of slow queries and how slow they are.

Final Trade Off of Shard Size and Query Performance

Based on the global and local query results above we decided to go with 25 shards per index as a compromise that gave decent performance for both query types. This was one of the trickier decisions, but it worked reasonably well for us for a while. About six months after making it, though, we decided we were ending up with too many slow queries (queries taking longer than 1 second).

I guess that’s my cue to tease the next post in this (possibly) never-ending series: rebuilding our indices to add 6-7x as many documents while reducing slow queries and speeding up all queries in general. We went from the 99th percentile of our queries taking 1.7 seconds down to 800ms and the median time dropping from 180ms to 50ms. All these improvements while going from 800 million docs in our cluster to 5.5 billion docs. Details coming up in my next post.

Scaling Elasticsearch Part 2: Indexing

In part 1 I gave an overview of our cluster configuration. In this part we’ll dig into:

  • How our data is partitioned into indices to scale over time
  • Optimizing bulk indexing
  • Scaling real time indexing
  • How we manage indexing failures and downtime.

The details of our document mappings are mostly irrelevant for our indexing scaling discussion, so we’ll skip them until part 3.

Data Partitioning

Since WordPress.com data is constantly growing we need an indexing structure that can grow over time as well. A well-known limitation of ES is that once an index is created you cannot change the number of shards. The common solution to this problem is to recognize that searching across an index with 10 shards is identical to searching across 10 indices with 1 shard each, and indices can be created at will.

In our case we create one index for every 10 million blogs, with 25 shards per index. We use index templates so that when our system tries to index into an index that doesn't exist yet, it is created dynamically.
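A hedged sketch of what such a template might look like (the template name and pattern are illustrative; only the shard count comes from the text above):

PUT /_template/global_posts
{
  "template": "global-*",
  "settings": {
    "number_of_shards": 25
  }
}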

There are a few factors that led to our index and sharding sizes:

  1. Uniform shard sizes: Shards should be of similar sizes so that you get mostly uniform response times. Larger shards take longer to query. We tried one index per 1 million blogs and found too much variation. For instance, when we migrated Microsoft's LiveSpaces to WordPress.com we got a million or so blogs created in our DBs in a row, and they have remained pretty active. That variation drove us to put many blogs into each index and rely on the hashing algorithm to spread the blogs across all the shards in the index.
  2. Limit the number of shards per index: New shards are not instantly created when a new index is created; technically, I guess, the cluster state is red for a very short period. At one point we tested 200 shards per index, and we sometimes saw a few document indexing failures in our real time indexing because the primary shards were still being allocated across the cluster. There are probably other ways around this, but it's something to look out for.
  3. Upper limit of shard size: Early on we tried indexing 10 million blogs per index with only 5 shards per index. Our initial testing went well, but then we found that the indices with the larger shards (the older blogs) were experiencing much longer query latencies. This was us starting to hit the upper limits of our shard sizes. The upper limit on shard size varies by what kind of data you have and is difficult to predict so it’s not surprising that we hit it.
  4. Minimize total number of shards: We’ll discuss this further in our next post on global queries, but as the number of shards increases the efficiency of the search decreases, so reducing the number of shards helps make global queries faster.

Like all fun engineering problems, there is no easy or obvious answer and the solution comes by guessing, testing, and eventually deciding that things are good enough. In our case we figured out that our maximum shard size was around 30 GB. We then created shards that were fairly large but which we don’t think would be able to grow to that maximum for many years.

As I’m writing this, and after a few months in production, we’re actually wondering if our shards are still too large. We didn’t take into account that deleted documents would also negatively affect shard size, and every time we reindex or update a document we effectively delete the old version. Investigation into this is still ongoing, so I’m not going to try to go into the details. The number of deleted documents in your shards is related to how much real time indexing you are doing, and the merge policy settings.

Bulk Indexing Practicalities

Bulk indexing speed is a major limit in how quickly we can iterate during development, and indexing will probably be one of our limiting factors in launching new features in the future. Faster bulk indexing means faster iteration time, more testing of different shard/index configurations, and more testing of query scaling.

For these reasons we ended up putting a lot of effort into speeding up our bulk indexing. We went from bulk indexing taking about two months (estimated; we never actually ran it that way) to taking less than a week. Improving bulk indexing speed is very iterative and application specific. There are some obvious things to pay attention to, like using the bulk indexing API (sketched after the list below) and testing different numbers of docs in each bulk API request. There were also a few things that surprised me:

  • Infrastructure Load: ES Indexing puts a heavy load on certain parts of the WordPress.com infrastructure because it pulls data from so many places. In the end, our indexing bottleneck is not ES itself, but actually other pieces of our infrastructure. I suppose we could throw more infrastructure at the problem, but that’s a trade off between how often you are going to bulk reindex vs how much that infrastructure will cost.
  • Extreme Corner Cases: For instance en.blog has millions of followers, likers, and lots of commenters. Building a list of these (and keeping it up to date with real time indexing) can be very costly – like, “oh $%#@ are we really trying to index a list of 2.3 million ids” costly (and then update it every few seconds).
  • Selective Bulk Indexing: Adding fields to the index requires updating the mappings and bulk reindexing all of our data. Having a way to selectively bulk index (find all blogs of a particular type) can speed up bulk indexing a lot.
  • Cluster Restarts: After bulk indexing we need to do a full rolling restart of the cluster.
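For reference, the bulk API batches many operations into a single newline-delimited request; a hedged sketch (ids, index name, and field values are illustrative):

POST /_bulk
{ "index": { "_index": "global-0m-10m", "_type": "post", "_id": "6789", "_parent": "12345" } }
{ "blog_id": 12345, "post_id": 6789, "title": "Hello world", "content": "First post!" }
{ "index": { "_index": "global-0m-10m", "_type": "post", "_id": "6790", "_parent": "12345" } }
{ "blog_id": 12345, "post_id": 6790, "title": "Second post", "content": "More words." }

The main tuning knob is how many operations go into each request, which we settled on by testing rather than by formula.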

I wish we had spent more time finding and implementing bulk indexing optimizations earlier in the project.

Scaling Real Time Indexing

During normal operation, our rate of indexing (20m+ document changes a day) has never really been a problem for our Elasticsearch cluster. Our real time indexing problems have mostly stemmed from combining so many pieces of information into each document that gathering the data can be a high load on our database tables.

Creating the correct triggers to detect when a user has changed a field is often non-trivial to implement in a way that doesn't over-index documents. There are posts that get new comments or likes every second; rebuilding the entire document in those cases is a complete waste of time. We make heavy use of the Update API, mostly to reduce the load of recreating ES documents from scratch.
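A hedged sketch of the kind of partial update this enables (ids and the field name are illustrative): instead of rebuilding the whole document when a post gets another comment, we send just the changed field.

POST global-0m-10m/post/6789/_update?routing=12345
{
  "doc": {
    "comment_count": 42
  }
}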

The other times when real time indexing became a problem was when something went wrong with the cluster. A couple of examples:

  • Occasionally when shards are getting relocated or initialized on a number of nodes the network can get swamped which backs up the indexing jobs or can cause a high proportion of them to start failing.
  • Query load can become too high if a non-performant query is released into production.
  • We make mistakes (shocking!) and break a few servers.
  • Occasionally we find ES bugs in production. Particularly these have been around deleteByQuery, probably because we run a lot of them.
  • A router or server fails.

Real time indexing couples other portions of our infrastructure to our indexing. If indexing gets slowed down for some reason we can create a heavy load on our DBs, or on our jobs system that runs the ES indexing (and a lot of other, more important things).

In my opinion, scaling real time indexing comes down to two pieces:

  1. How do we manage downtime and performance problems in Elasticsearch and decouple it from our other systems?
  2. When indexing fails (which it will), how do we recover and avoid bulk reindexing the whole data set?

Managing Downtime

We mentioned in the first post of this series that we mark ES nodes as down if we receive particular errors from them (such as a connection error). Naturally, if a node is down, then the system has less capacity for handling indexing operations. The more nodes that are down the less indexing we can handle.

We implemented some simple heuristics to reduce the indexing load when we detect that a server has been down for more than a few minutes. Once triggered, we queue certain indexing jobs for later processing by just storing the blog IDs in a DB table. The longer any nodes are down, the fewer types of jobs we allow. As soon as any problems are found we disable bulk indexing of entire blogs. If problems persist for more than 5 minutes we start to disable reindexing of entire posts, and eventually we also turn off any updating or deleting of documents.
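A rough sketch of the shape of that logic (the function and job-type names are hypothetical, not our actual jobs system):

<?php
// Hypothetical sketch: decide which indexing job types are still allowed
// based on how long any ES node has been marked down.
function allowed_indexing_jobs( $minutes_down ) {
	if ( $minutes_down <= 0 ) {
		// Cluster healthy: everything runs.
		return array( 'bulk_index_blog', 'reindex_post', 'update_doc', 'delete_doc' );
	}
	if ( $minutes_down <= 5 ) {
		// Any problem: stop bulk indexing whole blogs, queue them instead.
		return array( 'reindex_post', 'update_doc', 'delete_doc' );
	}
	// Persistent problems: only lightweight operations, and eventually none.
	return array( 'update_doc', 'delete_doc' );
}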

Before implementing this indexing delay mechanism we had some cases where the real time indexing overwhelmed our system. Since implementing it we haven’t seen any, and we actually smoothly weathered a failure of one of the ES network routers while maintaining our full query load.

We of course also have some switches we can throw if we want to completely disable indexing and push all blog ids that need to be reindexed into our indexing queue.

Managing Indexing Failures

Managing failures means you need to define your goals for indexing:

  • Eventually ES will have the same data as the canonical data.
  • Minimize needing to bulk re-index everything.
  • Under normal operation the index is up to date within a minute.

There are a couple of different ways our indexing can fail:

  1. An individual indexing job crashes (eg an out of memory error when we try to index 2.3 million ids 🙂 ).
  2. An individual indexing job gets an error trying to index to ES.
  3. Our heuristics delay indexing by adding jobs to a queue.
  4. We introduce a bug to our indexing that affects a small subset of posts.
  5. We turn off real time indexing.

We have a few mechanisms that deal with these different problems.

  • Problems 1, 3, and 5: The previously mentioned indexing queue. This lets us pick up the pieces after any failure and prevent bulk reindexing everything.
  • Problem 2: When indexing jobs fail they are retried using an exponential back off mechanism (5 min, 10 min, 20 min, 40 min, …).
  • Problem 4: We run scrolling queries against the index to find the set of blogs that would have been affected by a bug, and bulk index only those blogs.

It’s All About the Failures and Iteration

Looking back on what I wish we had done better, the best advice I could have gotten would have been to recognize that indexing is all about handling the error conditions. Getting systems for handling failures and for faster bulk indexing in place sooner would have helped a lot.

In the next part of the series I’ll talk about query performance and balancing global and local queries.

Scaling Elasticsearch Part 1: Overview

We recently launched Related Posts across WordPress.com, so it's time to pop the hood and take a look at what ended up in our engine.

There's a lot of good information spread across the web on how to use Elasticsearch, but I haven't seen many detailed discussions of what it looks like to scale an ES cluster for a large application. Being an open source company means we get to talk about these details. Keep in mind, though, that scaling is very dependent on the end application, so not all of our methods will be applicable for you.

I'm going to spread our scaling story out over four posts, starting with this overview.

Scale

WordPress.com is in the top 20 sites on the Internet. We host everything from very big clients all the way down to the long tail where this blog resides. The primary project goal was to provide related posts on all of those page views (14 billion a month and growing). We also needed to replace the aging search engine that was running search.wordpress.com for global searches, and we have plans for many more applications in the future.

I’m only going to focus on the related posts queries (searches within a single blog) and the global queries (searches across all blogs). They illustrate some really nice features of ES.

Currently every day we average:

  • 23m queries for related posts within a single shard
  • 2m global queries across all shards
  • 13m docs indexed
  • 10m docs updated
  • 2.5m docs deleted
  • 250k delete by query operations

Our index has about 600 million documents in it, with 1.5 million added every day. Including replication there is about 9 TB of data.

System Overview

We mostly rely on the default ES settings, but there are a few key places where we override them. I’m going to leave the details of how we partition data to the post on indexing.

  • The system is running 0.90.8.
  • We have a single cluster spread across 3 data centers. There are risks with doing this: you need low-latency links and somewhat longer timeouts to prevent communication problems within the cluster. Even with very good links between DCs, local nodes are still 10x faster to reach than remote nodes:
discovery.zen.fd.ping_interval: 15s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5

This also helps prevent nodes from being booted if they experience long GC cycles.

  • 10 data nodes and one dedicated master node in each DC (30 data nodes total)
  • Currently 175 shards with 3 copies of each shard.
  • Disable multicast, and only list the master nodes in the unicast list. Don’t waste time pinging servers that can’t be masters.
  • Dedicated hardware (96GB RAM, SSDs, fast CPUs) – CPUs definitely seem to be the bottleneck that drives us to add more nodes.
  • ES is configured to use 30GB of the RAM with indices.fielddata.cache.size set to 20GB. This is still not ideal, but better than our previous setting of index.fielddata.cache: soft. 1.0 is going to have an interesting new feature that applies a “circuit breaker” to try and detect and squash out of memory problems. Elasticsearch has come a long way with preventing OutOfMemory exceptions, but I still have painful memories from when we were running into them fairly often in 0.18 and 0.19.
  • Use shard allocation awareness to spread replicas across data centers and within data centers.
cluster.routing.allocation.awareness.attributes: dc, parity

dc is the name of the data center; parity we set to 0 or 1 based on the ES host number. Even hosts are on one router and odd hosts on another.

  • We use fixed-size thread pools: very deep for indexing because we don't want to lose index operations, and much shorter for search and get, since if an operation has been queued that deeply the client will probably have timed out by the time it gets a response.
threadpool:
  index:
    type: fixed
    size: 30
    queue_size: 1000
  bulk:
    type: fixed
    size: 30
    queue_size: 1000
  search:
    type: fixed
    size: 100
    queue_size: 200
  get:
    type: fixed
    size: 100
    queue_size: 200
  • Elastica PHP client. I'll add the disclaimer that I think this client is slightly over-object-oriented, which makes it pretty big. We use it mostly because it has great error handling, and we mostly just set raw queries rather than building a sequence of objects. This limits how much of the code we have to load, because we use some PHP autoloading magic:
//Autoloader for Elastica interface client to ElasticSearch
function elastica_autoload( $class_name ) {
	// Only handle classes in the Elastica namespace
	if ( 'Elastica' === substr( $class_name, 0, strlen( 'Elastica' ) ) ) {
		// Convert the namespaced class name into a file path and load just that file
		$path = str_replace( '\\', '/', $class_name );
		include_once dirname( __FILE__ ) . '/Elastica/lib/' . $path . '.php';
	}
}
spl_autoload_register( 'elastica_autoload' );
  • Custom round robin ES node selection built by extending the Elastica client. We track stats on errors, number of operations, and mark nodes as down for a few minutes if certain errors occur (eg connection errors).
  • Some customizations for node recovery. See my post on speeding up restart time.

Well, that’s not so hard…

If you’ve worked with Elasticsearch a bit then most of those settings probably seem fairly expected.

Elasticsearch scaling is deceptively easy when you first start out. Before building this cluster we ran a two-node cluster for over a year, indexing a few hundred very active sites. We threw more and more data and more and more queries at that small cluster, and it hardly blinked. It has about 50 million documents on it now, and gets 1-2 million queries a day.

It is still pretty shocking to me how much harder it was to scale a cluster 10-100x larger. There is a substantial and humbling difference between running a small cluster with tens of GB of data and a cluster with 9 TB.

In the next part in this series I’ll talk about how we overcame the indexing problems we ran into: how our index structure scales over time, our optimizations for real time indexing, and how we handle indexing failures.

Managing Elasticsearch Cluster Restart Time

While building a fairly large index (8TB total for 500 million docs), I ran into some very long restart times for the cluster. That prompted me to start a discussion about long restart times. There’s some good discussion in that thread, and I wanted to write a post to summarize what we are doing to deal with long restart times.

By “long restart times” I don’t mean that Elasticsearch didn’t start up quickly, but rather that it spent a very long time recovering shards. In my logs I would see messages such as:

recovered_files [399] with total_size of [42.2gb], took [12.4m], throttling_wait [0s]
         : reusing_files   [0] with total_size of [0b]

All of the data for a 42 GB shard was being recovered from one of the peer nodes rather than from the local disk.

In that ES user group thread, Zachary Tong has a good example and description of why Elasticsearch nodes can have such long restart times. The key point is:

The segment creation and merging process is not deterministic between nodes.

This means that as indexing occurs, the segments for the same shard on different nodes will necessarily diverge from each other. When shard recovery occurs (such as during a restart), any segments that differ have to be copied over the network rather than recovered from the local disk.

We’ve put in place a couple of practices to try and minimize how much impact these slow restarts have on us.

First, some background on our system:

  • 500 million documents in 175 shards (about 8TB including replication)
  • 1.5 million new docs a day.
  • Heavy reindexing/updating of newer documents running 24/7, but 99% of older documents never change.
  • Bulk indexing is only ever performed when adding new fields to the index or adding new features.
  • With our recent launch of Related Posts we peak at about a million queries an hour. (More on scaling in a future post.)

Our current methods of minimizing cluster restart times:

1. After bulk indexing we perform the following:
  • Optimize all indices and set the max segments to 5.
  • Perform a rolling restart of the cluster (last one took 38 hours to complete)

By optimizing the index down to a smaller number of segments we significantly decrease query time for the older documents. Also, since most of our data never changes, this ensures that most of it sits in large segments that should stay in sync across the cluster; they are less likely to be merged because they are already big.
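The optimize call itself is simple; this sketch runs it against all indices, as described above:

curl -XPOST 'localhost:9200/_optimize?max_num_segments=5'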

The rolling restart of the cluster after bulk indexing ensures that all nodes have identical segments. By incurring the cost of restarting just after bulk indexing, we ensure that if a real issue comes up later that requires restarting then we will be able to restart more quickly.

Our current rolling restart only restarts a single node at a time. Because we are using shard allocation awareness we could increase the number of nodes we restart at once if we want to reduce the total time to restart the cluster, but that would also reduce our capacity for servicing incoming queries.

2. When doing a rolling restart, disable allocation
curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation": true}}'

This ensures that shards aren't thrashed around the cluster as nodes are restarted. Otherwise, when we shut down a node the cluster tries to reallocate that node's shards onto other nodes. Once the node is back up, you set this value back to false.
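Re-enabling allocation once the node has rejoined is the same call with the value flipped:

curl -XPUT localhost:9200/_cluster/settings -d '{"transient":{"cluster.routing.allocation.disable_allocation": false}}'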

3. Use the noop Linux I/O scheduler on SSDs rather than CFQ for a significant speedup

When we tested this change we saw node restart time (from shutting the node down to all shards being recovered) drop from an average of 150 seconds/node to 96 seconds/node. This was for the case where there was very little difference between the shards on different nodes. When you are doing a rolling restart of 30 nodes, that's a really big difference. Props to GitHub for investigating the performance impact of the scheduler.
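The scheduler can be switched at runtime; for example, for a drive that shows up as sda (the device name is an assumption, and you'd also want to persist the setting across reboots):

echo noop > /sys/block/sda/queue/scheduler
cat /sys/block/sda/queue/scheduler   # the active scheduler is shown in brackets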

4. Increase the default recovery limits
cluster.routing.allocation.node_initial_primaries_recoveries: 4
cluster.routing.allocation.node_concurrent_recoveries: 15
indices.recovery.max_bytes_per_sec: 100mb
indices.recovery.concurrent_streams: 5

We've tried increasing max_bytes_per_sec above 100mb, but that runs into cases where the recovery traffic starts interfering with the query traffic. You will of course get different results depending on what hardware you are using. In general the ES defaults are tuned for Amazon EC2, so you can raise the limits a lot if you run your own hardware.

5. Periodic rolling restarts?

One thing I am considering is periodically doing a rolling restart of the cluster, every few months or so. The only real reason to do this is that it would help me recover faster if I really had to restart due to some cluster or hardware failure. Though with the rate at which new ES releases come out we'll probably have a reason to perform such a restart periodically anyway, not to mention the possibility of bulk reindexing in order to add new features.

I am curious how our restart time will change over time. I would theorize that since most of our data doesn’t change, that data will slowly get accumulated in the older, larger segments while the newer posts will be in the smaller, newer segments. For the most part it will be these newer segments that need to get recovered from the primary shard.

Introducing Related Posts

Pretty big deal to finally get related posts across WordPress.com launched. This is running 1 million Elasticsearch queries an hour against 500 million documents. More details to come…

WordPress.com News

Do you ever wonder what happens when your readers reach the end of your posts? What do they click on? Where do they go next? What if you’ve piqued a reader’s interest and left them wanting more, but don’t give them the option to do so?

Today, we’re so happy to announce Related Posts on WordPress.com — one of the most requested features from users. Now, we’ll search your site for similar posts you’ve written and display a “Related” section at the end of every post, like this:

[Image: Related Posts section]

If you post a lot of images, you can jazz this section up with post thumbnails as well. To activate image thumbnails, head to Settings → Reading and scroll down to the “Related Posts” section.

[Image: Thumbnails option enabled]

It’s mobile ready of course:

[Image: Related Posts, mobile view]

We’re excited about this feature — it allows your readers to dive…


Three Principles for Multilingual Indexing in Elasticsearch

Recently I’ve been working on how to build Elasticsearch indices for WordPress blogs in a way that will work across multiple languages. Elasticsearch has a lot of built in support for different languages, but there are a number of configuration options to wade through and there are a few plugins that improve on the built in support.

Below I'll lay out the analyzers I am currently using. Some caveats before I start: I've done a lot of reading on multi-lingual search, but since I'm really only fluent in one language there are lots of details about how fluent speakers of other languages use a search engine that I'm sure I don't understand. This is almost certainly still a work in progress.

In total we have 30 analyzers configured and we’re using the elasticsearch-langdetect plugin to detect 53 languages. For WordPress blogs, users have sometimes set their language to the same language as their content, but very often they have left it as the default of English. So we rely heavily on the language detection plugin to determine which language analyzer to use.

Update: In comments, Michael pointed out that since this post was written the langdetect plugin now has a custom mapping that the mapping example below is not using. I’d highly recommend checking it out for any new implementations.

For configuring the analyzers there are three main principles I’ve pulled from a number of different sources.

1) Use very light or minimal stemming to avoid losing semantic information.

Stemming removes the endings of words to make searches more general; however, it can lose a lot of meaning in the process. For instance, the (quite popular) Snowball stemmer will do the following:

computation -> comput
computers -> comput
computing -> comput
computer -> comput
computes -> comput

international -> intern
internationals -> intern
intern -> intern
interns -> intern

A lot of information is lost in doing such a zealous transformation. There are some cases, though, where stemming is very helpful. In English, stemming off the plurals of words should rarely be a problem, since the plural still refers to the same concept. This article on SearchWorkings gives further discussion of the pitfalls of the Snowball stemmer, and leads to Jacques Savoy's excellent paper on stemming and stop words as applied to French, Italian, German, and Spanish. Savoy found that minimal stemming of plurals and feminine/masculine forms of words performed well for these languages. The minimal_* and light_* stemmers included in Elasticsearch implement these recommendations, allowing us to take a limited stemming approach.

So when there is a minimal stemmer available for a language we use it, otherwise we do not do any stemming at all.

2) Use stop words for those languages that we have them for.

This ensures that we reduce the size of the index and speed up searches by not trying to match on very frequent terms that provide very little information. Unfortunately, stop words will break certain searches. For instance, searching for “to be or not to be” will not get any results.

The cutoff_frequency parameter on the match query (new in 0.90) may provide a way to index stop words, but I am still unsure whether it has implications for other query types, or how I would choose a cutoff frequency given the wide range of documents and languages in a single index. The very high number of English documents compared to, say, Hebrew also means that Hebrew stop words may not be frequent enough to trigger the cutoff correctly when searching across all documents.
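As a hedged illustration of the idea (the field name and cutoff value are arbitrary), cutoff_frequency splits the query terms into low- and high-frequency groups, and the high-frequency terms are only used to score documents that already match the low-frequency ones, so a query like this can keep the famous stop words without paying to score every document that merely contains "to" or "be":

POST global-*/post/_search
{
  "query": {
    "match": {
      "content": {
        "query": "hamlet to be or not to be",
        "cutoff_frequency": 0.01
      }
    }
  }
}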

For the moment I’m sticking with the stop words approach. Weaning myself off of them will require a bit more experimentation and thought, but I am intrigued by finding an approach that would allow us to avoid the limitations of stop words and enable finding every blog post referencing Shakespeare’s most famous quote.

3) Try and retain term consistency across all analyzers.

We use the ICU Tokenizer for all cases where the language won’t do significantly better with a custom tokenizer. Japanese, Chinese, and Korean all require smarter tokenization, but using the ICU Tokenizer ensures we treat other languages in a consistent manner. Individual terms are then filtered using the ICU Folding and Normalization filters to ensure consistent terms.

Folding converts a character to an equivalent standard form. The most common conversion ICU folding provides is lowercasing, as defined in this exhaustive definition of case folding. But folding goes far beyond lowercasing; in many languages there are multiple characters that essentially mean the same thing (particularly from a search perspective). UTR30-4 defines the full set of foldings that ICU folding performs.

Where folding converts a single character to a standard form, normalization converts a sequence of characters to a standard form. A good example of this, straight from Wikipedia: “the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet).” Another entertaining example of normalization is that some Roman numerals (Ⅸ) can be expressed as a single Unicode character, though for search you'd rather have that converted to “IX”. The ICU normalization sections have links to the many docs defining how normalization is handled.

By indexing using these ICU tools we can be fairly sure that searching across all documents, regardless of language, with just a default analyzer will give results for most queries.
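You can see these filters in action with the _analyze API. A hedged example against the default analyzer defined below (the index name is illustrative, and the exact token output may vary by ICU version):

curl -XGET 'localhost:9200/global-0m-10m/_analyze?analyzer=default&pretty' -d 'Señor Ⅸ'
# expect folded, normalized tokens along the lines of "senor" and "ix"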

The Details (there are always exceptions to the rules)

  • Asian languages that do not use whitespace for word separations present a non-trivial problem when indexing content. ES comes with the built in CJK analyzer that indexes every pair of symbols into a term, but there are plugins that are much smarter about how to tokenize the text.
    • For Japanese (ja) we are using the Kuromoji plugin built on top of the seemingly excellent library by Atilika. I don’t know any Japanese, so really I am probably just impressed by their level of documentation, slick website, and the fact that they have an online tokenizer for testing tokenization.
    • There are a couple of different versions of written Chinese (zh), and the language detection plugin distinguishes between zh-tw and zh-cn. For analysis we use the ES Smart Chinese Analyzer for all versions of the language. This is done out of necessity rather than any analysis on my part. The ES plugin wraps the Lucene analyzer which performs sentence and then word segmentation using a Hidden Markov Model.
    • Unfortunately there is currently no custom Korean analyzer for Elasticsearch that I have come across. For that reason we are only using the CJK Analyzer which takes each bi-gram of symbols as a term. However, while writing this post I came across a Lucene mailing list thread from a few days ago which says that a Korean analyzer is in the process of being ported into Lucene. So I have no doubt that will eventually end up in ES or as an ES plugin.
  • Elasticsearch doesn’t have any built in stop words for Hebrew (he) so we define a custom list pulled from an online list (Update: this site doesn’t exist anymore, our list of stopwords is located here). I had some co-workers cull the list a bit to remove a few of the terms that they deemed a bit redundant. I’ll probably end up doing this for some other languages as well if we stick with the stop words approach.
  • Testing 30 analyzers was pretty non-trivial. The ES Inquisitor plugin’s Analyzers tab was incredibly useful for interactively testing text tokenization and stemming against all the different language analyzers to see how they functioned differently.

Finally we come to defining all of these analyzers. Hope this helps you in your multi-lingual endeavors.

Update [Feb 2014]: The PHP code we use for generating analyzers is now open sourced as a part of the wpes-lib project. See that code for the latest methods we are using.

Update [May 2014]: Based on the feedback in the comments and some issues we’ve come across running in production I’ve updated the mappings below. The changes we made are:

  • Perform ICU normalization before removing stopwords, and ICU folding after stopwords. Otherwise stopwords such as “même” in French will not be correctly removed.
  • Adjusted our Japanese language analysis based on a slightly adjusted use of GMO Media’s methodology. We were seeing a significantly lower click through rate on Japanese related posts than for other languages, and there was pretty good evidence that the morphological language analysis would help.
  • Added the Elision Token filter to French. “l’avion” => “avion”

Potential improvements I haven’t gotten a chance to test yet because we need to run real performance tests to be sure they will actually be an improvement:

  • Duplicate tokens to handle different spellings (eg “recognize” vs “recognise”).
  • Morphological analysis of en and ru
  • Should we run spell checking or phonetic analysis
  • Include all stopwords and rely on cutoff_frequency to avoid the performance problems this will introduce
  • Index bigrams with the shingle analyzer
  • Duplicate terms, stem them, then unique the terms to try and index both stemmed and non-stemmed terms

Thanks to everyone in the comments who have helped make our multi-lingual indexing better.

{
  "filter": {
    "ar_stop_filter": {
      "type": "stop",
      "stopwords": ["_arabic_"]
    },
    "bg_stop_filter": {
      "type": "stop",
      "stopwords": ["_bulgarian_"]
    },
    "ca_stop_filter": {
      "type": "stop",
      "stopwords": ["_catalan_"]
    },
    "cs_stop_filter": {
      "type": "stop",
      "stopwords": ["_czech_"]
    },
    "da_stop_filter": {
      "type": "stop",
      "stopwords": ["_danish_"]
    },
    "de_stop_filter": {
      "type": "stop",
      "stopwords": ["_german_"]
    },
    "de_stem_filter": {
      "type": "stemmer",
      "name": "minimal_german"
    },
    "el_stop_filter": {
      "type": "stop",
      "stopwords": ["_greek_"]
    },
    "en_stop_filter": {
      "type": "stop",
      "stopwords": ["_english_"]
    },
    "en_stem_filter": {
      "type": "stemmer",
      "name": "minimal_english"
    },
    "es_stop_filter": {
      "type": "stop",
      "stopwords": ["_spanish_"]
    },
    "es_stem_filter": {
      "type": "stemmer",
      "name": "light_spanish"
    },
    "eu_stop_filter": {
      "type": "stop",
      "stopwords": ["_basque_"]
    },
    "fa_stop_filter": {
      "type": "stop",
      "stopwords": ["_persian_"]
    },
    "fi_stop_filter": {
      "type": "stop",
      "stopwords": ["_finnish_"]
    },
    "fi_stem_filter": {
      "type": "stemmer",
      "name": "light_finish"
    },
    "fr_stop_filter": {
      "type": "stop",
      "stopwords": ["_french_"]
    },
    "fr_stem_filter": {
      "type": "stemmer",
      "name": "minimal_french"
    },
    "he_stop_filter": {
      "type": "stop",
      "stopwords": [/*excluded for brevity*/]
    },
    "hi_stop_filter": {
      "type": "stop",
      "stopwords": ["_hindi_"]
    },
    "hu_stop_filter": {
      "type": "stop",
      "stopwords": ["_hungarian_"]
    },
    "hu_stem_filter": {
      "type": "stemmer",
      "name": "light_hungarian"
    },
    "hy_stop_filter": {
      "type": "stop",
      "stopwords": ["_armenian_"]
    },
    "id_stop_filter": {
      "type": "stop",
      "stopwords": ["_indonesian_"]
    },
    "it_stop_filter": {
      "type": "stop",
      "stopwords": ["_italian_"]
    },
    "it_stem_filter": {
      "type": "stemmer",
      "name": "light_italian"
    },
    "ja_pos_filter": {
      "type": "kuromoji_part_of_speech",
      "stoptags": ["\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c", "\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"]
    },
    "nl_stop_filter": {
      "type": "stop",
      "stopwords": ["_dutch_"]
    },
    "no_stop_filter": {
      "type": "stop",
      "stopwords": ["_norwegian_"]
    },
    "pt_stop_filter": {
      "type": "stop",
      "stopwords": ["_portuguese_"]
    },
    "pt_stem_filter": {
      "type": "stemmer",
      "name": "minimal_portuguese"
    },
    "ro_stop_filter": {
      "type": "stop",
      "stopwords": ["_romanian_"]
    },
    "ru_stop_filter": {
      "type": "stop",
      "stopwords": ["_russian_"]
    },
    "ru_stem_filter": {
      "type": "stemmer",
      "name": "light_russian"
    },
    "sv_stop_filter": {
      "type": "stop",
      "stopwords": ["_swedish_"]
    },
    "sv_stem_filter": {
      "type": "stemmer",
      "name": "light_swedish"
    },
    "tr_stop_filter": {
      "type": "stop",
      "stopwords": ["_turkish_"]
    }
  },
  "analyzer": {
    "ar_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ar_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "bg_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "bg_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ca_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ca_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "cs_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "cs_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "da_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "da_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "de_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "de_stop_filter", "de_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "el_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "el_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "en_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "es_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "es_stop_filter", "es_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "eu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "eu_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fa_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fa_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fi_stop_filter", "fi_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "elision", "fr_stop_filter", "fr_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "he_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "he_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hi_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hu_stop_filter", "hu_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hy_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hy_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "id_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "id_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "it_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "it_stop_filter", "it_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ja_analyzer": {
      "type": "custom",
      "filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
      "tokenizer": "kuromoji_tokenizer"
    },
    "ko_analyzer": {
      "type": "cjk",
      "filter": []
    },
    "nl_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "nl_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "no_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "no_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "pt_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "pt_stop_filter", "pt_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ro_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ro_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ru_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ru_stop_filter", "ru_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "sv_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "sv_stop_filter", "sv_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "tr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "tr_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "zh_analyzer": {
      "type": "custom",
      "filter": ["smartcn_word", "icu_normalizer", "icu_folding"],
      "tokenizer": "smartcn_sentence"
    },
    "lowercase_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "keyword"
    },
    "default": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    }
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}

 

Mapping WordPress Posts to Elasticsearch

I thought I'd share the Elasticsearch type mapping I am using for WordPress posts. We've refined it over a number of iterations, and it combines dynamic templates and multi_field mappings along with a number of more standard mappings, so it is probably a good general example of how to index real data from a traditional SQL database into Elasticsearch.

If you aren't familiar with the WordPress database schema, the tables that matter here are wp_posts, wp_term_relationships, wp_term_taxonomy, and wp_terms; these Elasticsearch mappings focus on them.

To simplify things I’ll just index using an English analyzer and leave discussing multi-lingual analyzers to a different post.

"analysis": {
    "filter": {
        "stop_filter": {
            "type": "stop",
            "stopwords": ["_english_"]
        },
        "stemmer_filter": {
            "type": "stemmer",
            "name": "minimal_english"
        }
    },
    "analyzer": {
        "wp_analyzer": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": ["lowercase", "stop_filter", "stemmer_filter"],
            "char_filter": ["html_strip"]
        },
        "wp_raw_lowercase_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["lowercase"]
        }
    }
}

A few notes on the analyzers:

  • The minimal_english stemmer only removes plurals rather than potentially butchering the difference between words like “computer”, “computes”, and “computing”.
  • The lowercase keyword analyzer makes case-insensitive exact matching possible.
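As a quick hedged illustration of why the multi_field setup below is useful (the field names anticipate the mapping that follows, and the index name and values are made up), the same data can be matched loosely, exactly, or exactly-but-case-insensitively:

# analyzed match on the author's display name
POST global-0m-10m/post/_search
{ "query": { "match": { "author": "greg" } } }

# exact, case-sensitive match on the unanalyzed version
POST global-0m-10m/post/_search
{ "query": { "filtered": { "query": { "match_all": {} }, "filter": { "term": { "author.raw": "Greg Brown" } } } } }

# exact but case-insensitive match on a taxonomy name via the raw_lc sub-field
POST global-0m-10m/post/_search
{ "query": { "filtered": { "query": { "match_all": {} }, "filter": { "term": { "taxonomy.category.name.raw_lc": "recipes" } } } } }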

Let’s take a look at the post mapping:

"post": {
    "dynamic_templates": [
        {
            "tax_template_name": {
                "path_match": "taxonomy.*.name",
                "mapping": {
                    "type": "multi_field",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer"
                        }
                    }
                }
            }
        }, {
            "tax_template_slug": {
                "path_match": "taxonomy.*.slug",
                "mapping": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }, {
            "tax_template_term_id": {
                "path_match": "taxonomy.*.term_id",
                "mapping": {
                    "type": "long"
                }
            }
        }
    ],
    "_all": {
        "enabled": false
    },
    "properties": {
        "post_id": {
            "type": "long"
        },
        "blog_id": {
            "type": "long"
        },
        "site_id": {
            "type": "long"
        },
        "post_type": {
            "type": "string",
            "index": "not_analyzed"
        },
        "lang": {
            "type": "string",
            "index": "not_analyzed"
        },
        "url": {
            "type": "string",
            "index": "not_analyzed"
        },
        "location": {
            "type": "geo_point",
            "lat_lon": true
        },
        "date": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "date_gmt": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "author": {
            "type": "multi_field",
            "fields": {
                "author": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "wp_analyzer"
                },
                "raw": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        },
        "author_login": {
            "type": "string",
            "index": "not_analyzed"
        },
        "title": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "content": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "tag": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "tag"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "tag.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "tag.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        },
        "category": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "category"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "category.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "category.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        },
    }
}

Most of the fields are pretty self-explanatory, so I’ll just outline the more complex ones:

  • date and date_gmt: We define the allowed formats because we are taking the dates out of MySQL. We also do some checking of the dates, since MySQL will allow some things in a DATETIME field that ES will balk at, causing the indexing operation to fail. For instance, MySQL accepts leap days in non-leap years.
  • content: Content gets stripped of HTML and shortcodes, then converted to UTF-8 in cases where it isn’t already.
  • author and author.raw: The author field corresponds to the user’s display_name. Clearly we need to analyze the field so “Greg Ichneumon Brown” can be matched on a search for “Greg”, but what about when we facet on the field? If we use the analyzed field then the results would have the terms “greg”, “ichneumon”, and “brown”. Instead, by using ES’s multi_field mapping feature to auto-generate author.raw, the faceted results on that field will give us “Greg Ichneumon Brown”. (There’s an example facet query after these notes.)
  • tag and category: Tags and Categories similarly need raw versions for faceting so we preserve the original term. Additionally, there are a number of ways users can filter the content. WordPress builds slugs from each category/tag to uniquely identify them in a human-readable way, and there is a unique integer (term_id) associated with each term. The tag.raw_lc field is used for exact matching a term without worrying about case. This may seem like a lot of duplication, but the overriding goal here is to avoid using MySQL for search, so we index everything. Extracting data into multiple fields ensures that we will have flexibility when filtering the data in the future.
  • taxonomy.*: WordPress allows custom taxonomies (of which categories and tags are two built-in taxonomies), so we need a way to create a custom path in each document that allows access to each taxonomy. This is where Elasticsearch’s dynamic templates shine. For a custom taxonomy such as “company” the paths become taxonomy.company.name, taxonomy.company.name.raw, taxonomy.company.slug, and taxonomy.company.term_id.
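
To make the raw fields concrete, here is roughly what a facet query for the most common authors and tags looks like. This is only a sketch against the pre-1.0 facets API; the facet names and sizes are made up.

{
    "query": { "match_all": {} },
    "size": 0,
    "facets": {
        "top_authors": { "terms": { "field": "author.raw", "size": 10 } },
        "top_tags":    { "terms": { "field": "tag.raw",    "size": 10 } }
    }
}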

The ES documentation is very complete, but it’s not always easy to see how to fit the individual pieces together into a complex mapping. I hope this helps in your own ES development efforts.

Building Word Clouds with Faceted Search

Elasticsearch’s faceted results are a great way to analyze the contents of a set of documents. For over a year now, Polldaddy has used Elasticsearch to create reports for the most popular answers and words given to free text survey responses. For more details take a look at the feature announcement.

However, running faceted search on such a wide array of user data can be difficult. Faceted search in Elasticsearch can consume a lot of memory, which leads to the suggestion in the ES documentation to “make sure the number of unique tokens a field can have is not large”. To make sure that we can accept arbitrary user input, we use a couple of tricks.

First let’s take a look at the mapping we use for documents in the polldaddy-survey index:

{
  "polldaddy-survey" : {
    "freetext" : {
      "_routing" : {
        "required" : true,
        "path" : "survey_id"
      },
      "properties" : {
        "resp_id" : {
          "type" : "long",
        },
        "survey_id" : {
          "type" : "long",
        },
        "text" : {
          "type" : "string",
          "analyzer" : "analyzer_polldaddy",
        },
        "text_string" : {
          "type" : "string",
          "index" : "not_analyzed",
        }
      }
    }
  }
}

Each user’s response is its own document where the response is stored in its analyzed form in the text field and unanalyzed in the text_string field. (I should have come up with better names.) By doing a terms facet query on these two fields we can get the most popular words and most popular answers respectively.

However, blindly doing this across the many millions of responses in Polldaddy would run into some serious memory problems due to the overall size of the vocabularies in those fields. For that reason we are using _routing to make sure that all documents related to a single survey go to the same shard. We then allocate a very large number of shards (100 in our case) to limit the number of unique terms in each shard. By routing our query only to one shard the amount of memory that needs to be allocated is greatly reduced, and we can even handle surveys with decent-sized vocabularies.
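
To make this concrete, here is roughly what one of these queries looks like. It’s a sketch rather than our exact production query: the survey id is made up, and the facet names and sizes are illustrative.

curl -XGET 'localhost:9200/polldaddy-survey/freetext/_search?routing=12345' -d '{
  "query": { "term": { "survey_id": 12345 } },
  "size": 0,
  "facets": {
    "popular_words":   { "terms": { "field": "text",        "size": 50 } },
    "popular_answers": { "terms": { "field": "text_string", "size": 50 } }
  }
}'

The routing parameter sends the request to the single shard that holds that survey’s responses, and the term query on survey_id filters out the other surveys that happen to share the shard.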

So just how important is routing to a single shard? Well, a bug snuck into our code at one point and disabled the routing. Here’s what happened to the cache memory consumption from when it was broken to when it got fixed.

Boom! Fixed a bug.

A pretty dramatic change. Without the routing to a single shard the cache would occasionally try to load a very large vocabulary and allocate 10+ GB. This of course would slow down all queries on the server.

But it could have been worse. Commenting on a previous post on this site, Bruno asked me why I suggested setting index.cache.field.type: soft given that it reduces caching performance. This memory consumption activity is why. Before setting the field cache to soft (and before adding the routing), these queries would sometimes consume so much memory that we would run out of the 24GB we had allocated on the servers. In fact, it seemed like no matter how much memory we gave the servers, they would use it and cause OutOfMemory errors that would bring the cluster to a painful halt. Setting the field cache to soft is really the only way I can ensure that we won’t hit those conditions, regardless of what data gets entered into a poll.

I will relish the day when there’s a good fix in ES for term facet memory consumption (Issue #1531). There are so many great applications that can be built on top of faceted search; I’d just love not to have to worry about running out of memory because of a stray query.

A big thanks to Shay Banon (lead ES developer) who originally suggested that I use routing and a large number of shards. And of course many thanks to my colleagues on the Polldaddy team.

Elasticsearch: Five Things I was Doing Wrong

Update: Also check out my series on scaling Elasticsearch.

I’ve been working with Elasticsearch off and on for over a year, but recently I attended Elasticsearch.com’s training class (well worth the time and money) and discovered a few significant things that I was doing just plain wrong.

Before using Elasticsearch I used Lucene directly, and so a few of the errors I made were due to not understanding some of the things ES does for you behind the scenes.

As background, most of the data I’m indexing conforms to the WordPress database schema.

1. Use Arrays for Fields with Multiple Values

For some reason I had neglected to use arrays when creating fields such as a list of tags attached to a document. At some point I started concatenating the tags together into a long string separated by semicolons, and I used a custom analyzer to break them apart like this:

"analysis" : {
  "tokenizer" : {
    "semicolon_token" : {
      "type" => "pattern",
      "pattern" => ";"
  } },
  "analyzer" : {
    "wp_tag_analyzer" : {
      "type" => "custom",
      "tokenizer" => "semicolon_token",
  } }
}

Or, for fields that were lists of URLs, I just separated them by spaces and used the whitespace analyzer. Both methods worked fine for the initial applications, but they have some obvious drawbacks: explicitly inserting a character sequence as a delimiter almost always means you will hit an edge case somewhere down the road where it breaks.

Using an array of items is much easier but, somehow, after initially reading about the array mapping, I completely forgot that it existed. I think I was thinking of ES too much as a text search engine and not enough as a general JSON data store.
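
For comparison, a document with the tags as an array needs no custom analysis at all; each element is analyzed (or not) according to the field’s mapping. A minimal sketch, with made-up field names and values:

{
  "title": "An example post",
  "tags":  ["elasticsearch", "wordpress", "search"]
}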

2. Don’t Use store=true When Mapping Fields

If you are storing the full _source of the document, then there is very little reason to store individual fields separately. You just inflate your index size. I originally started storing the content and titles of documents because I thought it might speed up the highlighting. In practice, I don’t think it did anything for me, and many of our queries don’t do any highlighting at all.

In the end this was a case of premature optimization. Maybe at some point, if I find that 90% of the time we are just returning the post_id and using it to look up the original content in MySQL, we’ll consider storing that field separately to reduce the network traffic and load caused by extracting post_id from _source. But that still feels premature at this point.
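
If we ever do go down that road, the query side is straightforward. A sketch (the query terms are made up): asking for post_id via the fields parameter returns just that value instead of the whole document. Note that how unstored fields are handled varies by ES version; some versions extract them from _source, others require the field to be stored.

{
    "query":  { "match": { "content": "some search terms" } },
    "fields": ["post_id"]
}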

For debugging reasons I would never consider turning off storing _source. It is far too useful to know exactly what data was entered, and you never know when you might want to use a different field for a new application.

3. Don’t Manually Flush, Optimize, or Refresh

Elasticsearch takes care of these core Lucene operations for me; there was never any good reason for me to issue one of these commands when the default ES settings would accomplish the same thing within a few minutes.

The optimize command in particular is dangerous since it merges all segments in the Lucene index (a very time-consuming operation). The code I wrote, which at first issued innocuous optimize commands after some bulk indexing by hand, eventually started getting called repeatedly in automated jobs. Fortunately it never rose to the level of causing real problems, but it’s easy for code you write to get unintentionally called.

Again, this was a case of premature optimization.

4. Set the Appropriate Production Flags

This is another case that didn’t cause a real issue, but could have in the future. The default ES settings are chosen so you can start developing quickly, which means a few of them are not what you want in production. In particular (a sample elasticsearch.yml sketch follows this list):

  • discovery.zen.minimum_master_nodes
    • Should be set to something like N/2 + 1 where N is the number of available master nodes.
  • action.disable_delete_all_indices
    • Do you really want to allow a single command (that could be mistyped) to delete all of your indices? No, neither do I.
  • gateway.recover_after_nodes
    • How many nodes need to be up before the recovery process starts replicating data around the cluster.
  • index.cache.field.type: soft (in 0.90 this field name changed to index.fielddata.cache. Thanks Olivier for the heads up.)
    • I started setting my field cache to soft to ensure that it never created OutOfMemory errors. I think this was particularly helpful because we are doing a lot of faceting.
    • Update 2014-01-09: the indices.fielddata.cache.size setting introduced in 0.90 is a better way to prevent running into OutOfMemory exceptions due to the field cache getting too big. I am no longer using the soft field data cache.
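
Pulling those together, here is a sample elasticsearch.yml sketch. The setting names are the ones above; the values are purely illustrative (they assume three master-eligible nodes) and need to be adapted to your own cluster.

# assumes 3 master-eligible nodes, so N/2 + 1 = 2
discovery.zen.minimum_master_nodes: 2
# refuse requests that would delete every index at once
action.disable_delete_all_indices: true
# wait for most of the cluster before shuffling shards around after a restart
gateway.recover_after_nodes: 2
# 0.90+: cap the field data cache instead of relying on the old soft cache
indices.fielddata.cache.size: 40%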

5. Do Not Use _type as Another Field

The _type field can entice you to use it as another field to indicate a category for your document. Don’t let it.

Here’s where I went wrong. WordPress posts can have different types (post_type) which allow displaying the content of the post in different ways (e.g. image posts, video posts, quotes, a status message), despite the different post types all using the same schema. This seemed to align pretty well with the _type field, so I used an ES dynamic mapping to set _type equal to post_type.

The biggest problem with this: how do you determine the document’s _type after a post has been deleted from the database and you want to also delete it from your index? A document is uniquely identified by both its _id and its _type.

  • If you delete from your RDBMS first (or NoSQL data store flavor of the month), then you may no longer have the _type available to delete the object.
  • If you delete from ES first, then what happens if the RDBMS delete operation fails for some reason?

Making the _type independent of any data within the document ensures that all you will need is the document id. This was one of those “Oh, that was dumb of me” bugs that I completely missed when building my index.
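
With a constant type (for example a single post type used for every document), deleting only needs the document id and no database lookup. A minimal sketch with a made-up index name and id:

curl -XDELETE 'localhost:9200/wp-index/post/12345'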