Mapping WordPress Posts to Elasticsearch

I thought I’d share the Elasticsearch type mapping I am using for WordPress posts. We’ve refined it over a number of iterations and it combines dynamic templates and multi_field mappings along with a number of more standard mappings. So this is probably a good general example of how to index real data from a traditional SQL database into Elasticsearch.

If you aren’t familiar with the WordPress database scheme it looks like this:

These Elasticsearch mappings focus on the wp_posts, wp_term_relationships, wp_term_taxonomy, and wp_terms tables.

To simplify things I’ll just index using an English analyzer and leave discussing multi-lingual analyzers to a different post.

"analysis": {
    "filter": {
        "stop_filter": {
            "type": "stop",
            "stopwords": ["_english_"]
        },
        "stemmer_filter": {
            "type": "stemmer",
            "name": "minimal_english"
        }
    },
    "analyzer": {
        "wp_analyzer": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": ["lowercase", "stop_filter", "stemmer_filter"],
            "char_filter": ["html_strip"]
        },
        "wp_raw_lowercase_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["lowercase"]
        }
    }
}

A few notes on the analyzers:

  • The minimal_english stemmer only removes plurals rather than potentially butchering the difference between words like “computer”, “computes”, and “computing”.
  • Lowercase keyword analyzer makes doing an exact search without case possible.

Let’s take a look at the post mapping:

"post": {
    "dynamic_templates": [
        {
            "tax_template_name": {
                "path_match": "taxonomy.*.name",
                "mapping": {
                    "type": "multi_field",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer"
                        }
                    }
                }
            }
        }, {
            "tax_template_slug": {
                "path_match": "taxonomy.*.slug",
                "mapping": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }, {
            "tax_template_term_id": {
                "path_match": "taxonomy.*.term_id",
                "mapping": {
                    "type": "long"
                }
            }
        }
    ],
    "_all": {
        "enabled": false
    },
    "properties": {
        "post_id": {
            "type": "long"
        },
        "blog_id": {
            "type": "long"
        },
        "site_id": {
            "type": "long"
        },
        "post_type": {
            "type": "string",
            "index": "not_analyzed"
        },
        "lang": {
            "type": "string",
            "index": "not_analyzed"
        },
        "url": {
            "type": "string",
            "index": "not_analyzed"
        },
        "location": {
            "type": "geo_point",
            "lat_lon": true
        },
        "date": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "date_gmt": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "author": {
            "type": "multi_field",
            "fields": {
                "author": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "wp_analyzer"
                },
                "raw": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        },
        "author_login": {
            "type": "string",
            "index": "not_analyzed"
        },
        "title": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "content": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "tag": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "tag"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "tag.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "tag.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        },
        "category": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "category"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "category.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "category.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        },
    }
}

Most of the fields are pretty self explanatory, so I’ll just outline to more complex ones:

  • date and date_gmt: We define the allowed formats because we are taking the dates out of MySQL. We also do some checking of the dates since MySQL will allow some things in a DATETIME field that ES will balk at and cause the indexing operation to fail. For instance MySQL accepts leap dates in non-leap years.
  • content: Content gets stripped of HTML and shortcodes, then converted to UTF-8 in cases where it isn’t already.
  • author and author.raw: The author field corresponds to the user’s display_name. Clearly we need to analyze the field so “Greg Ichneumon Brown” can be matched on a search for “Greg”, but what about when we facet on the field. If we use the analyzed field then the results would have the terms “greg”, “ichneumon”, and “brown”. Instead, by using ES’s multi_field mapping feature to auto generate author.raw the faceted results on that field will give us “Greg Ichneumon Brown”.
  • tag and category: Tags and Categories similarly need raw versions for faceting so we preserve the original tag. Additionally there are a number of ways users can filter the content. WordPress builds slugs from each category/tag to uniquely identify them in a human readable way and there is a unique integer (term_id) associated with each term. The tag.raw_lc is used for exact matching a term without worrying about the case. This may seem like a lot of duplication, but the overriding goal here is to avoid using MySQL for search so we index everything. Extracting data into multiple fields ensures that we will have flexibility when filtering the data in the future.
  • taxonomy.*: WordPress allows custom taxonomies (of which categories and tags are two built-in taxonomies) so we need a way to create a custom path in each document that allows access to each taxonomy. This is where Elasticsearch’s dynamic templates shine. For a custom taxonomy such as “company” the paths will become taxonomy.company.name, taxonomy.company.nametaxonomy.company.name.raw, taxonomy.company.slug, and taxonomy.company.term_id.

The ES documentation is very complete, but it’s not always easy to see how to build complex mappings that fit the individual pieces together. I hope this helps in your own ES development efforts.

Building Word Clouds with Faceted Search

Elasticsearch’s faceted results are a great way to analyze the contents of a set of documents. For over a year now, Polldaddy has used Elasticsearch to create reports for the most popular answers and words given to free text survey responses. For more details take a look at the feature announcement.

However, running faceted search on such a wide array of user data can be difficult. Faceted Search in Elasticsearch can consume a lot of memory which leads to the suggestion in the ES documentation to “make sure the number of unique tokens a field can have is not large”. To make sure that we can accept any arbitrary user input we use a couple of tricks.

First let’s take a look at the mapping we use for documents in the polldaddy-survey index:

{
  "polldaddy-survey" : {
    "freetext" : {
      "_routing" : {
        "required" : true,
        "path" : "survey_id"
      },
      "properties" : {
        "resp_id" : {
          "type" : "long",
        },
        "survey_id" : {
          "type" : "long",
        },
        "text" : {
          "type" : "string",
          "analyzer" : "analyzer_polldaddy",
        },
        "text_string" : {
          "type" : "string",
          "index" : "not_analyzed",
        }
      }
    }
  }
}

Each user’s response is its own document where the response is stored in its analyzed form in the text field and unanalyzed in the text_string field. (I should have come up with better names.) By doing a terms facet query on these two fields we can get the most popular words and most popular answers respectively.

However, blindly doing this across the many millions of responses in Polldaddy would run into some serious memory problems due to the overall size of the vocabularies in those fields. For that reason we are using _routing to make sure that all documents related to a single survey go to the same shard. We then allocate a very large number of shards (100 in our case) to limit the number of unique terms in each shard. By routing our query only to one shard the amount of memory that needs to be allocated is greatly reduced, and we can even handle surveys with decent-sized vocabularies.

So just how important is routing to a single shard, well a bug snuck into our code at one point and disabled the routing. Here’s what happened to the cache memory consumption from when it was broken to when it got fixed.

Boom! Fixed a bug.

A pretty dramatic change. Without the routing to a single shard the cache would occasionally try to load a very large vocabulary and allocate 10+ GB. This of course would slow down all queries on the server.

But it could have been worse. Commenting on a previous post on this site Bruno asked me why I suggested setting index.cache.field.type: soft given that it reduces the caching performance. This memory consumption activity is why. Before setting the field cache to soft (and before adding the routing) these queries would sometimes consume so much memory that we would run out of the 24GB we had allocated on the servers. In fact it seemed like no matter how much memory we gave the servers, they would use it and cause OutOfMemory errors that would bring the cluster to a painful halt. Setting the field cache to soft is really the only way I can ensure that we won’t hit those conditions regardless of what data gets entered into a poll.

I will relish the day when there’s a good bug fix in ES for the term facet memory consumption (Issue #1531). There’s so many great applications that can be built on top of faceted searches, I’d just love to not have to worry about running out of memory because of a stray query.

A big thanks to Shay Bannon (lead ES developer) who originally suggested that I use routing and a large number of shards. And of course many thanks to my colleagues on the Polldaddy team.

Elasticsearch: Five Things I was Doing Wrong

Update: Also check out my series on scaling Elasticsearch.

I’ve been working with Elasticsearch off and on for over a year, but recently I attended Elasticsearch.com’s training class (well worth the time and money) and discovered a few significant things that I was doing just plain wrong.

Before using Elasticsearch I used Lucene directly, and so a few of the errors I made were due to not understanding some of the things ES does for you behind the scenes.

As background, most of the data I’m indexing conforms to the WordPress database schema.

1. Use Arrays for Fields with Multiple Values

For some reason I had neglected to use arrays when creating fileds such as a list of tags attached to a document. At some point I started concatenating the tags together into a long string separated by semicolons and I used a custom analyzer to break them apart like this:

"analysis" : {
  "tokenizer" : {
    "semicolon_token" : {
      "type" => "pattern",
      "pattern" => ";"
  } },
  "analyzer" : {
    "wp_tag_analyzer" : {
      "type" => "custom",
      "tokenizer" => "semicolon_token",
  } }
}

Or, for fields that were lists of URLs I just separated them by spaces and used the whitespace analyzer. Both methods worked fine for the initial applications, but have some obvious drawbacks. Explicitly inserting a character sequence as a delimiter will almost always means you will hit an edge case somewhere down the road where it will break.

Using an array of items is a much easier way, but somehow, after initially reading about the array mapping, I completely forgot that it existed. I think I was thinking of ES too much as a text searching engine and not enough as a general JSON data store.

2. Don’t Use store=true When Mapping Fields

If you are storing the full _source of the document, then there is very little reason to store individual fields separately. You just inflate your index size. I originally started storing the content and titles of documents because I thought it might speed up the highlighting. In practice, I don’t think it did anything for me, and many of our queries don’t do any highlighting at all.

In the end this was a case of premature optimization. Maybe at some point if I find that 90% of the time we are just returning the post_id and using that to lookup the original content in MySQL we’ll consider storing that separately to reduce network traffic and load caused by extracting the post_id field from _source, but that still feels premature at this point.

For debugging reasons I would never consider turning off storing _source. It is far too useful to know exactly what data was entered, and you never know when you might want to use a different field for a new application.

3. Don’t Manually Flush, Optimize, or Refresh

Elasticsearch takes care of these core Lucene operations for me, there was never any good reason for me to issue one of these commands when the default ES settings would accomplish it within a few minutes.

The optimize command in particular is dangerous since it merges all segments in the Lucene index (a very time consuming operation). The code I wrote which at first was issuing innocuous optimize commands after doing some bulk indexing by hand eventually started getting called repeatedly in automated jobs. Fortunately it never rose to a level of causing real problems, but its easy for code you write to get unintentionally called.

Again, this was a case of premature optimization.

4. Set the Appropriate Production Flags

This is another case that didn’t cause a real issue, but could have in the future. The default settings for ES are set to ensure it works to quickly start development. This means that a few of the default settings are not what you want when in production. In particular:

  • discovery.zen.minimum_master_nodes
    • Should be set to something like N/2 + 1 where N is the number of available master nodes.
  • action.disable_delete_all_indices
    • Do you really want to allow a single command (that could be mistyped) to delete all of your indices? No, neither do I.
  • gateway.recover_after_nodes
    • How many nodes need to be up before the recovery process starts replicating data around the cluster.
  • index.cache.field.type: soft (in 0.90 this field name changed to index.fielddata.cache. Thanks Olivier for the heads up.)
    • I started setting my field cache to soft to ensure that it never created OutOfMemory errors. I think this was particularly helpful because we are doing a lot of faceting.
    • Update 2014-01-09: the indices.fielddata.cache.size setting introduced in 0.90 is a better way to prevent running into OutOfMemory exceptions due to the field cache getting too big. I am no longer using the soft field data cache.

5. Do Not Use _type as Another Field

The _type field can entice you to use it as another field to indicate a category for your document. Don’t let it.

Here’s where I went wrong. WordPress posts can have different types (post_type) which allow displaying the content of the post in different ways (e.g. image posts, video posts, quotes, a status message). This despite the different post types all using the same schema. This seemed to align pretty well with the _type fields so I used an ES dynamic mapping to have post_type == _type.

The biggest problem with this is how do you determine the document’s _type after a post has been deleted from the database and you want to also delete it from your index. A document is uniquely identified both by its _id and its _type.

  • If you delete from your RDBMS first (or NoSQL data store flavor of the month), then you may no longer have the _type available to delete the object.
  • If you delete from ES first then what if the RDBMS delete operation fails for some reason.

Making the _type independent of any data within the document ensures that all you will need is the document id. This was one of those “Oh, that was dumb of me” bugs that I completely missed when building my index.

Quickly Build Faceted Search with ElasticSearch and Backbone.js

I’ve been working with ElasticSearch off and on for the past year, and recently I’ve done a lot of work using Backbone.js to build interactive elements for web pages. Time to combine them together into a modular library: es-backbone.js.

Faceted search is one of the more powerful aspects of ElasticSearch. For es-backbone I was inspired by Karel Minarik’s very cool data visualization example. Initially I started from his implementation, then veered away from Protovis to use jQuery Flot for the charting when I realized my Javascript graphics abilities were not up to making use of Protovis. Flot makes it very fast to build and customize charts.

But managing all the data that comes back with your faceted search results, displaying it for the user, and allowing them to interact with the data and filter it further is also a bit of a headache. Using Backbone to keep a model of the current query the user is doing and another model of the search results helps keep the data well-organized and easy to update. By creating highly modular Backbone Views I can quickly customize a search page depending on what fields the data contains and how we want to display it.

I don’t have any public ElasticSearch data for a good demo, so a screenshot will have to do:

Each part of the page is a separate Backbone View, which allows you to customize the page very quickly based on what data you have. Think that pie chart of authors is too busy, replacing it with a list of the top sites (similar to the tags list) is only a few lines of code. Currently the library has Views for displaying facets as:

  • A pie chart of terms or ranges
  • A list of terms (with counts and percentages)
  • A timeline of dates (which auto switches scales between months, weeks, and days )

All of these views will re-filter your results when you click on the pie/chart/list so you can drill down into your search results. And I used Select2.js in tagging mode to make it easy to add and remove filters. I think the interaction is pretty cool, and wish I had some live public data to show it off on.

The library definitely allows new sites to be built very fast. Once the data is indexed, you can build a site in less than an hour (my last one took 35 min). I’ve released it on github in the hopes that others will find it useful (patches welcome). I’ve included a simple example which you can grep through for “TODO” to find the parts you need to edit to customize for your own application.