Mapping WordPress Posts to Elasticsearch

I thought I’d share the Elasticsearch type mapping I am using for WordPress posts. We’ve refined it over a number of iterations and it combines dynamic templates and multi_field mappings along with a number of more standard mappings. So this is probably a good general example of how to index real data from a traditional SQL database into Elasticsearch.

If you aren’t familiar with the WordPress database schema, it looks like this:

These Elasticsearch mappings focus on the wp_posts, wp_term_relationships, wp_term_taxonomy, and wp_terms tables.

To simplify things I’ll just index using an English analyzer and leave discussing multi-lingual analyzers to a different post.

"analysis": {
    "filter": {
        "stop_filter": {
            "type": "stop",
            "stopwords": ["_english_"]
        },
        "stemmer_filter": {
            "type": "stemmer",
            "name": "minimal_english"
        }
    },
    "analyzer": {
        "wp_analyzer": {
            "type": "custom",
            "tokenizer": "uax_url_email",
            "filter": ["lowercase", "stop_filter", "stemmer_filter"],
            "char_filter": ["html_strip"]
        },
        "wp_raw_lowercase_analyzer": {
            "type": "custom",
            "tokenizer": "keyword",
            "filter": ["lowercase"]
        }
    }
}

A few notes on the analyzers:

  • The minimal_english stemmer only removes plurals rather than potentially butchering the difference between words like “computer”, “computes”, and “computing”.
  • The lowercase keyword analyzer makes case-insensitive exact matching possible.
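
To make the second note a bit more concrete, here is a rough plain-Python imitation of what a keyword tokenizer plus lowercase filter produces, compared with a not_analyzed field. This is not Elasticsearch code; the function names are made up for the example, and it assumes the query term is lowercased the same way before the lookup.

```python
def not_analyzed(value):
    """Approximate a not_analyzed string field: the whole value is one unchanged term."""
    return [value]


def raw_lowercase(value):
    """Approximate wp_raw_lowercase_analyzer: keyword tokenizer (one token) + lowercase filter."""
    return [value.lower()]


def exact_match(indexed_terms, query_term):
    """A term lookup only matches when the query term equals an indexed term."""
    return query_term in indexed_terms


tag = "Open Source"

# Against the untouched value, the case has to match exactly.
print(exact_match(not_analyzed(tag), "open source"))  # False
# Lowercasing both sides makes the exact match case-insensitive.
print(exact_match(raw_lowercase(tag), "OPEN SOURCE".lower()))  # True
```

Note that the whole value stays a single term in both cases, so “Open” on its own would match neither field.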

Let’s take a look at the post mapping:

"post": {
    "dynamic_templates": [
        {
            "tax_template_name": {
                "path_match": "taxonomy.*.name",
                "mapping": {
                    "type": "multi_field",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer"
                        }
                    }
                }
            }
        }, {
            "tax_template_slug": {
                "path_match": "taxonomy.*.slug",
                "mapping": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        }, {
            "tax_template_term_id": {
                "path_match": "taxonomy.*.term_id",
                "mapping": {
                    "type": "long"
                }
            }
        }
    ],
    "_all": {
        "enabled": false
    },
    "properties": {
        "post_id": {
            "type": "long"
        },
        "blog_id": {
            "type": "long"
        },
        "site_id": {
            "type": "long"
        },
        "post_type": {
            "type": "string",
            "index": "not_analyzed"
        },
        "lang": {
            "type": "string",
            "index": "not_analyzed"
        },
        "url": {
            "type": "string",
            "index": "not_analyzed"
        },
        "location": {
            "type": "geo_point",
            "lat_lon": true
        },
        "date": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "date_gmt": {
            "type": "date",
            "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd"
        },
        "author": {
            "type": "multi_field",
            "fields": {
                "author": {
                    "type": "string",
                    "index": "analyzed",
                    "analyzer": "wp_analyzer"
                },
                "raw": {
                    "type": "string",
                    "index": "not_analyzed"
                }
            }
        },
        "author_login": {
            "type": "string",
            "index": "not_analyzed"
        },
        "title": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "content": {
            "type": "string",
            "index": "analyzed",
            "analyzer": "wp_analyzer"
        },
        "tag": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "tag"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "tag.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "tag.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        },
        "category": {
            "type": "object",
            "properties": {
                "name": {
                    "type": "multi_field",
                    "path": "just_name",
                    "fields": {
                        "name": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_analyzer",
                            "index_name": "category"
                        },
                        "raw": {
                            "type": "string",
                            "index": "not_analyzed",
                            "index_name": "category.raw"
                        },
                        "raw_lc": {
                            "type": "string",
                            "index": "analyzed",
                            "analyzer": "wp_raw_lowercase_analyzer",
                            "index_name": "category.raw_lc"
                        }
                    }
                },
                "slug": {
                    "type": "string",
                    "index": "not_analyzed"
                },
                "term_id": {
                    "type": "long"
                }
            }
        }
    }
}
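
To see how a document lines up with this mapping, here is a small Python sketch. The document values are invented, the “company” taxonomy is just an example, and storing each taxonomy as a list of term objects is my assumption about the document shape. The wildcard matching uses Python’s fnmatch as a stand-in for Elasticsearch’s path_match, which is close enough for patterns that only use “*”.

```python
from fnmatch import fnmatchcase

# Hypothetical post document; every value here is invented for illustration.
doc = {
    "post_id": 1234,
    "blog_id": 56,
    "post_type": "post",
    "lang": "en",
    "date": "2013-02-15 10:30:00",
    "author": "Greg Ichneumon Brown",
    "author_login": "gibrown",
    "title": "An example title",
    "content": "Plain text with the HTML already stripped.",
    "tag": [{"name": "Elasticsearch", "slug": "elasticsearch", "term_id": 11}],
    "category": [{"name": "Search", "slug": "search", "term_id": 7}],
    "taxonomy": {
        "company": [{"name": "Example Co", "slug": "example-co", "term_id": 42}]
    },
}

# The path_match patterns from the dynamic templates in the mapping.
TEMPLATES = {
    "tax_template_name": "taxonomy.*.name",
    "tax_template_slug": "taxonomy.*.slug",
    "tax_template_term_id": "taxonomy.*.term_id",
}


def leaf_paths(value, prefix=""):
    """Yield the dotted path of every leaf value, looking through lists of objects."""
    if isinstance(value, dict):
        for key, child in value.items():
            yield from leaf_paths(child, f"{prefix}.{key}" if prefix else key)
    elif isinstance(value, list):
        for child in value:
            yield from leaf_paths(child, prefix)
    else:
        yield prefix


def matching_template(path):
    """Return the name of the first dynamic template whose pattern matches, else None."""
    for name, pattern in TEMPLATES.items():
        if fnmatchcase(path, pattern):
            return name
    return None


for path in leaf_paths(doc):
    print(path, "->", matching_template(path) or "explicit mapping")
```

Only the three taxonomy.company.* paths are picked up by a template; everything else in this example document is covered by the explicit properties.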

Most of the fields are pretty self-explanatory, so I’ll just outline the more complex ones:

  • date and date_gmt: We define the allowed formats because we are taking the dates out of MySQL. We also do some checking of the dates since MySQL will allow some things in a DATETIME field that ES will balk at and cause the indexing operation to fail. For instance MySQL accepts leap dates in non-leap years.
  • content: Content gets stripped of HTML and shortcodes, then converted to UTF-8 in cases where it isn’t already.
  • author and author.raw: The author field corresponds to the user’s display_name. Clearly we need to analyze the field so “Greg Ichneumon Brown” can be matched on a search for “Greg”, but what about when we facet on the field? If we use the analyzed field then the results would have the terms “greg”, “ichneumon”, and “brown”. Instead, by using ES’s multi_field mapping feature to auto-generate author.raw, the faceted results on that field will give us “Greg Ichneumon Brown”.
  • tag and category: Tags and Categories similarly need raw versions for faceting so we preserve the original tag. Additionally there are a number of ways users can filter the content. WordPress builds slugs from each category/tag to uniquely identify them in a human readable way and there is a unique integer (term_id) associated with each term. The tag.raw_lc is used for exact matching a term without worrying about the case. This may seem like a lot of duplication, but the overriding goal here is to avoid using MySQL for search so we index everything. Extracting data into multiple fields ensures that we will have flexibility when filtering the data in the future.
  • taxonomy.*: WordPress allows custom taxonomies (of which categories and tags are two built-in taxonomies) so we need a way to create a custom path in each document that allows access to each taxonomy. This is where Elasticsearch’s dynamic templates shine. For a custom taxonomy such as “company” the paths will become taxonomy.company.name, taxonomy.company.name.raw, taxonomy.company.slug, and taxonomy.company.term_id.
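
The date checking mentioned for date and date_gmt above could look something like the following Python sketch. It is not our actual indexing code: the function name is hypothetical, and strptime is only an approximation of the two formats declared in the mapping.

```python
from datetime import datetime

# The two formats the mapping accepts, written as strptime patterns.
ALLOWED_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d")


def is_indexable_date(value):
    """Return True if value is a real calendar date in one of the allowed formats."""
    for fmt in ALLOWED_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


print(is_indexable_date("2012-02-29 08:00:00"))  # leap day in a leap year: True
print(is_indexable_date("2013-02-29 08:00:00"))  # leap day in a non-leap year: False
print(is_indexable_date("0000-00-00 00:00:00"))  # MySQL "zero" date: False
```

What to do with a value that fails the check (skip the field, substitute another date, or skip the document) is a separate decision that this sketch leaves open.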

The ES documentation is very complete, but it’s not always easy to see how to build complex mappings that fit the individual pieces together. I hope this helps in your own ES development efforts.

9 Comments

  1. naveen

     /  June 18, 2013

    I would like to know the importance of templates and properties.
    I am trying to create an index with multiple columns, and I want to use that index in search.

    • Greg

       /  June 18, 2013

      It depends on what you are trying to do. If you can avoid using dynamic templates then I would suggest doing so and just setting standard mappings for the fields.

      Dynamic templates are really useful when you don’t know the names of all the fields you will be creating but do know something about what analysis (or data types) are going to be in those fields.

  2. I know this is an old post (and a great one, by the way), but any chance you could provide some examples (either here or in another post) of the queries you used against these indices? It would be great to see which parts of the Query DSL you leveraged for this. Thanks!

    • Greg

       /  November 26, 2013

      A lot of our queries look something like:

      {
         "query": {
            "filtered": {
               "query": {
                  "multi_match": {
                     "query": "wordpress",
                     "fields": [
                        "title",
                        "content"
                     ]
                  }
               },
               "filter": {
                  "term": {
                     "author_login": "gibrown"
                  }
               }
            }
         },
         "sort": [
            {
               "date": {
                  "order": "asc"
               }
            }
         ]
      }
      

      We generally use multi_match for most queries, combined with some number of filters. We prefer to use filters as much as possible since they are cached by ES. I’ll try to write up some more complex cases in another post, but probably won’t get to it till the new year.

      Also, FYI, we’re in the process of open sourcing a lot of the code for indexing core WordPress posts into ES. It’s still a work in progress, because we need to separate out the WordPress.com-specific code, but it’s getting there.

      • Thanks for your response! Since there are so many strategies for indexing data and also for retrieving data, the decision on which indexing strategy to use for the desired retrieval strategy can be daunting. I’m watching your github repo now, thanks again!

  3. Hello, I was wondering whether you use the “highlighting” feature in your queries. If so, how do you prevent the highlight from containing the original HTML markup? You do an HTML strip when indexing, but the highlight abstract will include the first 100 characters of the original text and thus include HTML formatting, which is typically something you do not want to display in a highlight. Hope you can help…

    • Greg Ichneumon Brown

       /  May 12, 2014

      Hi Marc,

      We actually strip all html before we send it to Elasticsearch for all string fields. As you’ve found, the ES Highlighter doesn’t work well on HTML. We’ve also seen cases where HTML gets past our indexing stripping in a pre block and the highlighting still becomes problematic. To handle those cases we do some html tag stripping on the output.

      Eventually it would be nice to improve the built in ES highlighter.

  4. Great post. I was wondering how you have mapped posts and comments? Are comments defined as parent-child, nested type, denormalized, or …? I know there are different ways to work with these kinds of relationships, but I was wondering what the best way is, since you probably want to search for text in comments and then return the post, or find the comments a user has made. Hope you can help.

    Marc

    • Greg Ichneumon Brown

       /  June 10, 2014

      Currently we index comments into separate docs that are children of the blog doc. Mostly this is to keep comments in the same shard as posts as we’ve found that parent-child queries are tough to scale.

      post_id is indexed with the comment, so it is simple to find the post for a matching comment. Right now our indexing wouldn’t really help you run a search for words that occur in multiple comments (or in a comment and a post) and resolve the post_id. Maybe you could implement something like this with aggregations on the current schema: aggregate the top post_id for a search across the content of posts and comments.

      Sidenote: This is why it’s important to use the same field name across multiple doc types when that makes sense.
