Elasticsearch: Five Things I was Doing Wrong

Update: Also check out my series on scaling Elasticsearch.

I’ve been working with Elasticsearch off and on for over a year, but recently I attended Elasticsearch.com’s training class (well worth the time and money) and discovered a few significant things that I was doing just plain wrong.

Before using Elasticsearch I used Lucene directly, and so a few of the errors I made were due to not understanding some of the things ES does for you behind the scenes.

As background, most of the data I’m indexing conforms to the WordPress database schema.

1. Use Arrays for Fields with Multiple Values

For some reason I had neglected to use arrays when creating fields such as a list of tags attached to a document. At some point I started concatenating the tags together into a long string separated by semicolons, and I used a custom analyzer to break them apart, like this:

"analysis" : {
  "tokenizer" : {
    "semicolon_token" : {
      "type" => "pattern",
      "pattern" => ";"
  } },
  "analyzer" : {
    "wp_tag_analyzer" : {
      "type" => "custom",
      "tokenizer" => "semicolon_token",
  } }
}

Or, for fields that were lists of URLs, I just separated them by spaces and used the whitespace analyzer. Both methods worked fine for the initial applications, but they have an obvious drawback: explicitly inserting a character sequence as a delimiter almost always means you will eventually hit an edge case where it breaks.

Using an array of items is much easier, but somehow, after initially reading about the array mapping, I completely forgot that it existed. I think I was treating ES too much as a text search engine and not enough as a general JSON data store.
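
For example, here is a minimal sketch of a document using an array (the field names follow my WordPress-style schema, but the shape is what matters):

{
  "post_id" : 123,
  "title" : "Five Things I Was Doing Wrong",
  "tag" : [ "elasticsearch", "lucene", "wordpress" ]
}

Each element of the array is indexed as a separate value of the tag field, so there is no delimiter to collide with and no custom analyzer to maintain.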

2. Don’t Use store=true When Mapping Fields

If you are storing the full _source of the document, then there is very little reason to store individual fields separately. You just inflate your index size. I originally started storing the content and titles of documents because I thought it might speed up the highlighting. In practice, I don’t think it did anything for me, and many of our queries don’t do any highlighting at all.

In the end this was a case of premature optimization. Maybe at some point, if I find that 90% of the time we are just returning the post_id and using it to look up the original content in MySQL, we will consider storing that field separately to reduce the network traffic and load caused by extracting it from _source, but that still feels premature at this point.
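
If we ever do, the first thing to try is probably _source filtering rather than stored fields. A sketch (assuming ES 1.0+, where _source filtering is available; the index name is made up):

POST /posts/_search
{
  "_source" : [ "post_id" ],
  "query" : { "match" : { "content" : "elasticsearch" } }
}

This trims the response down to just post_id without storing any field separately, though note that ES still parses _source on its side to extract the field.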

For debugging reasons I would never consider turning off storing _source. It is far too useful to know exactly what data was entered, and you never know when you might want to use a different field for a new application.

3. Don’t Manually Flush, Optimize, or Refresh

Elasticsearch takes care of these core Lucene operations for me; there was never any good reason for me to issue one of these commands when the default ES settings would accomplish the same thing within a few minutes.

The optimize command in particular is dangerous since it merges all segments in the Lucene index (a very time-consuming operation). Code I wrote that at first issued innocuous optimize commands after some hand-run bulk indexing eventually started getting called repeatedly from automated jobs. Fortunately it never rose to the level of causing real problems, but it's easy for code you write to get called in ways you never intended.

Again, this was a case of premature optimization.
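
The one related knob that is worth turning is the index refresh interval, which controls how often ES makes new documents searchable; raising it during heavy bulk indexing is not the same as issuing manual refresh commands. A sketch, with an illustrative value and a made-up index name:

PUT /posts/_settings
{
  "index" : { "refresh_interval" : "30s" }
}

Setting it back to the default of "1s" afterwards restores normal near-real-time search, and ES keeps handling the actual refreshes itself either way.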

4. Set the Appropriate Production Flags

This is another case that didn't cause a real issue, but could have in the future. The default ES settings are chosen so that you can start developing quickly, which means a few of them are not what you want in production. In particular (a sketch of these settings follows the list):

  • discovery.zen.minimum_master_nodes
    • Should be set to N/2 + 1, where N is the number of master-eligible nodes; this prevents a split-brain cluster.
  • action.disable_delete_all_indices
    • Do you really want to allow a single command (that could be mistyped) to delete all of your indices? No, neither do I.
  • gateway.recover_after_nodes
    • How many nodes need to be up before the recovery process starts replicating data around the cluster.
  • index.cache.field.type: soft (in 0.90 this field name changed to index.fielddata.cache. Thanks Olivier for the heads up.)
    • I started setting my field cache to soft to ensure that it never created OutOfMemory errors. I think this was particularly helpful because we are doing a lot of faceting.
    • Update 2014-01-09: the indices.fielddata.cache.size setting introduced in 0.90 is a better way to prevent running into OutOfMemory exceptions due to the field cache getting too big. I am no longer using the soft field data cache.
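
Here is a sketch of how these might look in elasticsearch.yml for a hypothetical cluster with three master-eligible nodes (the values are illustrative, not recommendations):

# elasticsearch.yml
discovery.zen.minimum_master_nodes: 2   # N/2 + 1 with N = 3 master-eligible nodes
action.disable_delete_all_indices: true
gateway.recover_after_nodes: 2
indices.fielddata.cache.size: 40%       # the 0.90+ cap mentioned in the update above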

5. Do Not Use _type as Another Field

The _type field can entice you to use it as another field to indicate a category for your document. Don’t let it.

Here’s where I went wrong. WordPress posts can have different types (post_type) which allow displaying the content of the post in different ways (e.g. image posts, video posts, quotes, a status message), even though the different post types all use the same schema. This seemed to align pretty well with the _type field, so I used an ES dynamic mapping to set post_type == _type.

The biggest problem with this is figuring out the document’s _type after a post has been deleted from the database, when you also want to delete it from your index. A document is uniquely identified by both its _id and its _type.

  • If you delete from your RDBMS first (or NoSQL data store flavor of the month), then you may no longer have the _type available to delete the object.
  • If you delete from ES first, what happens if the RDBMS delete operation then fails for some reason?

Making the _type independent of any data within the document ensures that all you will need is the document id. This was one of those “Oh, that was dumb of me” bugs that I completely missed when building my index.
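
Concretely, here is a sketch of the fixed approach (the index and type names are made up): _type stays constant for every post, post_type is just an ordinary field, and a delete needs nothing but the id.

PUT /wordpress/post/123
{
  "post_id" : 123,
  "post_type" : "video",
  "title" : "..."
}

DELETE /wordpress/post/123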



26 responses to “Elasticsearch: Five Things I was Doing Wrong”

  1. […] attended the Boston ElasticSearch training seminar and had a great time.  In the spirit of “Elasticsearch: Five Things I was Doing Wrong“, I thought I’d write up a few tips that I […]

  2. Bruno Miranda

    You take a performance hit when you set index.cache.field.type: soft; the other option would be to add more RAM.

    1. Greg

      Yeah, you’re definitely right about the performance hit this setting causes.

      The difficulty I had with the resident cache type is that there is no guarantee I won’t run out of memory, causing OutOfMemory exceptions that in the past have brought down my server. Maybe the state of ES has improved since early 2012 such that I should no longer be worried about this.

      We are doing faceting on fields that have very large vocabularies for Polldaddy’s “popular answers” and “popular words” results page. Because this is user-entered data it is both hard to anticipate and hard to control, which is why I feel like the soft cache setting is necessary regardless of how much RAM I give the server.

      I should write up a more detailed post on this. Thanks for picking up this point.

  3. Olivier Favre

    It’s actually action.disable_delete_all_indices; you dropped the prefix.
    And I believe from v0.90.0.Beta1 on, index.cache.field.type is now called index.fielddata.cache.

    1. Greg

      Had missed that change in 0.90, thanks for the heads up, and for the correction.

  4. trenpixster

    That was a nice roundup, thanks for that Greg!

  5. Jason Scheller

    I can’t speak to highlighting, but using store=true gives a noticeable performance improvement if you’re actually returning that field in your searches. If store is disabled, Elasticsearch has to load the entire document (likely from disk), parse all the JSON, and then retrieve that one field. The source documents will also be loaded into cache, which takes up a lot more memory than just caching a single field.


  6. […] it could have been worse. Commenting on a previous post on this site Bruno asked me why I suggested setting index.cache.field.type: soft given that it […]

  7. emilamork

    Great post :)

    But modifying the refresh interval in Elasticsearch can increase indexing speed if you bulk index many documents.

    But I guess you meant not to call refresh manually?

    1. Greg Ichneumon Brown

      Ya we just don’t do any manual/client-triggered refreshes. We do adjust the refresh rate on our larger indices.

      1. Ievgenii Fedorenko

        How bad is it to call refresh manually? Our refresh interval is set to 1s, and that is too long for an end user to wait in some cases where we index and then need to search right away. Should we consider setting the refresh interval to a smaller number?

      2. Greg Ichneumon Brown

        It can be pretty bad to call refresh manually; it depends on how often you do it. A smaller refresh interval can help. It may be worth looking at the recommendations here: https://www.elastic.co/guide/en/elasticsearch/reference/2.2/docs-index_.html#index-refresh
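
        For the rare document that really must be searchable immediately, the per-request option that page discusses looks like this (a sketch against the 2.x API; the index, type, and field are made up):

        PUT /items/item/1?refresh=true
        { "title" : "needs to be searchable right away" }

        That forces a refresh of the affected shard, so the same cost concerns apply if it happens often.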

  8. samant77

    Hi Greg,

    On #5: let’s say my RDBMS has Order, OrderLines, Shipments, and Invoices as 4 different related tables (there is a correlation between these entities).
    What would be a good design in ES:
    1. To correlate the 5 tables’ data and load them into ES as 1 single index, 1 single type
    or
    2. To create 1 index and 5 different types
    or
    3. To create 5 indices with 1 type each.

    The design considerations being performance of ES search queries and ease of loading data from the RDBMS into ES (the ETL process).

    Any help/advice is much appreciated!

    Regards
    Samant

    1. Greg Ichneumon Brown

      Hi Samant,

      Depends a lot on how you are querying the data and how much data you have.

      If you are querying across documents, it is often useful to have the docs in the same index. But there are some practical benefits to putting them in different indices. Doing so can make it really easy to drop an entire index and reindex only one set of documents.

      If you’re mostly querying across the different types though, then I’d say put them into the same index. That should make for faster searches, and filters that apply to multiple document types can be cached more effectively.

  9. samant77

    Sorry for the typo; I should have said 4 types, not 5, given the example has 4 RDBMS tables.

  10. geekpete

    It’s probably not recommended, but after doing a very large delete (i.e., deleting all docs apart from a small subset that match a query) in a very large index, the doc counts are correct but the raw size of the index has not changed.

    And assuming that snapshot/restore will do nothing with the segments other than lift and shift them, the deletes will have to be cleared out before doing a snapshot to avoid the snapshot being unnecessarily large.

    I suppose an optimize is required to free up the disk space used by all that deleted data, but isn’t the optimize often more expensive than reindexing only the subset of data you wanted to keep into a new index?

    1. Greg Ichneumon Brown

      Ya, if you’re deleting most of your index regularly you are probably better off reindexing and deleting the old index.

      When you do a delete it only marks the documents as deleted; it doesn’t actually purge them immediately. They get purged when the segments of the index get merged together. When this happens depends on a number of settings (that in my experience are not worth adjusting), and it may not get triggered until you do some more indexing. An optimize will trigger a merge and is sometimes a good idea in these cases.

      AFAIK snapshotting copies the index files as they are, so yes without waiting for a merge or doing an optimize your index will continue to be large.
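
      If reclaiming the space from deletes is all you want, there is also a flag to expunge deletes without forcing a full merge (a sketch against the 1.x _optimize API; the index name is made up):

      POST /large_index/_optimize?only_expunge_deletes=true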

  11. JimO

    Hi Greg,

    I started learning Elasticsearch this past Monday. I am trying to use it in Symfony via FOSElasticaBundle.

    I have been trying to do an Obtao tutorial which is very basic, using my own classes and data to save time.

    The code in question that I am having difficulty with is in my controller and is as follows:

    $elasticaManager = $this->container->get('fos_elastica.manager');
    $repository = $elasticaManager->getRepository('JTO\TestQBundle\Entity\Repository\QuestionRepository'); // FAILS HERE
    $results = $repository->search($questionSearch);

    When the getRepository() line runs I get the following error message:

    “No search finder configured for JTO\TestQBundle\Entity\Repository\QuestionRepository”

    I have googled this error message and problem to death and I have only found one other person that has had this problem, but that person never reported his solution.

    The bottom line is that the name of my repository has not been assigned to the “entities” array in the Repository Manager.

    Have you heard of this problem and, if you have, do you have any ideas on how to fix it?

    Thank you in advance.

    Jim O

    1. Greg Ichneumon Brown

      I haven’t ever used Symfony or that Elastica bundle, so I unfortunately haven’t seen that problem before.

      We use Elastica directly and it has worked really well for us. Most of the time though we just use raw associative arrays to set the raw ES queries rather than using all of the objects that Elastica has. We’ve mostly found those to be more complicated than necessary. Using raw queries makes it easier for us to go from testing on the API directly to adding the query to our code.

      1. faracasa2014

        Hi Greg, this is just FYI. It was not an Elasticsearch problem; it was a Symfony problem, having to do with the way I was naming my Entity Repository. When I conformed to the naming convention, it worked. Thanks for your feedback, Jim

  12. khoa

    You can actually query across types in Elasticsearch using _all as the type: GET /items/_all/_xkUu8rHHN07hRSUiumuZe

    How expensive is it compared to GET /items/photo/_xkUu8rHHN07hRSUiumuZe? I don’t know.

  13. Andrew D

    The first item “Use Arrays for Fields with Multiple Values” is common to all document-oriented databases. MongoDB uses array-type fields as well to represent related objects (i.e. foreign keys).

    Always remember that ElasticSearch is BOTH a full-text (Lucene) index AND a documents DB. Follow the tips and tricks for both kinds of storage.

  14. Toto

    About your last point, it is in fact possible to achieve this pretty easily using an RDBMS transaction.

    1. Greg Ichneumon Brown

      Maybe… depends a lot on your existing system. In our case, deleting docs from ES happens potentially seconds after the DB delete.

      Also, if I understand what you’re proposing, doesn’t that mean that the DB write fails if the ES update fails? It couples the two systems together pretty tightly.

  15. Kris Meister

    I know this post is old.

    I’d like to point out, from what I’ve read, that storing the whole _source is fine if you set the field property mapping to:
    {"type": "object", "enabled": false}

    This is stored as a binary bson object and is very small and quick to serve.

    1. Greg Ichneumon Brown

      Ya the issue was more that storing individual fields is not very useful unless you are really using them all the time. For instance we store blog_id and post_id because 99% of our queries just return those two values. Basically all other fields are not stored and we just extract them from _source (which is stored).
