Three Principles for Multilingual Indexing in Elasticsearch

Recently I’ve been working on how to build Elasticsearch indices for WordPress blogs in a way that will work across multiple languages. Elasticsearch has a lot of built-in support for different languages, but there are a number of configuration options to wade through, and a few plugins that improve on the built-in support.

Below I’ll lay out the analyzers I am currently using. Some caveats before I start: I’ve done a lot of reading on multi-lingual search, but since I’m really only fluent in one language, there are lots of details about how fluent speakers of other languages use a search engine that I’m sure I don’t understand. This is almost certainly still a work in progress.

In total we have 30 analyzers configured and we’re using the elasticsearch-langdetect plugin to detect 53 languages. For WordPress blogs, users have sometimes set their language to the same language as their content, but very often they have left it as the default of English. So we rely heavily on the language detection plugin to determine which language analyzer to use.
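To make that concrete, here is a rough sketch of the two-step flow we use (the index, type, and field names are made up for illustration; the _langdetect endpoint comes from the langdetect plugin, and the exact request and response format depends on the plugin version):

# 1) Ask the langdetect plugin what language the content is in.
curl -s -XPOST 'http://localhost:9200/_langdetect' -d 'Der schnelle braune Fuchs springt über den faulen Hund.'
# => a list of detected languages with probabilities, e.g. "de"

# 2) Index the document, recording which analyzer to use in a field
#    (we keep the analyzer name in a lang_analyzer field; see the comments below).
curl -s -XPOST 'http://localhost:9200/blogs/post/1' -d '{
  "lang_analyzer": "de_analyzer",
  "content": "Der schnelle braune Fuchs springt über den faulen Hund."
}'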

Update: In the comments, Michael pointed out that since this post was written the langdetect plugin has gained a custom mapping that the mapping example below is not using. I’d highly recommend checking it out for any new implementations.

For configuring the analyzers there are three main principles I’ve pulled from a number of different sources.

1) Use very light or minimal stemming to avoid losing semantic information.

Stemming removes the endings of words to make searches more general; however, it can lose a lot of meaning in the process. For instance, the (quite popular) Snowball Stemmer will do the following:

computation -> comput
computers -> comput
computing -> comput
computer -> comput
computes -> comput

international -> intern
internationals -> intern
intern -> intern
interns -> intern

A lot of information is lost in doing such a zealous transformation. There are some cases though where stemming is very helpful. In English, stemming off the plurals of words should rarely be a problem since the plural is still referring to the same concept. This article on SearchWorkings gives further discussion of the pitfalls of the Snowball Stemmer, and leads to Jacques Savoy’s excellent paper on stemming and stop words as applied to French, Italian, German, and Spanish. Savoy found that doing minimal stemming of plurals and feminine/masculine forms of words performed well for these languages. The minimal_* and light_* stemmers included in Elasticsearch implement these recommendations, allowing us to take a limited stemming approach.

So when there is a minimal stemmer available for a language we use it; otherwise we do not do any stemming at all.
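If you want to see the difference for yourself, the _analyze API makes it easy to compare an aggressive stemmer against a minimal one. A sketch (the test index and filter names are made up, and the filters query parameter may be spelled token_filters depending on your ES version):

curl -s -XPOST 'http://localhost:9200/stemtest' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "en_minimal": { "type": "stemmer", "name": "minimal_english" },
        "en_snowball": { "type": "stemmer", "name": "english" }
      }
    }
  }
}'

# Aggressive stemming: computers -> comput
curl -s 'http://localhost:9200/stemtest/_analyze?tokenizer=standard&filters=lowercase,en_snowball' -d 'computers'

# Minimal stemming: computers -> computer
curl -s 'http://localhost:9200/stemtest/_analyze?tokenizer=standard&filters=lowercase,en_minimal' -d 'computers'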

2) Use stop words for those languages that we have them for.

This ensures that we reduce the size of the index and speed up searches by not trying to match on very frequent terms that provide very little information. Unfortunately, stop words will break certain searches. For instance, searching for “to be or not to be” will not get any results.

The new (as of 0.90) cutoff_frequency parameter on the match query may provide a way to allow indexing stop words, but I am still unsure whether it has implications for other types of queries, or how I would decide what cutoff frequency to use given the wide range of documents and languages in a single index. The very high number of English documents compared to, say, Hebrew also means that Hebrew stop words may not be frequent enough to trigger the cutoff frequency correctly when searching across all documents.
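For reference, this is roughly what such a query looks like (the field name and the 0.01 threshold are placeholders for illustration, not recommendations):

{
  "query": {
    "match": {
      "content": {
        "query": "to be or not to be",
        "cutoff_frequency": 0.01
      }
    }
  }
}

As I understand it, terms above the cutoff are only scored for documents that already match the rarer terms, which is what could let stop words stay in the index without slowing down every query.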

For the moment I’m sticking with the stop words approach. Weaning myself off of them will require a bit more experimentation and thought, but I am intrigued by finding an approach that would allow us to avoid the limitations of stop words and enable finding every blog post referencing Shakespeare’s most famous quote.

3) Try and retain term consistency across all analyzers.

We use the ICU Tokenizer for all cases where the language won’t do significantly better with a custom tokenizer. Japanese, Chinese, and Korean all require smarter tokenization, but using the ICU Tokenizer ensures we treat other languages in a consistent manner. Individual terms are then filtered using the ICU Folding and Normalization filters to ensure consistent terms.

Folding converts a character to an equivalent standard form. The most common conversion that ICU Folding provides is converting characters to lower case, as defined in this exhaustive definition of case folding. But folding goes far beyond lowercasing: many languages have symbols where multiple characters essentially mean the same thing (particularly from a search perspective). UTR30-4 defines the full set of foldings that ICU Folding performs.

Where Folding converts a single character to a standard form, Normalization converts a sequence of characters to a standard form. A good example of this, straight from Wikipedia, is “the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet).” Another entertaining example of character normalization is that some Roman numerals (Ⅸ) can be expressed as a single Unicode character. But of course for search you’d rather have that converted to “IX”. The ICU Normalization sections have links to the many docs defining how normalization is handled.
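You can see both folding and normalization in action by running text through the default analyzer defined at the bottom of this post (the index name here is hypothetical):

curl -s 'http://localhost:9200/myindex/_analyze?analyzer=default&pretty=true' -d 'Ⅸ même'
# => expect two tokens, "ix" and "meme": the Roman numeral is normalized to "IX"
#    and folded to lower case, and the accented characters are folded away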

By indexing using these ICU tools we can be fairly sure that searching across all documents, regardless of language, with just a default analyzer will give results for most queries.

The Details (there are always exceptions to the rules)

  • Asian languages that do not use whitespace for word separation present a non-trivial problem when indexing content. ES comes with a built-in CJK analyzer that indexes every pair of symbols as a term, but there are plugins that are much smarter about how to tokenize the text.
    • For Japanese (ja) we are using the Kuromoji plugin built on top of the seemingly excellent library by Atilika. I don’t know any Japanese, so really I am probably just impressed by their level of documentation, slick website, and the fact that they have an online tokenizer for testing tokenization.
    • There are a couple of different versions of written Chinese (zh), and the language detection plugin distinguishes between zh-tw and zh-cn. For analysis we use the ES Smart Chinese Analyzer for all versions of the language. This is done out of necessity rather than any analysis on my part. The ES plugin wraps the Lucene analyzer which performs sentence and then word segmentation using a Hidden Markov Model.
    • Unfortunately there is currently no custom Korean analyzer for Elasticsearch that I have come across. For that reason we are only using the CJK Analyzer which takes each bi-gram of symbols as a term. However, while writing this post I came across a Lucene mailing list thread from a few days ago which says that a Korean analyzer is in the process of being ported into Lucene. So I have no doubt that will eventually end up in ES or as an ES plugin.
  • Elasticsearch doesn’t have any built-in stop words for Hebrew (he), so we define a custom list pulled from an online list (Update: that site doesn’t exist anymore; our list of stop words is located here). I had some co-workers cull the list a bit to remove a few of the terms that they deemed a bit redundant. I’ll probably end up doing this for some other languages as well if we stick with the stop words approach.
  • Testing 30 analyzers was pretty non-trivial. The ES Inquisitor plugin’s Analyzers tab was incredibly useful for interactively testing text tokenization and stemming against all the different language analyzers to see how they functioned differently.

Finally we come to defining all of these analyzers. Hope this helps you in your multi-lingual endeavors.

Update [Feb 2014]: The PHP code we use for generating analyzers is now open sourced as a part of the wpes-lib project. See that code for the latest methods we are using.

Update [May 2014]: Based on the feedback in the comments and some issues we’ve come across running in production I’ve updated the mappings below. The changes we made are:

  • Perform ICU normalization before removing stopwords, and ICU folding after stopwords. Otherwise stopwords such as “même” in French will not be correctly removed.
  • Adjusted our Japanese language analysis based on a slightly modified use of GMO Media’s methodology. We were seeing a significantly lower click-through rate on Japanese related posts than for other languages, and there was pretty good evidence that morphological language analysis would help.
  • Added the Elision token filter to French: “l’avion” => “avion” (see the sketch just below).
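The mappings below just use the built-in elision filter, but if you want control over which contractions get stripped you can define a custom one along these lines (a sketch; the article list here is hand-picked for illustration):

"fr_elision_filter": {
  "type": "elision",
  "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c"]
}

and then reference fr_elision_filter in place of elision in the fr_analyzer filter chain.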

Potential improvements I haven’t gotten a chance to test yet because we need to run real performance tests to be sure they will actually be an improvement:

  • Duplicate tokens to handle different spellings (e.g. “recognize” vs “recognise”).
  • Morphological analysis of en and ru.
  • Spell checking or phonetic analysis.
  • Include all stop words and rely on cutoff_frequency to avoid the performance problems this will introduce.
  • Index bigrams with the shingle token filter.
  • Duplicate terms, stem them, then unique the terms to try and index both stemmed and non-stemmed forms (see the sketch below).
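For that last idea, the filter chain would probably be built from the keyword_repeat and unique token filters, roughly like this (an untested sketch; the filter names are made up):

"filter": {
  "en_keep_original": { "type": "keyword_repeat" },
  "en_stem_filter": { "type": "stemmer", "name": "minimal_english" },
  "en_unique": { "type": "unique", "only_on_same_position": true }
},
"analyzer": {
  "en_analyzer": {
    "type": "custom",
    "filter": ["icu_normalizer", "en_stop_filter", "en_keep_original", "en_stem_filter", "en_unique", "icu_folding"],
    "tokenizer": "icu_tokenizer"
  }
}

keyword_repeat emits each token twice, marking one copy as a keyword so the stemmer leaves it alone, and the unique filter then drops the duplicate whenever stemming didn’t actually change the term.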

Thanks to everyone in the comments who have helped make our multi-lingual indexing better.

{
  "filter": {
    "ar_stop_filter": {
      "type": "stop",
      "stopwords": ["_arabic_"]
    },
    "bg_stop_filter": {
      "type": "stop",
      "stopwords": ["_bulgarian_"]
    },
    "ca_stop_filter": {
      "type": "stop",
      "stopwords": ["_catalan_"]
    },
    "cs_stop_filter": {
      "type": "stop",
      "stopwords": ["_czech_"]
    },
    "da_stop_filter": {
      "type": "stop",
      "stopwords": ["_danish_"]
    },
    "de_stop_filter": {
      "type": "stop",
      "stopwords": ["_german_"]
    },
    "de_stem_filter": {
      "type": "stemmer",
      "name": "minimal_german"
    },
    "el_stop_filter": {
      "type": "stop",
      "stopwords": ["_greek_"]
    },
    "en_stop_filter": {
      "type": "stop",
      "stopwords": ["_english_"]
    },
    "en_stem_filter": {
      "type": "stemmer",
      "name": "minimal_english"
    },
    "es_stop_filter": {
      "type": "stop",
      "stopwords": ["_spanish_"]
    },
    "es_stem_filter": {
      "type": "stemmer",
      "name": "light_spanish"
    },
    "eu_stop_filter": {
      "type": "stop",
      "stopwords": ["_basque_"]
    },
    "fa_stop_filter": {
      "type": "stop",
      "stopwords": ["_persian_"]
    },
    "fi_stop_filter": {
      "type": "stop",
      "stopwords": ["_finnish_"]
    },
    "fi_stem_filter": {
      "type": "stemmer",
      "name": "light_finish"
    },
    "fr_stop_filter": {
      "type": "stop",
      "stopwords": ["_french_"]
    },
    "fr_stem_filter": {
      "type": "stemmer",
      "name": "minimal_french"
    },
    "he_stop_filter": {
      "type": "stop",
      "stopwords": [/*excluded for brevity*/]
    },
    "hi_stop_filter": {
      "type": "stop",
      "stopwords": ["_hindi_"]
    },
    "hu_stop_filter": {
      "type": "stop",
      "stopwords": ["_hungarian_"]
    },
    "hu_stem_filter": {
      "type": "stemmer",
      "name": "light_hungarian"
    },
    "hy_stop_filter": {
      "type": "stop",
      "stopwords": ["_armenian_"]
    },
    "id_stop_filter": {
      "type": "stop",
      "stopwords": ["_indonesian_"]
    },
    "it_stop_filter": {
      "type": "stop",
      "stopwords": ["_italian_"]
    },
    "it_stem_filter": {
      "type": "stemmer",
      "name": "light_italian"
    },
    "ja_pos_filter": {
      "type": "kuromoji_part_of_speech",
      "stoptags": ["\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c", "\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"]
    },
    "nl_stop_filter": {
      "type": "stop",
      "stopwords": ["_dutch_"]
    },
    "no_stop_filter": {
      "type": "stop",
      "stopwords": ["_norwegian_"]
    },
    "pt_stop_filter": {
      "type": "stop",
      "stopwords": ["_portuguese_"]
    },
    "pt_stem_filter": {
      "type": "stemmer",
      "name": "minimal_portuguese"
    },
    "ro_stop_filter": {
      "type": "stop",
      "stopwords": ["_romanian_"]
    },
    "ru_stop_filter": {
      "type": "stop",
      "stopwords": ["_russian_"]
    },
    "ru_stem_filter": {
      "type": "stemmer",
      "name": "light_russian"
    },
    "sv_stop_filter": {
      "type": "stop",
      "stopwords": ["_swedish_"]
    },
    "sv_stem_filter": {
      "type": "stemmer",
      "name": "light_swedish"
    },
    "tr_stop_filter": {
      "type": "stop",
      "stopwords": ["_turkish_"]
    }
  },
  "analyzer": {
    "ar_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ar_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "bg_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "bg_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ca_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ca_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "cs_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "cs_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "da_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "da_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "de_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "de_stop_filter", "de_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "el_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "el_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "en_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "es_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "es_stop_filter", "es_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "eu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "eu_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fa_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fa_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fi_stop_filter", "fi_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "elision", "fr_stop_filter", "fr_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "he_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "he_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hi_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hu_stop_filter", "hu_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hy_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hy_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "id_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "id_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "it_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "it_stop_filter", "it_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ja_analyzer": {
      "type": "custom",
      "filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
      "tokenizer": "kuromoji_tokenizer"
    },
    "ko_analyzer": {
      "type": "cjk",
      "filter": []
    },
    "nl_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "nl_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "no_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "no_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "pt_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "pt_stop_filter", "pt_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ro_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ro_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ru_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ru_stop_filter", "ru_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "sv_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "sv_stop_filter", "sv_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "tr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "tr_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "zh_analyzer": {
      "type": "custom",
      "filter": ["smartcn_word", "icu_normalizer", "icu_folding"],
      "tokenizer": "smartcn_sentence"
    },
    "lowercase_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "keyword"
    },
    "default": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    }
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}

 



66 responses to “Three Principles for Multilingual Indexing in Elasticsearch”

  1. Gregor

    So for indexing you use the language detection plugin to determine the language of the document and use the corresponding analyzer.
    And for searching you always rely on the default analyzer without attempting to “guess” the language?

    1. Greg

      For indexing, yes we do language detection to select the analyzer.

      When querying, it depends. If we have a good guess at the user’s language (ie they are on de.search.wordpress.com or the site they are on has a particular language selected) then we can use the appropriate language. But when we don’t have a good guess, then we can fall back to the default analyzer which should work pretty well across most languages.

      Ideally we try and use the appropriate language analyzer, but there are definitely cases where I know we won’t be able to so having a fallback is important. The biggest concern with the fallback is how stemming will truncate terms. Hopefully using only minimal stemming will minimize how much impact this has.

      I haven’t done any deep analysis of what impact this has on search relevancy yet though.

      1. Nate

        Greg,

        So when you say you do “language detection”, are you doing this independently from Elasticsearch? Or is there a way to tie content.lang as set by the plugin to a particular analyzer automatically? I am very new to Elasticsearch and it would be helpful to know.

      2. Greg

        Hi Nate

        We run the elasticsearch-langdetect plugin on the same ES cluster and then when indexing first make a call to it to determine the language of the content of the doc. Then we make a separate call to index the document.

        I don’t believe there is a way to index the document and determine the language at the same time.

        It’s also possible to run the langdetect code independent of ES (potentially in your client), but for us using the ES plugin made it easier to deploy and it doesn’t add much load to the cluster.

      3. Nate

        Thanks for the prompt reply! Makes sense.

  2. […] For further study of ICU, the article Three Principles for Multilingual Indexing in Elasticsearch may be useful. […]

  3. Avi G

    Amazing post! Helped me a lot. Thank you for all the information!

  4. Ale

    Hi,
    It’s not clear to me how you decide which analyzer to use depending on the field’s content. Do you have a field for each language, or were you able to use different analyzers for the same field at indexing?

    Thanks!

    1. Greg

      You can specify the analyzer to use when indexing. In my case I have a field for each document called lang_analyzer which specifies which analyzer to use for the document.

      You configure which field is used for specifying the analysis in the _analyzer mapping field.

      For querying you either need to specify the analyzer or you just rely on the default. Using the ICU plugins for analysis ensures consistent tokenization across all languages so that the default should work pretty well.
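      In mapping terms that looks roughly like this (a sketch; the type and field names are placeholders, and it relies on the _analyzer path mapping, which was later deprecated as discussed further down):

      {
        "post": {
          "_analyzer": { "path": "lang_analyzer" },
          "properties": {
            "lang_analyzer": { "type": "string", "index": "not_analyzed" },
            "content": { "type": "string" }
          }
        }
      }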

      1. Ale

        Thanks! It really helped.

  5. Daniel

    Hi Greg! First of all, great post, thanks for it!
    How would you go about if your data was region names in various languages, for instance, I have more than 100k regions, and one document in ES contains the names Munich, München, Munique, etc. Same goes for the 100k+ regions. Having one document per language would make my index grow a lot.
    What I want is an autocomplete where people can search regions, but I don’t really know which language they know the region best in, so they could be viewing the site in English but searching for the region in German. So making an educated guess at the language is hard. Do you think a setup like the one you presented would be appropriate for data like this? Or would you do something different?

    Thanks a lot,
    Daniel

    1. Greg

      If I understand the use case, I think you could just use the ICU tokenizer, folding, and normalization on a single field without any stemming or stop words (the “default” analyzer in the code above). If you are only indexing place names across multiple languages you shouldn’t need stemming/stop words anyways. ICU should give you results that work pretty well across European languages at least. You wouldn’t have any fancy tokenization of Korean, Japanese, or Chinese. I don’t know enough about place names in those languages to know how big a problem that would be.

      If all of the place names you have are already separated, then be sure to index them as an array of strings, and consider indexing them as both an analyzed and a non-analyzed field (see the multi-field type mapping example).

      That way you can retain the original text and sequence of words. Probably some other details to work out to get auto suggest working well also, but I haven’t yet played with the new suggest features.

      1. Daniel

        Thanks for the feedback Greg, I’m trying some stuff to see how it works, and what is faster, and your tips certainly helped.

        Thank you

  6. Michael

    Any idea how to plug in the Polish (stempel) analyzer? Have you tried it?

    1. Greg

      I haven’t tried it. We probably should be using it. 🙂

  7. Michael

    also, how does one use the elasticsearch-langdetect plugin to automatically apply the right analyzer based on the computed language?

    1. Greg

      You can’t auto apply the analyzer to a field unfortunately. You need to make one request to analyze a block of text and get the language and then a separate request to index the data with the appropriate analyzer specified.

      1. Greg

        Oh cool! The langdetect plugin has been updated since I originally wrote this post, and I hadn’t noticed that change.

        Yes, I think that should work. I’ll need to use this method in the future. Thanks!

      2. Michael

        unfortunately the _langdetect method is wayyy inaccurate, especially for short phrases..

      3. Greg

        Ya, I have some custom client code wrapping my call to langdetect so that if there is less than 300 chars of actual text then we don’t bother running it and use some fallbacks.

        I hacked together a quick (probably not working) gist of how we call langdetect: https://gist.github.com/gibrown/8652399

        Might be good to submit an issue against the plugin with specific examples. Short text is generally a harder problem, but there may be some simple changes that will make things better.

  8. Gregor

    Excellent article. I thought readers might be interested in Rosette Search Essentials for Elasticsearch, from Basis Technologies, which we launched last night at hack/reduce in Cambridge, MA. It’s a plugin that does three neat things that improve multilingual search quality:

    – Intelligent CJKT tokenization/segmentation
    – Lemmatization: performs morphological analysis to find the “lemma” or dictionary form of the words in your documents, which is far superior to stemming.
    – Decompounding: languages like German contain compound words that don’t always make great index terms. We break these up into their constituents so you can index them too.

    Handles Arabic, Chinese, Czech, Danish, Dutch, English, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Swedish, Thai and Turkish.

    Check it out here: http://basistech.com/elasticsearch

    Read a bit more about the recall and precision benefits that lemmatization and decompounding can offer in this paper: http://www.basistech.com/search-in-european-languages-whitepaper/

    I’m the Director of Product Management at Basis. I would love feedback on the product and to hear from anyone who has gnarly multilingual search problems.

    1. Greg

      Hi Gregor, thanks for pointing this out and for working to make multi-lingual search better.

      I pretty strongly recommend against using a closed source solution such as yours for something so fundamental as search. My reasoning got lengthy, so I turned it into a full post.

      Happy to discuss more, either publicly or privately.

      Cheers.

  9. slushi

    the link to Hebrew stop words seems to be broken. any ideas on where a good list can be found?

    1. Greg

      Thanks for pointing that out.

      Our complete stop word list is available here: https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-analyzer-builder.php#L351

      Again, I do not speak/read Hebrew, but have had native speakers look at the list. However, they are not NLP researchers.

      If anyone has any suggested updates, please submit a pull request on that repository.

  10. slushi

    I tried out the above settings. I suspected that the above definition could cause issues when language-specific stop words contain “special” characters that would be folded into ASCII characters. I built a gist that demonstrates the problem in French.


    result=`curl -s -XDELETE 'http://localhost:9200/test?pretty=true'`
    echo "$result"
    echo "attempting index creation"
    result=`curl -s -XPOST 'http://localhost:9200/test?pretty=true' -d '{
    "index" : {
    "analysis" : {
    "analyzer" : {
    "_fr" : {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "filter": ["icu_folding", "icu_normalizer", "fr_stop_filter", "fr_stem_filter"]
    },
    "_fr2" : {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "filter": ["icu_normalizer", "fr_stop_filter", "fr_stem_filter"]
    },
    "default": {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "filter": ["icu_folding", "icu_normalizer"]
    }
    },
    "filter" : {
    "fr_stop_filter": {
    "type": "stop",
    "stopwords": ["_french_"]
    },
    "fr_stem_filter": {
    "type": "stemmer",
    "name": "light_french"
    }
    }
    }
    }
    }'`
    echo $result
    result=`curl -s "http://localhost:9200/_analyze?analyzer=french&pretty=true&text=M%C3%AAme"`
    echo 'french analyzer: ' $result
    result=`curl -s "http://localhost:9200/test/_analyze?analyzer=_fr&pretty=true&text=M%C3%AAme"`
    echo 'folding/normalizing analyzer' $result
    result=`curl -s "http://localhost:9200/test/_analyze?analyzer=_fr2&pretty=true&text=M%C3%AAme"`
    echo 'normalizing analyzer' $result


    Did you guys decide this is acceptable? I think if the folding filter is moved to the end of the filter chain, this issue would disappear, but I don’t know what other effects that would have.

    1. Greg

      Wow, you’re totally right. No, it’s not really acceptable, definitely a bug. Thanks!

      I think the folding filter should be last in the list, or we should use custom stopword lists that have the characters already folded. Probably this:

      "filter": ["icu_normalizer", "fr_stop_filter", "fr_stem_filter", "icu_folding"]
      

      This bug probably doesn’t affect search quality too much. It only applies to a few words in each language. However, including stop words in the index definitely makes the index bigger and could significantly slow down searches.

      We’ll have to do some experimentation to figure out what the right filtering is. Will be interesting to see how much of a performance improvement we get from this change.

      FYI, character folding is definitely very worthwhile. We did some work with one of our VIPs on a French site, and without character folding there were definitely complaints about the search.

      Thanks again!

  11. jettro

    Thanks for the nice article. One of the links is dead. The article on searchworkings has moved to: http://blog.trifork.com/2011/12/07/analysing-european-languages-with-lucene/

    regards Jettro

    1. Greg Ichneumon Brown

      Link updated. Thanks for the heads up!

  12. Simon

    I love this post – come back here from time to time, because you’re regularly updating it – thanks for that! Learned a lot here! We’ve used that information for improving search results on our multilingual site Pixabay.com (20 languages).

    To give back something – as a German based company, we could fine tune some things for search in German:

    Instead of plain “icu_folding”, it’s better to use a customized filter and exclude a few special characters:

    "filter": {
      "de_icu_folding": { "type": "icu_folding", "unicodeSetFilter": "[^ßÄäÖöÜü]" },
      "de_stem_filter": { "type": "stemmer", "name": "minimal_german" }
    }

    Then, add a char filter to transform the excluded characters:

    "char_filter": {
      "de_char_filter": {
        "type": "mapping",
        "mappings": ["ß=>ss", "Ä=>ae", "ä=>ae", "Ö=>oe", "ö=>oe", "Ü=>ue", "ü=>ue", "ph=>f"]
      }
    }

    Put it all together in the analyzer:

    "de_analyzer": {
      "type": "custom", "tokenizer": "icu_tokenizer",
      "filter": ["de_stop_filter", "de_icu_folding", "de_stem_filter", "icu_normalizer"],
      "char_filter": ["de_char_filter"]
    }

    Advantage: For example there are the words like “blut” and “blüte” in German, meaning “blood” and “blossom”. Using standard icu_folding, both terms are treated exactly the same way. With the custom char filter, results work as expected. The character “ü” may be written as “ue” in German, which is what the transformation basically does.

    1. Greg Ichneumon Brown

      This is very helpful, thanks.

      I’ve been testing these changes out today, and I’m looking at adding this with a few slight changes into wpes-lib:
      – I just used the default icu_folding because as far as I could tell the char_filter will have changed these characters anyways
      – I also changed the order of the filters to put the normalizer first since one of the reasons for this filter is to combine multi-character sequences into one character before folding.

      I think both of these changes matter more when you are dealing with multi-lingual content in a single document. Any problems you see with this? For your examples it seems to still work well.

      I’m also curious if you have looked at all at using a decompounder in German.

      1. Simon

        If the char_filter is applied before icu_folding takes place, it should work. In which order does ES go through those filters?

        I think, icu normalizer first makes totally sense – I’ll change that in our own code right away.

        Didn’t know about the decompounder so far – but it sounds great! Going to test this soon!

        Thanks, Simon

      2. Greg Ichneumon Brown

        ES always applies char filters first (even before tokenization), so ya that should work well.

        I’d be really interested to hear how the decompounder works for you. It feels like too big a change for me to universally change without doing some thorough testing of its performance. I’d also like to test it for multiple languages and just don’t have the time to devote to it right now.

        Thanks again for the help, I’m going to commit these changes and make them live when we rebuild our index in a few weeks.

      3. Simon

        Not sure if that’s interesting for you, but we also use a word delimiter filter for all latin languages, so not for ja, zh, ko: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html

        "filter": {
          "my_word_delimiter": {
            "type": "word_delimiter",
            "generate_word_parts": false,
            "catenate_words": true,
            "catenate_numbers": true,
            "split_on_case_change": false,
            "preserve_original": true
          }
        }

      4. Greg Ichneumon Brown

        Good to hear that works well for you.

        I have used a word delimiter on some smaller indices, but I (vaguely) remember running into problems in a few cases. I think I decided I didn’t have enough data to figure out how to configure it properly.

        I still feel like my analyzers don’t do a good job with product names and other words where punctuation or case is used as part of the word.

        I’m surprised you don’t use the same filter for ja, zh, and ko. I often see a lot of latin languages mixed in with Asian languages.

      5. Simon

        I guess it wouldn’t really hurt, but in our case, the delimiter also wouldn’t make a (relevant) difference for ja, ko, zh. We’re not dealing with full texts/sentences, but with a lot of keywords that are strictly separated into the different languages. There are a few latin names for cities, countries and the likes, but they would not be affected by the delimiter. So the delimiter would only cost a bit of performance with no real benefit …

      6. Simon

        I’ve looked at the German decompounder – in theory it really looks good and I’d like to use it. However, it’s not well maintained. The update frequency appears to be rather low and there’s no working version for the current ES server 1.1.x or 1.2.

      7. Greg Ichneumon Brown

        Thanks for the update.

        jprante has been pretty responsive to Github issues I’ve submitted elsewhere in the past, so maybe either submit one or even better build it locally and submit a pull request. My guess is that very little has changed in 1.1.x that would affect this.

    2. Rotem

      Do you think that the new built-in German analyzer (available in 1.3.X) is good enough, or do you still need the custom ICU folding?

      1. Greg Ichneumon Brown

        Thanks for pointing that out. We’re still on 1.2 and I hadn’t noticed the new analyzers yet.

        My 2 cents after briefly examining the analyzers (caveat: I don’t know German).

        Some differences with our analyzers:
        – German uses light_stemming rather than minimal_stemming. The ES docs link to this paper. My selection of minimal stemming is based largely on effects of Spanish/French languages. Quite possible light stemming would be better.
        – There is normalization, but it is language specific. I worry that this choice will run into problems with foreign words mixed into the language. Particularly words with accents. ICU seems like a very strong standard. I believe ICU is well regarded within ES, but they don’t bundle it mainly due to its size.
        – In English the analyzer is using the Porter stemmer. I disagree with this decision in most contexts. When I have used the Porter stemmer in real applications I’ve gotten pushback.
        – There appears to be no normalization in English. This means resume and resumé are two different terms.

        I think there are some interesting ideas in this set of analyzers. I like a lot that there are links to the papers that were used to justify the decisions. I also like that ES has tried to put together a basic set of analyzers for users even if I disagree with some of the details. Those details probably depend a lot on your application.

      2. Rotem

        These are good and helpful remarks. Thanks!

      3. rcmuir

        FYI these analyzers are just exposing the lucene per-language analyzers, which means they are specific to the needs of the language, typically undergone formal evaluation etc. They don’t depend on any external libraries like ICU.

        The German normalization, for example, incorporates context-dependent handling that is pretty common in these analyzers, usually can’t be expressed as a Unicode normalization form (i before e, except after c; x only at the end of the word, etc.), and usually isn’t appropriate for other languages.

        And in english, resume is not always a noun…

      4. Greg Ichneumon Brown

        Thanks for the background on the language analyzers (and presumably for building a lot of them 🙂 ).

        Is there a list somewhere of how they are being evaluated? We’re not doing as much systematic evaluation as I’d like yet. We don’t have everything in place to make that worth the effort yet, but it’s something I want us to spend more time on.

        Just as some more background, part of the reason I like ICU so much is that it can mostly be applied across all languages. This helps in cases where the language is unknown or there are a mix of languages being searched across.

        “resumé” wasn’t necessarily the best example I could come up with, but looking at 170k en searches from a few hours of logs I see 7 searches: 4 searches for “resume”, 1 for “iit resume”, 1 for “College resume”, 1 for “resume writer”, and none of “resumé”. I’m kinda doubtful “resume” is used as a verb very often in web search. 🙂

        That’s diving into the weeds a bit (and me being overly nit-picky), I’ve just seen cases where search is considered “broken” by users if the search engine doesn’t correct for these sorts of issues. One case that comes to mind is one of our VIP clients: http://olympic.ca/ and http://olympique.ca/

        Naturally there are a lot of names with accents that need to be handled well whether the user is searching in English or French, and English speaking users rarely type accents.

        I believe that using the Porter stemmer results in similar feelings from users that search is “broken”, but I only have anecdotal feedback as I haven’t tested click through rates.

        This is pretty specific to web search though where I think users have been trained by Google what to expect. Depending on the application though ICU certainly isn’t always the best way to go, and I understand not wanting to bundle with Lucene or ES.

        Cheers

      5. Simon

        In our case (Pixabay.com that is) we stick with the custom analyzer described above. Unfortunately, there’s no (simple) perfect way of handling these special characters:

        Using our own custom analyzer, there’s a difference between “bluten” (bleeding) and “blüten” (blossoms). The new analyzer folds the “ü” to “u”, so both terms become the same -> big problem for us! Same issue as with the previous built-in analyzer(s).

        However, there’s an issue with plural forms: e.g. “häuser” (houses) is the plural of “haus” (house). Using the built-in analyzer including stemmer, there’s no difference between “häuser” and “haus”, because the stemmed term is in both cases “haus”. That’s good. With our own analyzer, stemming won’t work in this case, because the stemmed form of “haeuser” is “haeus” – and not “haus”. Thus, plural and singular are treated as different terms. There are several German nouns that show this behavior, but as I said, in our case, the custom analyzer is probably the better choice.

  13. Florian

    very helpful article – thanks a lot greg!

    i have two questions:
    – what would be a good way to deal with a non detected/defined language? i build a mapping along the lines of the gist Michael posted. each language needs to be defined… content.en, content.ja, etc. how would i deal with a language that had not been defined there?

    – is there a way to use the langdetect plugin to also add/populate a field in the mapping that would contain the language code – for example to use it as a filter?

    cheers
    _f

    1. Greg Ichneumon Brown

      Both of your questions would probably be good feature requests for the langdetect plugin. We still make a separate call to ES to do language detection and then set our lang_analyzer field to indicate which analyzer to apply. There’s three reasons we do this:
      – langdetect does not support every language
      – We do not have a custom analyzer for every language, some need to fall back on our default analyzer (eg Latvian).
      – We have other potential fallbacks we can use if the language detection fails. For example: user settings, lang detection on other content, or predicting based on other user behavior.

      1. Florian

        using the detection separately (or inferring the language from ui settings etc.), works fine for me too. i would have to send the name of the analyzer to use with the query though and i ran into a small problem:


        GET /entries/_search
        {
        "query": {
        "function_score": {
        "query": {
        "filtered": {
        "query": {
        "bool": {
        "must": [
        {
        "multi_match": {
        "fields": [
        "headline",
        "content",
        "comments.content"
        ],
        "query": "明日が楽しみ",
        "use_dis_max": true
        }
        }
        ],
        "minimum_should_match": 0,
        "should": [
        ]
        }
        },
        "filter": {
        }
        }
        }
        },
        "analyzer": "ja_analyzer"
        },
        "sort": {
        "_score": {
        "order": "desc"
        }
        },
        "highlight" : {
        "fields" : {
        "headline" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] },
        "content" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] },
        "comments.content" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] }
        }
        },
        "size": "20",
        "from": 0
        }


        if i use this query i do not get the highlights. if i remove the analyzer parameter i do get highlights, but it then uses the default analyzer…

        am i doing something wrong with the parameter? do you have an example query somewhere that you could post that shows how you send the language/analyzer parameter with the query?

        thanks.

      2. rcmuir

        Hmm, sorry to hear you have to do hacks for Latvian. Actually ES has Latvian support, but somehow it’s missing from the documentation. I’ll fix and make sure no others are missing.

      3. Greg Ichneumon Brown

        That’d be awesome, thanks!

  14. Prashanth

    smartcn_word and smartcn_sentence are no longer available from the plugin. How do you modify your configuration to use the smartcn analyzer and smartcn_tokenizer? Thanks.

    1. Greg Ichneumon Brown

      Hi Prashanth,

      We’ve only just come across this problem as we start upgrading to ES 1.3.2. For new indices we’re changing to using the smartcn_tokenizer with no additional filters. We have existing indices using smartcn_sentence and smartcn_word. Our plan is to hack the smartcn plugin to make these an alias for smartcn_tokenizer, but we haven’t completed that work yet.

      We’ll try submitting our changes as a pull request to the plugin, but not sure whether that’s something they’d like.

      1. Prashanth

        Thanks Greg. Does your Chinese analyzer look like this now?

        "zh_analyzer": {
          "type": "smartcn",
          "filter": "smartcn_tokenizer",
          "tokenizer": "smartcn_tokenizer"
        }

      2. Greg Ichneumon Brown

        Almost. You don’t need a filter. The tokenizer does all the work.
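        I.e. something roughly like this (a sketch):

        "zh_analyzer": {
          "type": "custom",
          "tokenizer": "smartcn_tokenizer"
        }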

      3. Prashanth

        Thank you!

  15. Simon

    Is there a reason for not using a stemmer on Bulgarian?

    1. Simon

      Got it 🙂 I just read you’re only using *minimal* stemming if available …

  16. Marc Vermeulen (@cyclomarc)

    Hello, I wonder whether you are storing your multi-lingual content in one field or in multiple fields? I think you are using one field and then specifying _analyzer and a property in which the language to be used during indexing is specified.

    However, I read from the ES doc that _analyzer will be deprecated as of version 1.5.

    See: https://github.com/elastic/elasticsearch/issues/9279

    Do you have any ideas on this regression? I think the above described solution will no longer work?

    thx
    Marc

    1. Greg Ichneumon Brown

      Hi Marc,

      Thanks for that link. I hadn’t seen that issue yet.

      You’re correct, the above methods will not work with the proposed removal of _analyzer. I understand (and agree with) some of the reasoning leading to that decision, but it feels hasty to not find other ways to improve the situation.

      Having 100s of fields explicitly specified by the application seems problematic. I need to try rebuilding our index mapping and queries to understand what the implications are.

  17. Sumanth Bandi

    icu_folding wrongly modifies Japanese characters, leading to a complete change in meaning; for example, icu_folding of パリ returns ハリ. Do not use icu_folding for Japanese.

    1. Greg Ichneumon Brown

      Hmmm, that’s frustrating. Thanks for the heads up. Will look at fixing it.

  18. Lourdes

    Hi Greg,
    I’m using Elasticsearch 2.3.1 and following this documentation on language analyzers https://www.elastic.co/guide/en/elasticsearch/guide/current/language-intro.html, and I’m trying to sort results based on language.

    Since the documentation mentions Korean and Japanese, I tried to use those analyzers, but I got an “analyzer not found for field” exception for those languages; for other languages I have no problem.
    Can you point me to a better way to sort these languages, or recommend some plugins that can help?

    Thanks for your help.

    1. Greg Ichneumon Brown

      Ya, I really need to write an updated version of this post. Our analyzer configuration is here: https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-analyzer-builder.php

      For each field that we want to be analyzed for different languages we create the mappings with the following function: https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-wp-mappings.php#L282

      That gives us fields such as content.default, content.en, content.ja, etc

      We are running this code in production on ES 2.3 and 2.4.

      Hope that helps.

      1. Lourdes

        Thanks Greg, and first sorry, but I’m still confused and trying to get this working.
        I see on those links what you mention, and I see for Japanese (ja) some specific customization with kuromoji or cjk and others,
        and I don’t understand how to use it.
        Doing this in my mapping, the following search http://localhost:9200/my_index/_search?pretty=true -d
        works for the other languages, except for Japanese and Korean [analyzer not found error].
        Example:
        in my mapping:
        "type_jp": {
          "type": "string",
          "analyzer": "japanese",
          "fields": {
            "raw": {
              "type": "string",
              "analyzer": "case_insensitive_sort"
            }
          }
        }
        my search for language:
        {
          "from": 0,
          "size": 50,
          "query": {
            "match_all": {}
          },
          "sort": [{
            "source.sender.type_jp.raw": {
              "order": "asc",
              "nested_path": "source.sender"
            }
          }]
        }
        Do I need to add the kuromoji plugin for Japanese and another for Korean?
        Sorry again for asking; I’ve been reading a lot but I can’t find how to manage these languages.

      1. Lourdes

        Yes, I could install and use that plugin. thanks!
