Three Principles for Multilingual Indexing in Elasticsearch

Recently I’ve been working on how to build Elasticsearch indices for WordPress blogs in a way that will work across multiple languages. Elasticsearch has a lot of built-in support for different languages, but there are a number of configuration options to wade through, and a few plugins that improve on the built-in support.

Below I’ll lay out the analyzers I am currently using. Some caveats before I start: I’ve done a lot of reading on multi-lingual search, but since I’m really only fluent in one language, there are lots of details about how fluent speakers of other languages use a search engine that I’m sure I don’t understand. This is almost certainly still a work in progress.

In total we have 30 analyzers configured and we’re using the elasticsearch-langdetect plugin to detect 53 languages. For WordPress blogs, users have sometimes set their language to the same language as their content, but very often they have left it as the default of English. So we rely heavily on the language detection plugin to determine which language analyzer to use.
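To make that concrete, here is a rough sketch of the two-step flow we use (the index, type, and field names are made up for illustration; the _langdetect endpoint comes from the langdetect plugin, and the exact request and response format depends on the plugin version):

# 1) Ask the langdetect plugin what language the content is in.
curl -s -XPOST 'http://localhost:9200/_langdetect' -d 'Der schnelle braune Fuchs springt über den faulen Hund.'
# => a list of detected languages with probabilities, e.g. "de"

# 2) Index the document, recording which analyzer to use in a field
#    (we keep the analyzer name in a lang_analyzer field; see the comments below).
curl -s -XPOST 'http://localhost:9200/blogs/post/1' -d '{
  "lang_analyzer": "de_analyzer",
  "content": "Der schnelle braune Fuchs springt über den faulen Hund."
}'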

Update: In the comments, Michael pointed out that since this post was written the langdetect plugin has gained a custom mapping that the mapping example below is not using. I’d highly recommend checking it out for any new implementations.

For configuring the analyzers there are three main principles I’ve pulled from a number of different sources.

1) Use very light or minimal stemming to avoid losing semantic information.

Stemming removes the endings of words to make searches more general; however, it can lose a lot of meaning in the process. For instance, the (quite popular) Snowball Stemmer will do the following:

computation -> comput
computers -> comput
computing -> comput
computer -> comput
computes -> comput

international -> intern
internationals -> intern
intern -> intern
interns -> intern

A lot of information is lost in doing such a zealous transformation. There are some cases though where stemming is very helpful. In English, stemming off the plurals of words should rarely be a problem since the plural is still referring to the same concept. This article on SearchWorkings gives further discussion of the pitfalls of the Snowball Stemmer, and leads to Jacques Savoy’s excellent paper on stemming and stop words as applied to French, Italian, German, and Spanish. Savoy found that doing minimal stemming of plurals and feminine/masculine forms of words performed well for these languages. The minimal_* and light_* stemmers included in Elasticsearch implement these recommendations, allowing us to take a limited stemming approach.

So when there is a minimal stemmer available for a language we use it; otherwise we do not do any stemming at all.
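If you want to see the difference for yourself, the _analyze API makes it easy to compare an aggressive stemmer against a minimal one. A sketch (the test index and filter names are made up, and the filters query parameter may be spelled token_filters depending on your ES version):

curl -s -XPOST 'http://localhost:9200/stemtest' -d '{
  "settings": {
    "analysis": {
      "filter": {
        "en_minimal": { "type": "stemmer", "name": "minimal_english" },
        "en_snowball": { "type": "stemmer", "name": "english" }
      }
    }
  }
}'

# Aggressive stemming: computers -> comput
curl -s 'http://localhost:9200/stemtest/_analyze?tokenizer=standard&filters=lowercase,en_snowball' -d 'computers'

# Minimal stemming: computers -> computer
curl -s 'http://localhost:9200/stemtest/_analyze?tokenizer=standard&filters=lowercase,en_minimal' -d 'computers'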

2) Use stop words for those languages that we have them for.

This ensures that we reduce the size of the index and speed up searches by not trying to match on very frequent terms that provide very little information. Unfortunately, stop words will break certain searches. For instance, searching for “to be or not to be” will not get any results.

The new (as of 0.90) cutoff_frequency parameter on the match query may provide a way to allow indexing stop words, but I am still unsure whether it has implications for other types of queries, or how I would decide what cutoff frequency to use given the wide range of documents and languages in a single index. The very high number of English documents compared to, say, Hebrew also means that Hebrew stop words may not be frequent enough to trigger the cutoff frequency correctly when searching across all documents.
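For reference, this is roughly what such a query looks like (the field name and the 0.01 threshold are placeholders for illustration, not recommendations):

{
  "query": {
    "match": {
      "content": {
        "query": "to be or not to be",
        "cutoff_frequency": 0.01
      }
    }
  }
}

As I understand it, terms above the cutoff are only scored for documents that already match the rarer terms, which is what could let stop words stay in the index without slowing down every query.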

For the moment I’m sticking with the stop words approach. Weaning myself off of them will require a bit more experimentation and thought, but I am intrigued by finding an approach that would allow us to avoid the limitations of stop words and enable finding every blog post referencing Shakespeare’s most famous quote.

3) Try and retain term consistency across all analyzers.

We use the ICU Tokenizer for all cases where the language won’t do significantly better with a custom tokenizer. Japanese, Chinese, and Korean all require smarter tokenization, but using the ICU Tokenizer ensures we treat other languages in a consistent manner. Individual terms are then filtered using the ICU Folding and Normalization filters to ensure consistent terms.

Folding converts a character to an equivalent standard form. The most common conversion that ICU Folding provides is converting characters to lower case, as defined in this exhaustive definition of case folding. But folding goes far beyond lowercasing: many languages have symbols where multiple characters essentially mean the same thing (particularly from a search perspective). UTR30-4 defines the full set of foldings that ICU Folding performs.

Where Folding converts a single character to a standard form, Normalization converts a sequence of characters to a standard form. A good example of this, straight from Wikipedia, is “the code point U+006E (the Latin lowercase “n”) followed by U+0303 (the combining tilde “◌̃”) is defined by Unicode to be canonically equivalent to the single code point U+00F1 (the lowercase letter “ñ” of the Spanish alphabet).” Another entertaining example of character normalization is that some Roman numerals (Ⅸ) can be expressed as a single Unicode character. But of course for search you’d rather have that converted to “IX”. The ICU Normalization sections have links to the many docs defining how normalization is handled.
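You can see both folding and normalization in action by running text through the default analyzer defined at the bottom of this post (the index name here is hypothetical):

curl -s 'http://localhost:9200/myindex/_analyze?analyzer=default&pretty=true' -d 'Ⅸ même'
# => expect two tokens, "ix" and "meme": the Roman numeral is normalized to "IX"
#    and folded to lower case, and the accented characters are folded away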

By indexing using these ICU tools we can be fairly sure that searching across all documents, regardless of language, with just a default analyzer will give results for most queries.

The Details (there are always exceptions to the rules)

  • Asian languages that do not use whitespace for word separation present a non-trivial problem when indexing content. ES comes with a built-in CJK analyzer that indexes every pair of symbols as a term, but there are plugins that are much smarter about how to tokenize the text.
    • For Japanese (ja) we are using the Kuromoji plugin built on top of the seemingly excellent library by Atilika. I don’t know any Japanese, so really I am probably just impressed by their level of documentation, slick website, and the fact that they have an online tokenizer for testing tokenization.
    • There are a couple of different versions of written Chinese (zh), and the language detection plugin distinguishes between zh-tw and zh-cn. For analysis we use the ES Smart Chinese Analyzer for all versions of the language. This is done out of necessity rather than any analysis on my part. The ES plugin wraps the Lucene analyzer which performs sentence and then word segmentation using a Hidden Markov Model.
    • Unfortunately there is currently no custom Korean analyzer for Elasticsearch that I have come across. For that reason we are only using the CJK Analyzer which takes each bi-gram of symbols as a term. However, while writing this post I came across a Lucene mailing list thread from a few days ago which says that a Korean analyzer is in the process of being ported into Lucene. So I have no doubt that will eventually end up in ES or as an ES plugin.
  • Elasticsearch doesn’t have any built-in stop words for Hebrew (he), so we define a custom list pulled from an online list (Update: that site doesn’t exist anymore; our list of stop words is located here). I had some co-workers cull the list a bit to remove a few of the terms that they deemed a bit redundant. I’ll probably end up doing this for some other languages as well if we stick with the stop words approach.
  • Testing 30 analyzers was pretty non-trivial. The ES Inquisitor plugin’s Analyzers tab was incredibly useful for interactively testing text tokenization and stemming against all the different language analyzers to see how they functioned differently.

Finally we come to defining all of these analyzers. Hope this helps you in your multi-lingual endeavors.

Update [Feb 2014]: The PHP code we use for generating analyzers is now open sourced as a part of the wpes-lib project. See that code for the latest methods we are using.

Update [May 2014]: Based on the feedback in the comments and some issues we’ve come across running in production I’ve updated the mappings below. The changes we made are:

  • Perform ICU normalization before removing stopwords, and ICU folding after stopwords. Otherwise stopwords such as “même” in French will not be correctly removed.
  • Adjusted our Japanese language analysis based on a slightly modified use of GMO Media’s methodology. We were seeing a significantly lower click-through rate on Japanese related posts than for other languages, and there was pretty good evidence that morphological language analysis would help.
  • Added the Elision token filter to French: “l’avion” => “avion” (see the sketch just below).
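The mappings below just use the built-in elision filter, but if you want control over which contractions get stripped you can define a custom one along these lines (a sketch; the article list here is hand-picked for illustration):

"fr_elision_filter": {
  "type": "elision",
  "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c"]
}

and then reference fr_elision_filter in place of elision in the fr_analyzer filter chain.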

Potential improvements I haven’t gotten a chance to test yet because we need to run real performance tests to be sure they will actually be an improvement:

  • Duplicate tokens to handle different spellings (e.g. “recognize” vs “recognise”).
  • Morphological analysis of en and ru.
  • Spell checking or phonetic analysis.
  • Include all stop words and rely on cutoff_frequency to avoid the performance problems this will introduce.
  • Index bigrams with the shingle token filter.
  • Duplicate terms, stem them, then unique the terms to try and index both stemmed and non-stemmed forms (see the sketch below).
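For that last idea, the filter chain would probably be built from the keyword_repeat and unique token filters, roughly like this (an untested sketch; the filter names are made up):

"filter": {
  "en_keep_original": { "type": "keyword_repeat" },
  "en_stem_filter": { "type": "stemmer", "name": "minimal_english" },
  "en_unique": { "type": "unique", "only_on_same_position": true }
},
"analyzer": {
  "en_analyzer": {
    "type": "custom",
    "filter": ["icu_normalizer", "en_stop_filter", "en_keep_original", "en_stem_filter", "en_unique", "icu_folding"],
    "tokenizer": "icu_tokenizer"
  }
}

keyword_repeat emits each token twice, marking one copy as a keyword so the stemmer leaves it alone, and the unique filter then drops the duplicate whenever stemming didn’t actually change the term.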

Thanks to everyone in the comments who have helped make our multi-lingual indexing better.

{
  "filter": {
    "ar_stop_filter": {
      "type": "stop",
      "stopwords": ["_arabic_"]
    },
    "bg_stop_filter": {
      "type": "stop",
      "stopwords": ["_bulgarian_"]
    },
    "ca_stop_filter": {
      "type": "stop",
      "stopwords": ["_catalan_"]
    },
    "cs_stop_filter": {
      "type": "stop",
      "stopwords": ["_czech_"]
    },
    "da_stop_filter": {
      "type": "stop",
      "stopwords": ["_danish_"]
    },
    "de_stop_filter": {
      "type": "stop",
      "stopwords": ["_german_"]
    },
    "de_stem_filter": {
      "type": "stemmer",
      "name": "minimal_german"
    },
    "el_stop_filter": {
      "type": "stop",
      "stopwords": ["_greek_"]
    },
    "en_stop_filter": {
      "type": "stop",
      "stopwords": ["_english_"]
    },
    "en_stem_filter": {
      "type": "stemmer",
      "name": "minimal_english"
    },
    "es_stop_filter": {
      "type": "stop",
      "stopwords": ["_spanish_"]
    },
    "es_stem_filter": {
      "type": "stemmer",
      "name": "light_spanish"
    },
    "eu_stop_filter": {
      "type": "stop",
      "stopwords": ["_basque_"]
    },
    "fa_stop_filter": {
      "type": "stop",
      "stopwords": ["_persian_"]
    },
    "fi_stop_filter": {
      "type": "stop",
      "stopwords": ["_finnish_"]
    },
    "fi_stem_filter": {
      "type": "stemmer",
      "name": "light_finish"
    },
    "fr_stop_filter": {
      "type": "stop",
      "stopwords": ["_french_"]
    },
    "fr_stem_filter": {
      "type": "stemmer",
      "name": "minimal_french"
    },
    "he_stop_filter": {
      "type": "stop",
      "stopwords": [/*excluded for brevity*/]
    },
    "hi_stop_filter": {
      "type": "stop",
      "stopwords": ["_hindi_"]
    },
    "hu_stop_filter": {
      "type": "stop",
      "stopwords": ["_hungarian_"]
    },
    "hu_stem_filter": {
      "type": "stemmer",
      "name": "light_hungarian"
    },
    "hy_stop_filter": {
      "type": "stop",
      "stopwords": ["_armenian_"]
    },
    "id_stop_filter": {
      "type": "stop",
      "stopwords": ["_indonesian_"]
    },
    "it_stop_filter": {
      "type": "stop",
      "stopwords": ["_italian_"]
    },
    "it_stem_filter": {
      "type": "stemmer",
      "name": "light_italian"
    },
    "ja_pos_filter": {
      "type": "kuromoji_part_of_speech",
      "stoptags": ["\\u52a9\\u8a5e-\\u683c\\u52a9\\u8a5e-\\u4e00\\u822c", "\\u52a9\\u8a5e-\\u7d42\\u52a9\\u8a5e"]
    },
    "nl_stop_filter": {
      "type": "stop",
      "stopwords": ["_dutch_"]
    },
    "no_stop_filter": {
      "type": "stop",
      "stopwords": ["_norwegian_"]
    },
    "pt_stop_filter": {
      "type": "stop",
      "stopwords": ["_portuguese_"]
    },
    "pt_stem_filter": {
      "type": "stemmer",
      "name": "minimal_portuguese"
    },
    "ro_stop_filter": {
      "type": "stop",
      "stopwords": ["_romanian_"]
    },
    "ru_stop_filter": {
      "type": "stop",
      "stopwords": ["_russian_"]
    },
    "ru_stem_filter": {
      "type": "stemmer",
      "name": "light_russian"
    },
    "sv_stop_filter": {
      "type": "stop",
      "stopwords": ["_swedish_"]
    },
    "sv_stem_filter": {
      "type": "stemmer",
      "name": "light_swedish"
    },
    "tr_stop_filter": {
      "type": "stop",
      "stopwords": ["_turkish_"]
    }
  },
  "analyzer": {
    "ar_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ar_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "bg_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "bg_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ca_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ca_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "cs_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "cs_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "da_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "da_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "de_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "de_stop_filter", "de_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "el_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "el_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "en_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "en_stop_filter", "en_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "es_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "es_stop_filter", "es_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "eu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "eu_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fa_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fa_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "fi_stop_filter", "fi_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "fr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "elision", "fr_stop_filter", "fr_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "he_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "he_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hi_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hi_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hu_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hu_stop_filter", "hu_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "hy_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "hy_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "id_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "id_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "it_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "it_stop_filter", "it_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ja_analyzer": {
      "type": "custom",
      "filter": ["kuromoji_baseform", "ja_pos_filter", "icu_normalizer", "icu_folding", "cjk_width"],
      "tokenizer": "kuromoji_tokenizer"
    },
    "ko_analyzer": {
      "type": "cjk",
      "filter": []
    },
    "nl_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "nl_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "no_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "no_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "pt_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "pt_stop_filter", "pt_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ro_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ro_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "ru_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "ru_stop_filter", "ru_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "sv_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "sv_stop_filter", "sv_stem_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "tr_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "tr_stop_filter", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    },
    "zh_analyzer": {
      "type": "custom",
      "filter": ["smartcn_word", "icu_normalizer", "icu_folding"],
      "tokenizer": "smartcn_sentence"
    },
    "lowercase_analyzer": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "keyword"
    },
    "default": {
      "type": "custom",
      "filter": ["icu_normalizer", "icu_folding"],
      "tokenizer": "icu_tokenizer"
    }
  },
  "tokenizer": {
    "kuromoji": {
      "type": "kuromoji_tokenizer",
      "mode": "search"
    }
  }
}

 



66 responses to “Three Principles for Multilingual Indexing in Elasticsearch”

  1. Gregor

    So for indexing you use the language detection plugin to determine the language of the document and use the corresponding analyzer.
    And for searching you always rely on the default analyzer without attempting to “guess” the language?

    1. Greg

      For indexing, yes we do language detection to select the analyzer.

      When querying, it depends. If we have a good guess at the user’s language (ie they are on de.search.wordpress.com or the site they are on has a particular language selected) then we can use the appropriate language. But when we don’t have a good guess, then we can fall back to the default analyzer which should work pretty well across most languages.

      Ideally we try and use the appropriate language analyzer, but there are definitely cases where I know we won’t be able to so having a fallback is important. The biggest concern with the fallback is how stemming will truncate terms. Hopefully using only minimal stemming will minimize how much impact this has.

      I haven’t done any deep analysis of what impact this has on search relevancy yet though.

      1. Nate

        Greg,

        So when you say you do “language detection”, are you doing this independently from Elasticsearch? Or is there a way to tie content.lang as set by the plugin to a particular analyzer automatically? I am very new to Elasticsearch and it would be helpful to know.

      2. Greg

        Hi Nate

        We run the elasticsearch-langdetect plugin on the same ES cluster and then when indexing first make a call to it to determine the language of the content of the doc. Then we make a separate call to index the document.

        I don’t believe there is a way to index the document and determine the language at the same time.

        It’s also possible to run the langdetect code independent of ES (potentially in your client), but for us using the ES plugin made it easier to deploy and it doesn’t add much load to the cluster.

      3. Nate

        Thanks for the prompt reply! Makes sense.

  2. […] For further study of ICU, the article Three Principles for Multilingual Indexing in Elasticsearch may be useful. […]

  3. Avi G

    Amazing post! Helped me a lot. Thank you for all the information!

  4. Ale

    Hi,
    It’s not clear to me how you decide which analyzer to use depending on the field’s content. Do you have a field for each language, or were you able to use different analyzers for the same field at indexing?

    Thanks!

    1. Greg

      You can specify the analyzer to use when indexing. In my case I have a field for each document called lang_analyzer which specifies which analyzer to use for the document.

      You configure which field is used for specifying the analysis in the _analyzer mapping field.

      For querying you either need to specify the analyzer or you just rely on the default. Using the ICU plugins for analysis ensures consistent tokenization across all languages so that the default should work pretty well.
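      In mapping terms that looks roughly like this (a sketch; the type and field names are placeholders, and it relies on the _analyzer path mapping, which was later deprecated as discussed further down):

      {
        "post": {
          "_analyzer": { "path": "lang_analyzer" },
          "properties": {
            "lang_analyzer": { "type": "string", "index": "not_analyzed" },
            "content": { "type": "string" }
          }
        }
      }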

      1. Ale

        Thanks! It really helped.

  5. Daniel

    Hi Greg! First of all, great post, thanks for it!
    How would you go about if your data was region names in various languages, for instance, I have more than 100k regions, and one document in ES contains the names Munich, München, Munique, etc. Same goes for the 100k+ regions. Having one document per language would make my index grow a lot.
    What I want is an autocomplete where people can search regions, but I don’t really know which language they know the region best in, so they could be viewing the site in English but searching for the region in German. So making an educated guess at the language is hard. Do you think a setup like the one you presented would be appropriate for data like this? Or would you do something different?

    Thanks a lot,
    Daniel

    1. Greg

      If I understand the use case, I think you could just use the ICU tokenizer, folding, and normalization on a single field without any stemming or stop words (the “default” analyzer in the code above). If you are only indexing place names across multiple languages you shouldn’t need stemming/stop words anyways. ICU should give you results that work pretty well across European languages at least. You wouldn’t have any fancy tokenization of Korean, Japanese, or Chinese. I don’t know enough about place names in those languages to know how big a problem that would be.

      If all of the place names you have are already separated, then be sure to index them as an array of strings, and consider indexing them as both an analyzed and a non-analyzed field (see the multi-field type mapping example).

      That way you can retain the original text and sequence of words. Probably some other details to work out to get auto suggest working well also, but I haven’t yet played with the new suggest features.

      1. Daniel

        Thanks for the feedback Greg, I’m trying some stuff to see how it works, and what is faster, and your tips certainly helped.

        Thank you

  6. Michael

    Any idea how to plug in the Polish (stempel) analyzer? Have you tried it?

    1. Greg

      I haven’t tried it. We probably should be using it. 🙂

  7. Michael

    also, how does one use the elasticsearch-langdetect plugin to automatically apply the right analyzer based on the computed language?

    1. Greg

      You can’t auto apply the analyzer to a field unfortunately. You need to make one request to analyze a block of text and get the language and then a separate request to index the data with the appropriate analyzer specified.

      1. Greg

        Oh cool! The langdetect plugin has been updated since I originally wrote this post, and I hadn’t noticed that change.

        Yes, I think that should work. I’ll need to use this method in the future. Thanks!

      2. Michael

        unfortunately the _langdetect method is wayyy inaccurate, especially for short phrases..

      3. Greg

        Ya, I have some custom client code wrapping my call to langdetect so that if there is less than 300 chars of actual text then we don’t bother running it and use some fallbacks.

        I hacked together a quick (probably not working) gist of how we call langdetect: https://gist.github.com/gibrown/8652399

        Might be good to submit an issue against the plugin with specific examples. Short text is generally a harder problem, but there may be some simple changes that will make things better.

  8. Gregor

    Excellent article. I thought readers might be interested in Rosette Search Essentials for Elasticsearch, from Basis Technologies, which we launched last night at hack/reduce in Cambridge, MA. It’s a plugin that does three neat things that improve multilingual search quality:

    – Intelligent CJKT tokenization/segmentation
    – Lemmatization: performs morphological analysis to find the “lemma” or dictionary form of the words in your documents, which is far superior to stemming.
    – Decompounding: languages like German contain compound words that don’t always make great index terms. We break these up into their constituents so you can index them too.

    Handles Arabic, Chinese, Czech, Danish, Dutch, English, French, German, Hebrew, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Swedish, Thai and Turkish.

    Check it out here: http://basistech.com/elasticsearch

    Read a bit more about the recall and precision benefits that lemmatization and decompounding can offer in this paper: http://www.basistech.com/search-in-european-languages-whitepaper/

    I’m the Director of Product Management at Basis. I would love feedback on the product and to hear from anyone who has gnarly multilingual search problems.

    1. Greg

      Hi Gregor, thanks for pointing this out and for working to make multi-lingual search better.

      I pretty strongly recommend against using a closed source solution such as yours for something so fundamental as search. My reasoning got lengthy, so I turned it into a full post.

      Happy to discuss more, either publicly or privately.

      Cheers.

  9. slushi

    the link to Hebrew stop words seems to be broken. any ideas on where a good list can be found?

    1. Greg

      Thanks for pointing that out.

      Our complete stop word list is available here: https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-analyzer-builder.php#L351

      Again, I do not speak/read Hebrew, but have had native speakers look at the list. However, they are not NLP researchers.

      If anyone has any suggested updates, please submit a pull request on that repository.

  10. slushi

    I tried out the above settings. I suspected that the above definition could cause issues when language-specific stop words contain “special” characters that would be folded into ASCII characters. I built a gist that demonstrates the problem in French.


    result=`curl -s -XDELETE 'http://localhost:9200/test?pretty=true'`
    echo "$result"
    echo "attempting index creation"
    result=`curl -s -XPOST 'http://localhost:9200/test?pretty=true' -d '{
    "index" : {
    "analysis" : {
    "analyzer" : {
    "_fr" : {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "filter": ["icu_folding", "icu_normalizer", "fr_stop_filter", "fr_stem_filter"]
    },
    "_fr2" : {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "filter": ["icu_normalizer", "fr_stop_filter", "fr_stem_filter"]
    },
    "default": {
    "type": "custom",
    "tokenizer": "icu_tokenizer",
    "filter": ["icu_folding", "icu_normalizer"]
    }
    },
    "filter" : {
    "fr_stop_filter": {
    "type": "stop",
    "stopwords": ["_french_"]
    },
    "fr_stem_filter": {
    "type": "stemmer",
    "name": "light_french"
    }
    }
    }
    }
    }'`
    echo $result
    result=`curl -s "http://localhost:9200/_analyze?analyzer=french&pretty=true&text=M%C3%AAme"`
    echo 'french analyzer: ' $result
    result=`curl -s "http://localhost:9200/test/_analyze?analyzer=_fr&pretty=true&text=M%C3%AAme"`
    echo 'folding/normalizing analyzer' $result
    result=`curl -s "http://localhost:9200/test/_analyze?analyzer=_fr2&pretty=true&text=M%C3%AAme"`
    echo 'normalizing analyzer' $result


    Did you guys decide this is acceptable? I think if the folding filter is moved to the end of the filter chain, this issue would disappear, but I don’t know what other effects that would have.

    1. Greg

      Wow, you’re totally right. No, it’s not really acceptable, definitely a bug. Thanks!

      I think the folding filter should be last in the list, or we should use custom stopword lists that have the characters already folded. Probably this:

      "filter": ["icu_normalizer", "fr_stop_filter", "fr_stem_filter", "icu_folding"]
      

      This bug probably doesn’t affect search quality too much. It only applies to a few words in each language. However, including stop words in the index definitely makes the index bigger and could significantly slow down searches.

      We’ll have to do some experimentation to figure out what the right filtering is. Will be interesting to see how much of a performance improvement we get from this change.

      FYI, character folding is definitely very worthwhile. We did some work with one of our VIPs on a French site, and without character folding there were definitely complaints about the search.

      Thanks again!

  11. jettro

    Thanks for the nice article. One of the links is dead. The article on searchworkings has moved to: http://blog.trifork.com/2011/12/07/analysing-european-languages-with-lucene/

    regards Jettro

    1. Greg Ichneumon Brown

      Link updated. Thanks for the heads up!

  12. Simon

    I love this post – come back here from time to time, because you’re regularly updating it – thanks for that! Learned a lot here! We’ve used that information for improving search results on our multilingual site Pixabay.com (20 languages).

    To give back something – as a German based company, we could fine tune some things for search in German:

    Instead of plain “icu_folding”, it’s better to use a customized filter and exclude a few special characters:

    "filter": {
      "de_icu_folding": { "type": "icu_folding", "unicodeSetFilter": "[^ßÄäÖöÜü]" },
      "de_stem_filter": { "type": "stemmer", "name": "minimal_german" }
    }

    Then, add a char filter to transform the excluded characters:

    "char_filter": {
      "de_char_filter": {
        "type": "mapping",
        "mappings": ["ß=>ss", "Ä=>ae", "ä=>ae", "Ö=>oe", "ö=>oe", "Ü=>ue", "ü=>ue", "ph=>f"]
      }
    }

    Put it all together in the analyzer:

    "de_analyzer": {
      "type": "custom", "tokenizer": "icu_tokenizer",
      "filter": ["de_stop_filter", "de_icu_folding", "de_stem_filter", "icu_normalizer"],
      "char_filter": ["de_char_filter"]
    }

    Advantage: For example there are the words like “blut” and “blüte” in German, meaning “blood” and “blossom”. Using standard icu_folding, both terms are treated exactly the same way. With the custom char filter, results work as expected. The character “ü” may be written as “ue” in German, which is what the transformation basically does.

    1. Greg Ichneumon Brown

      This is very helpful, thanks.

      I’ve been testing these changes out today, and I’m looking at adding this with a few slight changes into wpes-lib:
      – I just used the default icu_folding because as far as I could tell the char_filter will have changed these characters anyways
      – I also changed the order of the filters to put the normalizer first since one of the reasons for this filter is to combine multi-character sequences into one character before folding.

      I think both of these changes matter more when you are dealing with multi-lingual content in a single document. Any problems you see with this? For your examples it seems to still work well.

      I’m also curious if you have looked at all at using a decompounder in German.

      1. Simon

        If the char_filter is applied before icu_folding takes place, it should work. In which order does ES go through those filters?

        I think, icu normalizer first makes totally sense – I’ll change that in our own code right away.

        Didn’t know about the decompounder so far – but it sounds great! Going to test this soon!

        Thanks, Simon

      2. Greg Ichneumon Brown

        ES always applies char filters first (even before tokenization), so ya that should work well.

        I’d be really interested to hear how the decompounder works for you. It feels like too big a change for me to universally change without doing some thorough testing of its performance. I’d also like to test it for multiple languages and just don’t have the time to devote to it right now.

        Thanks again for the help, I’m going to commit these changes and make them live when we rebuild our index in a few weeks.

      3. Simon

        Not sure if that’s interesting for you, but we also use a word delimiter filter for all latin languages, so not for ja, zh, ko: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/analysis-word-delimiter-tokenfilter.html

        "filter": {
          "my_word_delimiter": {
            "type": "word_delimiter",
            "generate_word_parts": false,
            "catenate_words": true,
            "catenate_numbers": true,
            "split_on_case_change": false,
            "preserve_original": true
          }
        }

      4. Greg Ichneumon Brown

        Good to hear that works well for you.

        I have used a word delimiter on some smaller indices, but I (vaguely) remember running into problems in a few cases. I think I decided I didn’t have enough data to figure out how to configure it properly.

        I still feel like my analyzers don’t do a good job with product names and other words where punctuation or case is used as part of the word.

        I’m surprised you don’t use the same filter for ja, zh, and ko. I often see a lot of latin languages mixed in with Asian languages.

      5. Simon

        I guess it wouldn’t really hurt, but in our case, the delimiter also wouldn’t make a (relevant) difference for ja, ko, zh. We’re not dealing with full texts/sentences, but with a lot of keywords that are strictly separated into the different languages. There are a few latin names for cities, countries and the likes, but they would not be affected by the delimiter. So the delimiter would only cost a bit of performance with no real benefit …

      6. Simon

        I’ve looked at the German decompounder – in theory it really looks good and I’d like to use it. However, it’s not well maintained. The update frequency appears to be rather low and there’s no working version for the current ES server 1.1.x or 1.2.

      7. Greg Ichneumon Brown

        Thanks for the update.

        jprante has been pretty responsive to Github issues I’ve submitted elsewhere in the past, so maybe either submit one or even better build it locally and submit a pull request. My guess is that very little has changed in 1.1.x that would affect this.

    2. Rotem

      Do you think that the new built-in German analyzer (available in 1.3.X) is good enough, or do you still need the custom ICU folding?

      1. Greg Ichneumon Brown

        Thanks for pointing that out. We’re still on 1.2 and I hadn’t noticed the new analyzers yet.

        My 2 cents after briefly examining the analyzers (caveat: I don’t know German).

        Some differences with our analyzers:
        – German uses light_stemming rather than minimal_stemming. The ES docs link to this paper. My selection of minimal stemming is based largely on effects of Spanish/French languages. Quite possible light stemming would be better.
        – There is normalization, but it is language specific. I worry that this choice will run into problems with foreign words mixed into the language. Particularly words with accents. ICU seems like a very strong standard. I believe ICU is well regarded within ES, but they don’t bundle it mainly due to its size.
        – In English the analyzer is using the Porter stemmer. I disagree with this decision in most contexts. When I have used the Porter stemmer in real applications I’ve gotten pushback.
        – There appears to be no normalization in English. This means resume and resumé are two different terms.

        I think there are some interesting ideas in this set of analyzers. I like a lot that there are links to the papers that were used to justify the decisions. I also like that ES has tried to put together a basic set of analyzers for users even if I disagree with some of the details. Those details probably depend a lot on your application.

      2. Rotem

        These are good and helpful remarks. Thanks!

      3. rcmuir

        FYI these analyzers are just exposing the lucene per-language analyzers, which means they are specific to the needs of the language, typically undergone formal evaluation etc. They don’t depend on any external libraries like ICU.

        The German normalization, for example, incorporates context-dependent handling that is pretty common in these analyzers, usually can’t be expressed as a Unicode normalization form (i before e, except after c; x only at the end of the word, etc.), and usually isn’t appropriate for other languages.

        And in english, resume is not always a noun…

      4. Greg Ichneumon Brown

        Thanks for the background on the language analyzers (and presumably for building a lot of them 🙂 ).

        Is there a list somewhere of how they are being evaluated? We’re not doing as much systematic evaluation as I’d like yet. We don’t have everything in place to make that worth the effort yet, but it’s something I want us to spend more time on.

        Just as some more background, part of the reason I like ICU so much is that it can mostly be applied across all languages. This helps in cases where the language is unknown or there are a mix of languages being searched across.

        “resumé” wasn’t necessarily the best example I could come up with, but looking at 170k en searches from a few hours of logs I see 7 searches: 4 searches for “resume”, 1 for “iit resume”, 1 for “College resume”, 1 for “resume writer”, and none of “resumé”. I’m kinda doubtful “resume” is used as a verb very often in web search. 🙂

        That’s diving into the weeds a bit (and me being overly nit-picky), I’ve just seen cases where search is considered “broken” by users if the search engine doesn’t correct for these sorts of issues. One case that comes to mind is one of our VIP clients: http://olympic.ca/ and http://olympique.ca/

        Naturally there are a lot of names with accents that need to be handled well whether the user is searching in English or French, and English speaking users rarely type accents.

        I believe that using the Porter stemmer results in similar feelings from users that search is “broken”, but I only have anecdotal feedback as I haven’t tested click through rates.

        This is pretty specific to web search though where I think users have been trained by Google what to expect. Depending on the application though ICU certainly isn’t always the best way to go, and I understand not wanting to bundle with Lucene or ES.

        Cheers

      5. Simon

        In our case (Pixabay.com that is) we stick with the custom analyzer described above. Unfortunately, there’s no (simple) perfect way of handling these special characters:

        Using our own custom analyzer, there’s a difference between “bluten” (bleeding) and “blüten” (blossoms). The new analyzer folds the “ü” to “u”, so both terms become the same -> big problem for us! Same issue as with the previous built-in analyzer(s).

        However, there’s an issue with plural forms: e.g. “häuser” (houses) is the plural of “haus” (house). Using the built-in analyzer including stemmer, there’s no difference between “häuser” and “haus”, because the stemmed term is in both cases “haus”. That’s good. With our own analyzer, stemming won’t work in this case, because the stemmed form of “haeuser” is “haeus” – and not “haus”. Thus, plural and singular are treated as different terms. There are several German nouns that show this behavior, but as I said, in our case, the custom analyzer is probably the better choice.

  13. Florian

    very helpful article – thanks a lot greg!

    i have two questions:
    – what would be a good way to deal with a non detected/defined language? i build a mapping along the lines of the gist Michael posted. each language needs to be defined… content.en, content.ja, etc. how would i deal with a language that had not been defined there?

    – is there a way to use the langdetect plugin to also add/populate a field in the mapping that would contain the language code – for example to use it as a filter?

    cheers
    _f

    1. Greg Ichneumon Brown

      Both of your questions would probably be good feature requests for the langdetect plugin. We still make a separate call to ES to do language detection and then set our lang_analyzer field to indicate which analyzer to apply. There’s three reasons we do this:
      – langdetect does not support every language
      – We do not have a custom analyzer for every language, some need to fall back on our default analyzer (eg Latvian).
      – We have other potential fallbacks we can use if the language detection fails. For example: user settings, lang detection on other content, or predicting based on other user behavior.

      1. Florian

        using the detection separately (or inferring the language from ui settings etc.), works fine for me too. i would have to send the name of the analyzer to use with the query though and i ran into a small problem:


        GET /entries/_search
        {
        "query": {
        "function_score": {
        "query": {
        "filtered": {
        "query": {
        "bool": {
        "must": [
        {
        "multi_match": {
        "fields": [
        "headline",
        "content",
        "comments.content"
        ],
        "query": "明日が楽しみ",
        "use_dis_max": true
        }
        }
        ],
        "minimum_should_match": 0,
        "should": [
        ]
        }
        },
        "filter": {
        }
        }
        }
        },
        "analyzer": "ja_analyzer"
        },
        "sort": {
        "_score": {
        "order": "desc"
        }
        },
        "highlight" : {
        "fields" : {
        "headline" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] },
        "content" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] },
        "comments.content" : { "pre_tags" : [ "<b>" ], "post_tags" : [ "</b>" ] }
        }
        },
        "size": "20",
        "from": 0
        }


        if i use this query i do not get the highlights. if i remove the analyzer parameter i do get highlights, but it then uses the default analyzer…

        am i doing something wrong with the parameter? do you have an example query somewhere that you could post that shows how you send the language/analyzer parameter with the query?

        thanks.

      2. rcmuir

        Hmm, sorry to hear you have to do hacks for Latvian. Actually ES has Latvian support, but somehow it’s missing from the documentation. I’ll fix and make sure no others are missing.

      3. Greg Ichneumon Brown

        That’d be awesome, thanks!

  14. Prashanth

    smartcn_word and smartcn_sentence are no longer available from the plugin. How do you modify your configuration to use the smartcn analyzer and smartcn_tokenizer? Thanks.

    1. Greg Ichneumon Brown

      Hi Prashanth,

      We’ve only just come across this problem as we start upgrading to ES 1.3.2. For new indices we’re changing to using the smartcn_tokenizer with no additional filters. We have existing indices using smartcn_sentence and smartcn_word. Our plan is to hack the smartcn plugin to make these an alias for smartcn_tokenizer, but we haven’t completed that work yet.

      We’ll try submitting our changes as a pull request to the plugin, but not sure whether that’s something they’d like.

      1. Prashanth

        Thanks Greg. Does your Chinese analyzer look like this now?

        "zh_analyzer": {
          "type": "smartcn",
          "filter": "smartcn_tokenizer",
          "tokenizer": "smartcn_tokenizer"
        }

      2. Greg Ichneumon Brown

        Almost. You don’t need a filter. The tokenizer does all the work.
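        I.e. something roughly like this (a sketch):

        "zh_analyzer": {
          "type": "custom",
          "tokenizer": "smartcn_tokenizer"
        }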

      3. Prashanth

        Thank you!

  15. Simon

    Is there a reason for not using a stemmer on Bulgarian?

    1. Simon

      Got it 🙂 I just read you’re only using *minimal* stemming if available …

  16. Marc Vermeulen (@cyclomarc)

    Hello, I wonder whether you are storing your multi-lingual content in one field or in multiple fields? I think you are using one field and then specifying _analyzer and a property in which the language to be used during indexing is specified.

    However, I read from the ES doc that _analyzer will be deprecated as of version 1.5.

    See: https://github.com/elastic/elasticsearch/issues/9279

    Do you have any ideas on this regression? I think the above described solution will no longer work?

    thx
    Marc

    1. Greg Ichneumon Brown

      Hi Marc,

      Thanks for that link. I hadn’t seen that issue yet.

      You’re correct, the above methods will not work with the proposed removal of _analyzer. I understand (and agree with) some of the reasoning leading to that decision, but it feels hasty to not find other ways to improve the situation.

      Having 100s of fields explicitly specified by the application seems problematic. I need to try rebuilding our index mapping and queries to understand what the implications are.

  17. Sumanth Bandi

    icu_folding wrongly modifies Japanese characters, leading to a complete change in meaning; for example, icu_folding of パリ returns ハリ. Do not use icu_folding for Japanese.

    1. Greg Ichneumon Brown

      Hmmm, that’s frustrating. Thanks for the heads up. Will look at fixing it.

  18. Lourdes

    Hi Greg,
    I’m using Elasticsearch 2.3.1 and following this documentation on language analyzers https://www.elastic.co/guide/en/elasticsearch/guide/current/language-intro.html, and I’m trying to sort results based on language.

    Since the documentation mentions Korean and Japanese, I tried to use those analyzers, but I got an “analyzer not found for field” exception for those languages; for other languages I have no problem.
    Can you point me to a better way to sort these languages, or recommend some plugins that can help?

    Thanks for your help.

    1. Greg Ichneumon Brown

      Ya, I really need to write an updated version of this post. Our analyzer configuration is here: https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-analyzer-builder.php

      For each field that we want to be analyzed for different languages we create the mappings with the following function: https://github.com/Automattic/wpes-lib/blob/master/src/common/class.wpes-wp-mappings.php#L282

      That gives us fields such as content.default, content.en, content.ja, etc

      We are running this code in production on ES 2.3 and 2.4.

      Hope that helps.

      1. Lourdes

        Thanks Greg, and first sorry, but I’m still confused and trying to get this working.
        I see on those links what you mention, and I see for Japanese (ja) some specific customization with kuromoji or cjk and others,
        and I don’t understand how to use it.
        Doing this in my mapping, the following search http://localhost:9200/my_index/_search?pretty=true -d
        works for the other languages, except for Japanese and Korean [analyzer not found error].
        Example:
        in my mapping:
        "type_jp": {
          "type": "string",
          "analyzer": "japanese",
          "fields": {
            "raw": {
              "type": "string",
              "analyzer": "case_insensitive_sort"
            }
          }
        }
        my search for language:
        {
          "from": 0,
          "size": 50,
          "query": {
            "match_all": {}
          },
          "sort": [{
            "source.sender.type_jp.raw": {
              "order": "asc",
              "nested_path": "source.sender"
            }
          }]
        }
        Do I need to add the kuromoji plugin for Japanese and another for Korean?
        Sorry again for asking; I’ve been reading a lot but I can’t find how to manage these languages.

      1. Lourdes

        Yes, I could install and use that plugin. thanks!
