I'm building an in-site search engine with Elasticsearch, and I'm new to Elasticsearch. The sites that will use this engine are in Turkish and English.

Turkish has letters such as 'ğ', 'ü', 'ş', 'ı', 'ö', 'ç', but when searching we usually type 'g', 'u', 's', 'i', 'o', 'c' instead. This is not a rule, just a common habit.

I have a document type called "product" with several string properties, some of them on nested objects. For example:

public class Product {
    public string ProductName { get; set; }
    public Category Category { get; set; }
    //...
}
public class Category {
    public string CategoryName { get; set; }
    //...
}

My goal is this:

  • ProductName or Category.CategoryName may contain Turkish letters ("Eşarp") or some may be mistyped and written with English letters ("Esarp")
  • Querystring may contain Turkish letters ("eşarp") or not ("esarp")
  • Querystring may have multiple words
  • Every indexed string field should be searched against querystring (full-text search)

Now, what I did:

  • When creating the index, I also configured the mappings and defined a custom analyzer called "sanalyze", which uses the standard tokenizer with the "lowercase" and "asciifolding" filters (rather than the standard analyzer).
  • I used that custom analyzer in the mappings of the string fields.

Example code for mapping:

// some more mappings which uses the same mapping for all string fields.
.Map<Yaziylabir.Extensions.TagManagement.Models.TagModel>(m => m.AutoMap().Properties(p => p
    .String(s => s
        .Name(n => n.Tag).Analyzer("sanalyze")))))
.Settings(s => s
    .Analysis(ans => ans
        .Analyzers(anl => anl
            .Custom("sanalyze", c => c
                .Tokenizer("standard")
                .Filters("lowercase", "asciifolding")))));
  • I deleted and recreated the index, then reindexed my documents.
  • Now I'm trying to search in that index.
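
For readers more used to the REST API, the analyzer defined in the NEST code above should correspond roughly to the following index settings. This is a sketch of the equivalent JSON written from the description in the question, not a dump of the actual index settings.

```json
{
  "settings": {
    "analysis": {
      "analyzer": {
        "sanalyze": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  }
}
```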

I tried two different queries against the indexed documents:

q &= Query<ProductModel>.QueryString(t => t.Query(Keyword).Analyzer("sanalyze"));

q &= Query<ProductModel>.QueryString(t => t.Query(Keyword));

The second doesn't call the Analyzer method because the Elasticsearch documentation says that the analyzer configured on a field is also used at search time, so I assumed there was no need to specify it again when searching.

What I got as a result:

  • First query (with Analyzer("sanalyze")): searching for "eşarp" or "esarp" returns no results. Searching for "bordo" returns results.
  • Second query (without Analyzer("sanalyze")): searching for "eşarp" returns results. Searching for "esarp" returns no results. Searching for "bordo" returns results.

BTW:

  • Documents contain "Eşarp" as the ProductName value, and I verified that Elasticsearch created "esarp" as the term for that field.

  • Documents contain "Bordo" as a value and "bordo" as the field term.
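
One way to run the check described above is the _analyze API. The request body below is a sketch: it would be sent as GET /your_index/_analyze, where your_index is a placeholder for the real index name, and it assumes an Elasticsearch version that accepts a JSON body on that endpoint. According to the check described in the question, the response should contain a single token, "esarp".

```json
{
  "analyzer": "sanalyze",
  "text": "Eşarp"
}
```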

I couldn't achieve what I want. What am I doing wrong?

  • Should I use another filter instead of asciifolding?
  • Should I use preserveOriginal with asciifolding? I'd rather not use that option, so that it doesn't skew the scores.
  • Is there something else I should do?

Can you please help me?

If you think it is not clear what I'm asking, please tell me, I will try to make it clearer.

Thank you.

@RussCam, this is my new question :-) If you can help me, I'd be most grateful. - zokkan
It looks like you have an encoding issue. ASCII encoding removes non-printable characters, so you only get characters 0-127 and not 128-255, which is where the non-standard Arabic characters are located. I'm not sure whether your text may also contain Unicode characters. I have seen the same issue with the ToString() method, which also uses ASCII encoding. - jdweng
@jdweng But there is one thing that confuses me. When a property has a value like "Eşarp", I can verify that "esarp" is created as a term/token. So as far as I can tell, the sanalyze analyzer works fine at index time. When searching, I guess I need something (I don't know what) that does to the querystring what the sanalyze analyzer does to string fields at index time, and then searches the indexed terms for that analyzed querystring. Am I wrong? If "eşarp" is indexed and stored as the term "esarp", then a querystring of "eşarp" that is searched as "esarp" should cause no problem? - zokkan
Any string search method should work. The filtering is causing the issue. I don't know Turkish, but I have seen many strange issues with different languages. Some languages have more than one upper/lower case form for a character. asciifolding will ignore some characters, so it will work in some instances and not in others. You may need to write your own upper/lower case method to resolve the issue. I wouldn't use asciifolding. - jdweng
@zokkan, this may help you: stackoverflow.com/a/37525868/1831 - Russ Cam

1 Answer

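With the default settings, query_string searches the _all field. The _all field has its own analyzer, the standard one, which lowercases but does not apply asciifolding. So in _all, "Eşarp" is indexed as "eşarp": a query for "eşarp" matches it, while "esarp" does not, whether you type it directly or it is produced by running "eşarp" through "sanalyze".

You need to specify which field query_string should act on: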

  "query": {
    "query_string": {
      "query": "your_field_name:esarp"
    }
  }

or

  "query": {
    "query_string": {
      "query": "esarp",
      "default_field": "your_field_name"
    }
  }
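
Since the goal is to search several string fields at once, the two forms above can be extended with the "fields" parameter of query_string. The following is a sketch under assumptions: the field names productName and category.categoryName are guesses based on the classes in the question and NEST's default camel-casing of property names, and Category is assumed to be mapped as a plain object rather than as a nested type. Each listed field is then searched with its own analyzer ("sanalyze" here), so "esarp" and "eşarp" should both be folded to the same term.

```json
{
  "query": {
    "query_string": {
      "query": "esarp",
      "fields": ["productName", "category.categoryName"]
    }
  }
}
```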