
I'm an Elasticsearch newbie.

Let's say we have a class like this:

public class A
{
    public string name;
}

And we have two documents with the names "Ayşe" and "Ayse".

Now, I want to store names with their accents, but when I search I want an accent-insensitive query to return the accent-sensitive (original) values.

For example: when I search for "Ayse" or "Ayşe", it should return both "Ayşe" and "Ayse" as they were stored (with accents).

Right now, when I search for "Ayse" it only returns "Ayse", but I want to get "Ayşe" as a result too.

When I checked the Elasticsearch documentation, I saw that folded fields need to be used to achieve this, but I couldn't understand how to do it with NEST attributes/functions.

By the way, I'm using AutoMap to create mappings right now and, if possible, I'd like to continue using it.

I've been searching for an answer for two days now and haven't figured it out yet.

What changes are required, and where? Can you provide code sample(s)?

Thank you.

EDIT 1:

I figured out how to use analyzers to create sub-fields of a property and achieve results with a term-based query against the sub-fields.

Now, I know I can do a multi-field search, but is there a way to include sub-fields in a full-text search?

Thank you.

I think what I need is something like elastic.co/guide/en/elasticsearch/guide/current/… but I don't understand how to use it with .NET, because I have only one property, name, and don't have name.folded, so I can't query name.folded via LINQ. Also, in my real situation I have tons of properties to query, so I'm wondering if there is a way to store only the original value like 'Ayşe', query with 'Ayse', and get 'Ayşe' as a result. There are two attribute parameters of Nest.String: Analyzer and SearchAnalyzer. I have tons of questions to ask about it. – zokkan
@DhruvPal Because I don't have all the possibilities recorded, the method you mentioned doesn't seem to be useful for my situation. – zokkan

1 Answer


You can configure an analyzer to perform analysis on the text at index time, index the result into a multi_field to use at query time, and keep the original source to return in the results. Based on what you have in your question, it sounds like you want a custom analyzer that uses the asciifolding token filter to convert characters to their ASCII equivalents at index and search time.

Given the following document

public class Document
{
    public int Id { get; set;}
    public string Name { get; set; }
}
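
The snippets that follow reference a client and a documentsIndex index name. A minimal setup might look like the following sketch (the node URI is an assumption; "documents" matches the index name seen in the search output further down):

// assumed index name and connection; adjust for your environment
var documentsIndex = "documents";

var settings = new ConnectionSettings(new Uri("http://localhost:9200"))
    .DefaultIndex(documentsIndex);

var client = new ElasticClient(settings);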

Setting up a custom analyzer can be done when an index is created; we can also specify the mapping at the same time

client.CreateIndex(documentsIndex, ci => ci
    .Settings(s => s
        .NumberOfShards(1)
        .NumberOfReplicas(0)
        .Analysis(analysis => analysis
            .TokenFilters(tokenfilters => tokenfilters
                .AsciiFolding("folding-preserve", ft => ft
                    .PreserveOriginal()
                )
            )
            .Analyzers(analyzers => analyzers
                .Custom("folding-analyzer", c => c
                    .Tokenizer("standard")
                    .Filters("standard", "folding-preserve")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Document>(mm => mm
            .AutoMap()
            .Properties(p => p
                .String(s => s
                    .Name(n => n.Name)
                    .Fields(f => f
                        .String(ss => ss
                            .Name("folding")
                            .Analyzer("folding-analyzer")
                        )
                    )
                    .NotAnalyzed()
                )
            )
        )
    )
);

Here I've created an index with one shard and no replicas (you may want to change this for your environment), and have created a custom analyzer, folding-analyzer, that uses the standard tokenizer in conjunction with the standard token filter and a folding-preserve token filter that performs ASCII folding, storing the original tokens in addition to the folded tokens (more on why this may be useful in a minute).

I've also mapped the Document type, mapping the Name property as a multi_field with the default field not_analyzed (useful for aggregations) and a .folding sub-field that will be analyzed with the folding-analyzer. The original source document will also be stored by Elasticsearch by default.
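
For reference, the settings and mapping JSON that this request produces is roughly the following (sketched by hand, so details that AutoMap infers, such as the id field's type, may differ slightly):

{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "analysis": {
      "filter": {
        "folding-preserve": {
          "type": "asciifolding",
          "preserve_original": true
        }
      },
      "analyzer": {
        "folding-analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "standard", "folding-preserve" ]
        }
      }
    }
  },
  "mappings": {
    "document": {
      "properties": {
        "id": { "type": "integer" },
        "name": {
          "type": "string",
          "index": "not_analyzed",
          "fields": {
            "folding": {
              "type": "string",
              "analyzer": "folding-analyzer"
            }
          }
        }
      }
    }
  }
}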

Now let's index some documents

client.Index<Document>(new Document { Id = 1, Name = "Ayse" });
client.Index<Document>(new Document { Id = 2, Name = "Ayşe" });

// refresh the index after indexing to ensure the documents just indexed are
// available to be searched
client.Refresh(documentsIndex);

Finally, searching for Ayşe

var response = client.Search<Document>(s => s
    .Query(q => q
        .QueryString(qs => qs
            .Fields(f => f
                .Field(c => c.Name.Suffix("folding"))
            )
            .Query("Ayşe")
        )
    )
);

yields

{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.163388,
    "hits" : [ {
      "_index" : "documents",
      "_type" : "document",
      "_id" : "2",
      "_score" : 1.163388,
      "_source" : {
        "id" : 2,
        "name" : "Ayşe"
      }
    }, {
      "_index" : "documents",
      "_type" : "document",
      "_id" : "1",
      "_score" : 0.3038296,
      "_source" : {
        "id" : 1,
        "name" : "Ayse"
      }
    } ]
  }
}

Two things to highlight here:

Firstly, the _source contains the original text that was sent to Elasticsearch, so by using response.Documents you will get the original names. For example

string.Join(",", response.Documents.Select(d => d.Name));

would give you "Ayşe,Ayse"

Secondly, remember that we preserved the original tokens in the asciifolding token filter? Doing so means we can perform queries that undergo analysis to match accent-insensitively, but also take accent sensitivity into account when it comes to scoring; in the example above, the score for the document Ayşe matching the query Ayşe is higher than for the document Ayse matching the same query, because the tokens Ayşe and Ayse are indexed for the former whilst only Ayse is indexed for the latter. When a query that undergoes analysis is performed against the name.folding sub-field, the query is analyzed with the folding-analyzer and a search for matching tokens is performed

Index time
----------

document 1 name: Ayse --analysis--> Ayse

document 2 name: Ayşe --analysis--> Ayşe, Ayse  


Query time
-----------

query_string query input: Ayşe --analysis--> Ayşe, Ayse

search for documents whose name.folding field has tokens matching Ayşe or Ayse
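
If you want to see these tokens for yourself, you can run the analyzer through the analyze API. This is a sketch using the client and documentsIndex assumed earlier; the exact descriptor methods may vary slightly between NEST versions:

// inspect the tokens that folding-analyzer produces for "Ayşe"
var analyzeResponse = client.Analyze(a => a
    .Index(documentsIndex)
    .Analyzer("folding-analyzer")
    .Text("Ayşe")
);

// with preserve_original enabled this should print both tokens: Ayşe, Ayse
Console.WriteLine(string.Join(", ",
    analyzeResponse.Tokens.Select(t => t.Token)));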