You can configure an analyzer to analyze text at index time, index the result into a multi_field to use at query time, and still keep the original source to return in the results. Based on what you have in your question, it sounds like you want a custom analyzer that uses the asciifolding
token filter to convert characters to their ASCII equivalents at index and search time.
Given the following document
public class Document
{
    public int Id { get; set; }
    public string Name { get; set; }
}
Setting up a custom analyzer can be done when an index is created; we can also specify the mapping at the same time
var documentsIndex = "documents"; // the index name; matches the "_index" seen in the search response below

client.CreateIndex(documentsIndex, ci => ci
    .Settings(s => s
        .NumberOfShards(1)
        .NumberOfReplicas(0)
        .Analysis(analysis => analysis
            .TokenFilters(tokenfilters => tokenfilters
                .AsciiFolding("folding-preserve", ft => ft
                    .PreserveOriginal()
                )
            )
            .Analyzers(analyzers => analyzers
                .Custom("folding-analyzer", c => c
                    .Tokenizer("standard")
                    .Filters("standard", "folding-preserve")
                )
            )
        )
    )
    .Mappings(m => m
        .Map<Document>(mm => mm
            .AutoMap()
            .Properties(p => p
                .String(s => s
                    .Name(n => n.Name)
                    .Fields(f => f
                        .String(ss => ss
                            .Name("folding")
                            .Analyzer("folding-analyzer")
                        )
                    )
                    .NotAnalyzed()
                )
            )
        )
    )
);
Here I've created an index with one shard and no replicas (you may want to change this for your environment), and have created a custom analyzer, folding-analyzer,
that uses the standard tokenizer in conjunction with the standard
token filter and a folding-preserve
token filter that performs ASCII folding, storing the original tokens in addition to the folded tokens (more on why this may be useful in a minute).
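For reference, the analysis portion of the fluent settings above corresponds to JSON along these lines (a sketch of what NEST sends to Elasticsearch; the shard, replica and mapping settings are omitted for brevity):

{
  "settings" : {
    "analysis" : {
      "filter" : {
        "folding-preserve" : {
          "type" : "asciifolding",
          "preserve_original" : true
        }
      },
      "analyzer" : {
        "folding-analyzer" : {
          "type" : "custom",
          "tokenizer" : "standard",
          "filter" : [ "standard", "folding-preserve" ]
        }
      }
    }
  }
}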
I've also mapped the Document
type, mapping the Name
property as a multi_field,
with the default field not_analyzed
(useful for aggregations) and a .folding
sub-field that will be analyzed with the folding-analyzer.
The original source document will also be stored by Elasticsearch by default.
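As an aside, because the default Name field is not_analyzed, it can back a terms aggregation that buckets on the exact, unfolded names. A minimal sketch, assuming the index and documents from this answer (the aggregation name "names" is just for illustration):

var aggResponse = client.Search<Document>(s => s
    .Size(0) // we only want aggregation results, not hits
    .Aggregations(a => a
        .Terms("names", t => t
            .Field(f => f.Name)
        )
    )
);

// buckets are keyed on the exact names, e.g. "Ayşe" and "Ayse" end up in separate buckets
foreach (var bucket in aggResponse.Aggs.Terms("names").Buckets)
{
    Console.WriteLine($"{bucket.Key}: {bucket.DocCount}");
}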
Now let's index some documents
client.Index<Document>(new Document { Id = 1, Name = "Ayse" });
client.Index<Document>(new Document { Id = 2, Name = "Ayşe" });
// refresh the index after indexing to ensure the documents just indexed are
// available to be searched
client.Refresh(documentsIndex);
Finally, searching for Ayşe
var response = client.Search<Document>(s => s
    .Query(q => q
        .QueryString(qs => qs
            .Fields(f => f
                .Field(c => c.Name.Suffix("folding"))
            )
            .Query("Ayşe")
        )
    )
);
yields
{
  "took" : 2,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 2,
    "max_score" : 1.163388,
    "hits" : [ {
      "_index" : "documents",
      "_type" : "document",
      "_id" : "2",
      "_score" : 1.163388,
      "_source" : {
        "id" : 2,
        "name" : "Ayşe"
      }
    }, {
      "_index" : "documents",
      "_type" : "document",
      "_id" : "1",
      "_score" : 0.3038296,
      "_source" : {
        "id" : 1,
        "name" : "Ayse"
      }
    } ]
  }
}
Two things to highlight here:
Firstly, the _source
contains the original text that was sent to Elasticsearch, so by using response.Documents
you will get the original names; for example
string.Join(",", response.Documents.Select(d => d.Name));
would give you "Ayşe,Ayse"
Secondly, remember that we preserved the original tokens in the asciifolding token filter? Doing so means that analyzed queries can match accent-insensitively, while scoring still takes accent sensitivity into account; in the example above, the score for Ayşe matching Ayşe is higher than for Ayse matching Ayşe because the tokens Ayşe and Ayse are indexed for the former, whilst only Ayse is indexed for the latter. When a query that undergoes analysis is performed against the name.folding
sub-field, the query input is analyzed with the folding-analyzer
and a search for matching tokens is performed
Index time
----------
document 1 name.folding: Ayse --analysis--> Ayse
document 2 name.folding: Ayşe --analysis--> Ayşe, Ayse

Query time
----------
query_string query input: Ayşe --analysis--> Ayşe, Ayse
search for documents with tokens for the name.folding field matching Ayşe or Ayse
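You can verify these tokens yourself with the _analyze API; a minimal sketch using NEST against the index created above:

var analyzeResponse = client.Analyze(a => a
    .Index(documentsIndex)
    .Analyzer("folding-analyzer")
    .Text("Ayşe")
);

// prints both the folded token and the preserved original, i.e. Ayse and Ayşe
foreach (var token in analyzeResponse.Tokens)
{
    Console.WriteLine(token.Token);
}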