In SQL, I can search email addresses pretty well with LIKE.
Given the email "[email protected]", searching for "stack", "@domain.com", "domain.com", or "domain" returns that address.
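To make the target behaviour concrete, here is a minimal Ruby sketch of the substring matching I mean. The addresses and the `like_search` helper are made up for illustration, and the match is case-insensitive here, which in a real database depends on the collation.

```ruby
# Hypothetical sample data, for illustration only.
EMAILS = ["stack.user@domain.com", "someone@example.org"].freeze

# Rough equivalent of SQL: WHERE email LIKE '%term%' (case-insensitive here).
def like_search(emails, term)
  needle = term.downcase
  emails.select { |email| email.downcase.include?(needle) }
end

["stack", "@domain.com", "domain.com", "domain"].each do |term|
  puts "#{term.inspect} => #{like_search(EMAILS, term).inspect}"
end
```

Every one of those four terms should return only the first address, because each is a plain substring of it.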
How can I get the same result with ElasticSearch?
I played with nGram, edgeNGram, uax_url_email, etc., and the search results have been pretty bad. Please correct me if I'm wrong, but it sounds like I have to do the following:
- for index_analyzer
  - use the "keyword", "whitespace", or "uax_url_email" tokenizer so the email doesn't get tokenized
    - but wildcard queries don't seem to work (with tire at least)
  - use "nGram" or "edgeNGram" as a filter
    - I always get way too many unwanted results, like getting "[email protected]" when searching "first-second"
- for search_analyzer
  - use the "keyword", "whitespace", or "uax_url_email" tokenizer so the email doesn't get tokenized
  - don't do nGram
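The "too many results" problem in the list above can be reproduced with a simplified model of an nGram filter. This is only a sketch in plain Ruby, not Elasticsearch's actual implementation: the `ngrams` helper and the sample address are hypothetical, and "match" here means "at least one term in common".

```ruby
require "set"

# Simplified model of an nGram token filter: every substring of the token
# whose length is between min_gram and max_gram.
def ngrams(token, min_gram: 3, max_gram: 255)
  grams = Set.new
  (0...token.length).each do |start|
    (min_gram..max_gram).each do |len|
      break if start + len > token.length
      grams << token[start, len]
    end
  end
  grams
end

# Hypothetical address, for illustration only.
indexed = "first-other@example.com"
query   = "first-second"

index_terms = ngrams(indexed)

# Search analyzer WITH nGram: any shared gram (e.g. "fir") counts as a match.
loose_match = !(ngrams(query) & index_terms).empty?

# Search analyzer WITHOUT nGram: the whole query must be one indexed gram,
# i.e. a substring of the address.
strict_match = index_terms.include?(query)

puts "n-grammed query matches:  #{loose_match}"
puts "whole-term query matches: #{strict_match}"
```

Under this model, n-gramming the query makes the unrelated address match, while keeping the query as a single term does not, which is why nGram belongs on the index side only.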
Here is the code from one experiment:
    tire.settings :number_of_shards => 1,
                  :number_of_replicas => 1,
                  :analysis => {
                    :filter => {
                      :db_ngram => {
                        "type" => "nGram",
                        "max_gram" => 255,
                        "min_gram" => 3 }
                    },
                    :analyzer => {
                      :string_analyzer => {
                        "tokenizer" => "standard",
                        "filter" => ["standard", "lowercase", "asciifolding", "db_ngram"],
                        "type" => "custom" },
                      :index_name_analyzer => {
                        "tokenizer" => "standard",
                        "filter" => ["standard", "lowercase", "asciifolding"],
                        "type" => "custom" },
                      :search_name_analyzer => {
                        "tokenizer" => "whitespace",
                        "filter" => ["lowercase", "db_ngram"],
                        "type" => "custom" },
                      :index_email_analyzer => {
                        "tokenizer" => "whitespace",
                        "filter" => ["lowercase"],
                        "type" => "custom" }
                    }
                  } do
      mapping do
        indexes :id, :index => :not_analyzed
        indexes :name, :index_analyzer => 'index_name_analyzer', :search_analyzer => 'search_name_analyzer'
        # note: 'search_email_analyzer' is not defined in the :analyzer block above
        indexes :email, :index_analyzer => 'index_email_analyzer', :search_analyzer => 'search_email_analyzer'
      end
    end
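For comparison, the plan from the list (nGram at index time only, no nGram at search time) might look roughly like the hash below. This is an untested sketch: the names `EMAIL_ANALYSIS` and `email_ngram` are made up, the gram sizes are copied from the experiment, and it has not been run against a cluster.

```ruby
# Sketch only: analysis settings for the email field, as a plain Ruby hash.
EMAIL_ANALYSIS = {
  :filter => {
    :email_ngram => { "type" => "nGram", "min_gram" => 3, "max_gram" => 255 }
  },
  :analyzer => {
    # Index time: keep the address as one token, then expand it into n-grams.
    :index_email_analyzer => {
      "type"      => "custom",
      "tokenizer" => "keyword",
      "filter"    => ["lowercase", "email_ngram"]
    },
    # Search time: lowercase only, so the query stays a single term.
    :search_email_analyzer => {
      "type"      => "custom",
      "tokenizer" => "keyword",
      "filter"    => ["lowercase"]
    }
  }
}

puts EMAIL_ANALYSIS[:analyzer].keys.inspect
```

The hash would be passed as `:analysis => EMAIL_ANALYSIS` to `tire.settings`, with the `:email` mapping pointing at the two analyzer names as in the experiment.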
Specific cases that don't work well:
- emails with a hyphen (e.g. [email protected])
- query strings with '@' at the beginning or end
- exact matches
- searching with a wildcard like '@' gives very unexpected results
Suppose I have "[email protected]", "[email protected]", and "[email protected]". Searching "aaa" gives me "[email protected]" and "[email protected]". Searching "aaa*" gives me everything, but "aaa-*" gives me nothing. So, how should I do exact match wildcard queries? For these types of queries, I get pretty much the same results with different tokenizers/analyzers.
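One possible explanation for "aaa-*" returning nothing, assuming the queried field ended up tokenized on punctuation, is that prefix and wildcard terms are compared with the indexed terms as-is rather than being analyzed. The sketch below models that in plain Ruby; the address and both tokenizer stand-ins are hypothetical and much cruder than the real tokenizers.

```ruby
# Hypothetical address, for illustration only.
EMAIL = "aaa-bbb@example.com"

# Crude stand-in for a tokenizer that splits on punctuation.
def punctuation_terms(text)
  text.downcase.split(/[^a-z0-9]+/).reject(&:empty?)
end

# Crude stand-in for a tokenizer that only splits on whitespace.
def whitespace_terms(text)
  text.downcase.split(/\s+/).reject(&:empty?)
end

# A prefix query such as "aaa-*" is checked against each indexed term unchanged.
def prefix_match?(terms, prefix)
  terms.any? { |term| term.start_with?(prefix) }
end

split_terms = punctuation_terms(EMAIL)
whole_terms = whitespace_terms(EMAIL)

puts "split terms: #{split_terms.inspect}"
puts "aaa*  on split terms: #{prefix_match?(split_terms, 'aaa')}"
puts "aaa-* on split terms: #{prefix_match?(split_terms, 'aaa-')}"
puts "aaa-* on whole term:  #{prefix_match?(whole_terms, 'aaa-')}"
```

In this model no split term starts with "aaa-", so the prefix only matches when the address is kept as a single term.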
I do these after each mapping change:

    Model.tire.index.delete
    Model.tire.create_elasticsearch_index
    Model.tire.index.import Model.all