In SQL, I can search email addresses pretty well with LIKE.

With an email "[email protected]", searching "stack", "@domain.com", "domain.com", or "domain" would get me back the desired email address.

How can I get the same result with ElasticSearch?

I played with nGram, edgeNGram, uax_url_email, etc., but the search results have been pretty bad. Please correct me if I'm wrong, but it sounds like I have to do the following:

  1. for index_analyzer
    • use the "keyword", "whitespace", or "uax_url_email" tokenizer so the email doesn't get split into tokens
      • but wildcard queries don't seem to work (with Tire at least)
    • use "nGram" or "edgeNGram" as the filter
      • I always get way too many unwanted results like getting "[email protected]" when searching "first-second".
  2. for search_analyzer
    • don't do nGram

Code from one experiment:

tire.settings :number_of_shards => 1,
            :number_of_replicas => 1,
            :analysis => {
                :filter => {
                    :db_ngram  => {
                        "type"     => "nGram",
                        "max_gram" => 255,
                        "min_gram" => 3 }
                },
                :analyzer => {
                    :string_analyzer => {
                        "tokenizer"    => "standard",
                        "filter"       => ["standard", "lowercase", "asciifolding", "db_ngram"],
                        "type"         => "custom" },
                    :index_name_analyzer => {
                        "tokenizer"    => "standard",
                        "filter"       => ["standard", "lowercase", "asciifolding"],
                        "type"         => "custom" },
                    :search_name_analyzer => {
                        "tokenizer"    => "whitespace",
                        "filter"       => ["lowercase", "db_ngram"],
                        "type"         => "custom" },
                    :index_email_analyzer => {
                        "tokenizer"    => "whitespace",
                        "filter"       => ["lowercase"],
                        "type"         => "custom" }
                }
            } do
    mapping do
      indexes :id,           :index    => :not_analyzed
      indexes :name,         :index_analyzer => 'index_name_analyzer', :search_analyzer => 'search_name_analyzer'
      # NOTE: 'search_email_analyzer' is not defined in the :analyzer settings above
      indexes :email,        :index_analyzer => 'index_email_analyzer', :search_analyzer => 'search_email_analyzer'
    end
end

Specific cases that don't work well:

  • emails with a hyphen (e.g. [email protected])
  • query strings with '@' at the beginning or end
  • exact matches
  • searching with a wildcard like '@' gives very unexpected results.

Suppose I have "[email protected]", "[email protected]", and "[email protected]". Searching "aaa" gives me "[email protected]" and "[email protected]". Searching "aaa*" gives me everything, but "aaa-*" gives me nothing. So, how should I do exact-match wildcard queries? For these types of queries, I get pretty much the same results with different tokenizers and analyzers.

I do these after each mapping change:

Model.tire.index.delete
Model.tire.create_elasticsearch_index
Model.tire.index.import Model.all

1 Answer

Considering what you are trying to accomplish, KeywordAnalyzer might be a reasonable choice of analyzer, though I don't see anything that would cause problems with a WhitespaceAnalyzer.

I suspect you are running into problems with the query parsing and analysis, although you haven't really described how you are querying. The simplest approach would be to use term or prefix queries.
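
As a rough sketch, term and prefix queries could look like the request bodies below, built as plain Ruby hashes and serialized with the standard json library. The field name email comes from the mapping in the question; the helper names and the sample address are hypothetical, and how the body is sent (Tire, curl, or another client) is left out.

```ruby
require "json"

# Exact match against a single indexed term.
def term_query(field, value)
  { query: { term: { field => value } } }
end

# Matches every indexed term that starts with the given prefix.
def prefix_query(field, prefix)
  { query: { prefix: { field => prefix } } }
end

# Hypothetical sample values for illustration.
puts JSON.pretty_generate(term_query("email", "aaa-0@example.com"))
puts JSON.pretty_generate(prefix_query("email", "aaa-"))
```

Neither query type analyzes its input, so the value has to match an indexed term exactly. With a lowercase filter on the index analyzer, that means passing the value already lowercased.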

StandardAnalyzer would also mostly serve your purpose here (though differentiating between "aaa_0" and "aaa-0" would be a problem), as long as it is applied consistently and your query is correct.