I like the results I am getting from Elasticsearch when I index data with edge n-grams and use a different analyzer for searching. I would, however, prefer that shorter matching terms rank higher than longer ones.

For example, take the terms ABC100 and ABC100xxx. If I perform a query using the term ABC, I get back both of these documents as hits with the same score. What I would like is for ABC100 to be scored higher than ABC100xxx, because ABC matches ABC100 more closely by a measure such as Levenshtein distance.


Setting up the index:

PUT stackoverflow
{
    "settings": {
        "index": {
            "number_of_replicas": 0,
            "number_of_shards": 1
        },
        "analysis": {  
            "filter": {
                "edge_ngram": {
                    "type": "edgeNGram",
                    "min_gram": "1",
                    "max_gram": "20"
                }
            },

            "analyzer": {
              "my_analyzer": {
                "type": "custom",
                "tokenizer": "whitespace",
                "filter": [
                  "edge_ngram"
                ]
              }
            }
        }
    },

    "mappings": {
        "doc": {
            "properties": {
                "product": {
                  "type": "text",
                  "analyzer": "my_analyzer",
                  "search_analyzer": "whitespace"
                }
            }
        }
    }
}

Inserting documents:

PUT stackoverflow/doc/1
{
    "product": "ABC100"
}

PUT stackoverflow/doc/2
{
    "product": "ABC100xxx"
}

Search query:

GET stackoverflow/_search?pretty
{
  "query": {
    "match": {
      "product": "ABC"
    }
  }
}

Results:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.28247002,
    "hits": [
      {
        "_index": "stackoverflow",
        "_type": "doc",
        "_id": "2",
        "_score": 0.28247002,
        "_source": {
          "product": "ABC100xxx"
        }
      },
      {
        "_index": "stackoverflow",
        "_type": "doc",
        "_id": "1",
        "_score": 0.28247002,
        "_source": {
          "product": "ABC100"
        }
      }
    ]
  }
}

Does anyone know how I may have a shorter term such as ABC100 ranked higher than ABC100xxx?

1 Answer

After finding plenty of less-than-optimal solutions that involved storing the field length as a separate field or using a script query, I found the root of my problem: I was using the edge_ngram token filter instead of the edge_ngram tokenizer.
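
For anyone who wants to try this, here is a minimal sketch of how the index from the question might look with the tokenizer in place of the token filter. As far as I understand it, the token filter emits every gram at the same position as the original token, and the default similarity discounts such overlapping tokens, so both documents end up with the same field length. The tokenizer gives each gram its own position, so ABC100 becomes a shorter field than ABC100xxx and length normalization can favor it. The tokenizer name edge_ngram_tokenizer is my own, and the token_chars list is an assumption meant to roughly mimic the whitespace tokenizer by splitting only on whitespace; adjust both to your data.

```json
PUT stackoverflow
{
  "settings": {
    "index": {
      "number_of_replicas": 0,
      "number_of_shards": 1
    },
    "analysis": {
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit", "punctuation", "symbol"]
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer"
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "product": {
          "type": "text",
          "analyzer": "my_analyzer",
          "search_analyzer": "whitespace"
        }
      }
    }
  }
}
```

The index has to be deleted and recreated, and the documents reindexed, before the new analysis takes effect. To check what the analyzer produces, the _analyze API shows each token along with its position:

```json
GET stackoverflow/_analyze
{
  "analyzer": "my_analyzer",
  "text": "ABC100"
}
```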