
Using elasticsearch-dsl, I'm trying to search for the closest match to a company name, but to exclude exact matches.

For example, I want to search for names similar to 'Greater London Authority (GLA)', but I want all exact matches to be either filtered out or given a significant downgrading in the score.

To clarify, I know that the string 'Greater London Authority' exists in my index, and I would like it to be ranked above the original string (which is also in the index).

Currently I have:

from elasticsearch_dsl import Q, Search

mn = Q({
    "bool": {
      "must_not": [
        {
          "match": {
            "buyer": entity_name
          }
        }
      ]
    }
  }
)

s = Search(using=es, index="ccs_notices9") \
          .query("match", buyer=entity_name)\
          .query(mn)
         
results = s.execute()
results.to_dict()

But I get no results, which makes sense as the second query simply negates the first. I've tried to use "term" in place of "match" in the mn query, but this isn't allowed. I've also tried a simpler version:

s = Search(using=es, index="ccs_notices9") \
          .query("match", buyer=entity_name)\
          .exclude("term", buyer=entity_name)

This does give me results, but the exact string above is still included.

1 Answer

You need two different fields to achieve what you are looking for. In short, define buyer as a multi-field, as I've done in the example below.

Mapping:

PUT my_exact_match_exclude
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "buyer": {
        "type": "text",
        "fields": {
          "keyword": {
            "type": "keyword",
            "normalizer": "my_normalizer"
          }
        }
      }
    }
  }
}

Note that the mapping for buyer has a sibling field, buyer.keyword, of the keyword datatype, defined using multi-fields.

Also read about normalizers. I've applied one to the keyword sub-field so that the exact match is case-insensitive.
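
To see why the two fields behave differently, here is a rough Python approximation of the terms each one stores. It is only a sketch: the regex stands in for the standard analyzer (assumed to be in effect on buyer, since the mapping sets no analyzer), and both function names are made up for illustration.

```python
import re


def approx_text_terms(value):
    """Rough stand-in for the standard analyzer on the `buyer` text field:
    split on non-word characters and lowercase each token."""
    return [token.lower() for token in re.findall(r"\w+", value)]


def approx_keyword_term(value):
    """Rough stand-in for `buyer.keyword` with the lowercase normalizer:
    the whole string is kept as a single lowercased term."""
    return value.lower()


name = "Greater London Authority (GLA)"
print(approx_text_terms(name))    # several separate terms
print(approx_keyword_term(name))  # one term for the whole string
```

A term query against buyer looks for a single term equal to the full string, which the text field never contains. That is why the exclusion only takes effect on buyer.keyword.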

Sample Docs:

POST my_exact_match_exclude/_doc/1
{
  "buyer": "Greater London Authority (GLA)"
}

POST my_exact_match_exclude/_doc/2
{
  "buyer": "Greater London Authority"
}

POST my_exact_match_exclude/_doc/3
{
  "buyer": "Greater London"
}

POST my_exact_match_exclude/_doc/4
{
  "buyer": "London Authority"
}

POST my_exact_match_exclude/_doc/5
{
  "buyer": "greater london authority (GLA)"
}

Note that the first and the last documents are exactly the same once case is ignored.

Sample Query:

POST my_exact_match_exclude/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "buyer": "Greater London Authority (GLA)"
          }
        }
      ],
      "must_not": [
        {
          "term": {
            "buyer.keyword": "Greater London Authority (GLA)"
          }
        }
      ]
    }
  }
}

Note that I'm applying must_not on the buyer.keyword field, so every document whose whole value is an exact match is excluded.
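
The question also mentions down-ranking exact matches instead of dropping them. One way to sketch that is Elasticsearch's boosting query, built here as a plain Python dict. The helper name and the 0.2 value for negative_boost are arbitrary choices for illustration and have not been tested against the index above.

```python
import json


def build_downrank_query(name, negative_boost=0.2):
    """Build a boosting query: documents matching `positive` are returned,
    and those that also match `negative` have their score multiplied by
    `negative_boost` (a value between 0 and 1)."""
    return {
        "query": {
            "boosting": {
                "positive": {"match": {"buyer": name}},
                "negative": {"term": {"buyer.keyword": name}},
                "negative_boost": negative_boost,
            }
        }
    }


body = build_downrank_query("Greater London Authority (GLA)")
print(json.dumps(body, indent=2))
```

With this variant, documents 1 and 5 would still be returned, only with a reduced score.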

Sample Response:

{
  "took" : 0,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 3,
      "relation" : "eq"
    },
    "max_score" : 0.66237557,
    "hits" : [
      {
        "_index" : "my_exact_match_exclude",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.66237557,
        "_source" : {
          "buyer" : "Greater London Authority"
        }
      },
      {
        "_index" : "my_exact_match_exclude",
        "_type" : "_doc",
        "_id" : "3",
        "_score" : 0.4338556,
        "_source" : {
          "buyer" : "Greater London"
        }
      },
      {
        "_index" : "my_exact_match_exclude",
        "_type" : "_doc",
        "_id" : "4",
        "_score" : 0.4338556,
        "_source" : {
          "buyer" : "London Authority"
        }
      }
    ]
  }
}

As expected, documents 1 and 5 are not returned, as they are the exact matches.

You can use the above query in a similar way in your code.
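
As a minimal sketch of that translation for elasticsearch-dsl: the first function rebuilds the query as a dict, and the second hands it to a Search object. Both function names are hypothetical, and es is assumed to be an existing client connected to an index that uses the mapping above.

```python
def build_exclude_exact_body(name):
    """Same bool query as above: full-text match on `buyer`, minus any
    document whose `buyer.keyword` equals the name exactly."""
    return {
        "query": {
            "bool": {
                "must": [{"match": {"buyer": name}}],
                "must_not": [{"term": {"buyer.keyword": name}}],
            }
        }
    }


def search_similar_buyers(es, index, name):
    """Run the query through elasticsearch-dsl (imported here so the
    builder above can be used without the library installed)."""
    from elasticsearch_dsl import Search

    s = Search(using=es, index=index)
    s = s.update_from_dict(build_exclude_exact_body(name))
    return s.execute()


body = build_exclude_exact_body("Greater London Authority (GLA)")
print(body["query"]["bool"]["must_not"])
```

The chained form from the question should also work once it targets the sub-field, i.e. .query("match", buyer=entity_name).exclude("term", buyer__keyword=entity_name), since elasticsearch-dsl turns a double underscore in a keyword argument into a dot.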

Hope this helps!