Another solution is to use the ngram token filter, which allows a more "fuzzy" match.
Using your example of "microsoft" and "micro soft", here is how an ngram token
filter breaks the text down into tokens:
POST /test
{
  "settings": {
    "analysis": {
      "filter": {
        "my_ngrams": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 5
        }
      },
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["my_ngrams"]
        }
      }
    }
  },
  "mappings": {
    "doc": {
      "properties": {
        "body": {
          "type": "string",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}
Analyzing the two strings:
curl 'localhost:9200/test/_analyze?field=body&pretty' -d'microsoft'
{
  "tokens" : [
    { "token" : "mic" },
    { "token" : "micr" },
    { "token" : "micro" },
    { "token" : "icr" },
    { "token" : "icro" },
    { "token" : "icros" },
    { "token" : "cro" },
    { "token" : "cros" },
    { "token" : "croso" },
    { "token" : "ros" },
    { "token" : "roso" },
    { "token" : "rosof" },
    { "token" : "oso" },
    { "token" : "osof" },
    { "token" : "osoft" },
    { "token" : "sof" },
    { "token" : "soft" },
    { "token" : "oft" }
  ]
}
curl 'localhost:9200/test/_analyze?field=body&pretty' -d'micro soft'
{
  "tokens" : [
    { "token" : "mic" },
    { "token" : "micr" },
    { "token" : "micro" },
    { "token" : "icr" },
    { "token" : "icro" },
    { "token" : "cro" },
    { "token" : "sof" },
    { "token" : "soft" },
    { "token" : "oft" }
  ]
}
(I cut out some of the output, full output here:
https://gist.github.com/dakrone/10abb4a0cfe8ce8636ad)
As you can see, the ngram terms for "microsoft" and "micro soft" overlap
(e.g. "mic", "micro", "soft"), so a search for either form will match
documents containing the other.
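For instance, after indexing a document whose body contains "microsoft", a
plain match query for the split form should find it, since the query text is
run through the same analyzer (a minimal sketch against the test index above;
relevance scores will vary with your data):

POST /test/_search
{
  "query": {
    "match": {
      "body": "micro soft"
    }
  }
}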