
I've been using Elasticsearch for about a month, and there is one thing about fuzzy queries that I can't understand.

The scenario: I have a set of users in one type, in an index of almost 10,000 items. I want to search by username and return all the items that match the search string in a fuzzy way. For example, my user is "masterviana"; if I search with only the text "mastervi", I expect to see masterviana at the top of the results when using a fuzzy query, right?

"fuzzy" : {
    "public_name" : {
        "value" :         "mastervi",
        "boost" :         1.0,
        "fuzziness" :     2,
        "prefix_length" : 0,
        "max_expansions": 100
    }
}

However, I'm not seeing my username (masterviana) on the first page, and I also see usernames that are "less similar" to my query string. To keep the post short, I'll show only the first 5 hits:

 {
            "_index": "username",
            "_type": "username",
            "_id": "2061|FZ4y1t042482S3EqobiVllmv00",
            "_score": 9.198499,
            "_source": {
                "public_name": "masterv",
                "bbid": "FZ4y1t042482S3EqobiVllmv00",
                "hash": 2061,
                "avata": "http://goo.gl/4CRt3v"
            }
        },
        {
            "_index": "username",
            "_type": "username",
            "_id": "2048|r0I5XZ31076phruMS1gu9Hjv00",
            "_score": 5.9688096,
            "_source": {
                "public_name": "project--master",
                "bbid": "r0I5XZ31076phruMS1gu9Hjv00",
                "hash": 2048,
                "avata": "http://goo.gl/4CRt3vr"
            }
        },
        {
            "_index": "username",
            "_type": "username",
            "_id": "1980|W5Wal166832UV5oCqUH9Vjcv00",
            "_score": 5.7984095,
            "_source": {
                "public_name": "masterjv",
                "bbid": "W5Wal166832UV5oCqUH9Vjcv00",
                "hash": 1980,
                "avata": "http://goo.gl/4CRt3v"
            }
        },
        {
            "_index": "username",
            "_type": "username",
            "_id": "2108|Kufhm899338GPWHsuoei1HOv00",
            "_score": 5.7984095,
            "_source": {
                "public_name": "master25",
                "bbid": "Kufhm899338GPWHsuoei1HOv00",
                "hash": 2108,
                "avata": "http://goo.gl/4CRt3v"
            }
        },
        {
            "_index": "username",
            "_type": "username",
            "_id": "1952|AtPw2a97575sC5JT406msOXv00",
            "_score": 5.7984095,
            "_source": {
                "public_name": "masterpiz",
                "bbid": "AtPw2a97575sC5JT406msOXv00",
                "hash": 1952,
                "avata": "http://goo.gl/4CRt3v"
            }
        }, 

As you can see, the top results are 1. masterv and 2. project--master. I think my query "mastervi" is closer to "masterviana" than to, for example, "masterv" or "project--master".

One more thing: if I search with the exact text "masterviana", I get only this item.


1 Answer


The ranking is a blend of edit distance and (often unhelpfully) how rare a term is. I'm not sure which of these is to blame in this case, but ranking by term scarcity is a long-standing Lucene issue. There is a workaround in Elasticsearch using FuzzyLikeThisQuery, but that might not be around for much longer, which has accelerated the need to fix Lucene (for background, see https://github.com/elastic/elasticsearch/pull/10391).
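
To see how much of this could come from edit distance alone, here is a minimal Python sketch that computes the plain Levenshtein distance between the query and the usernames in the question. It is only an illustration, not Lucene's implementation: Lucene may also count an adjacent transposition as a single edit, which does not change these particular numbers. The term "master" is an assumption, namely that the field is analyzed so that "project--master" is indexed as the separate terms "project" and "master".

```python
def levenshtein(a: str, b: str) -> int:
    """Plain edit distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


query = "mastervi"
# "master" is a hypothetical indexed term (assumes "project--master" is tokenized).
candidates = ["masterv", "master", "masterjv", "master25", "masterpiz", "masterviana"]

distances = {name: levenshtein(query, name) for name in candidates}
for name, d in distances.items():
    print(f"{name:12} distance={d} within_fuzziness_2={d <= 2}")
```

Under these assumptions, "masterviana" is three edits away from "mastervi" (the missing "ana"), so a fuzziness of 2 would not match it at all, while the five hits shown are all within one or two edits. If that holds for your index, a prefix-style query may be a better fit than a fuzzy one for partial usernames.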