
We are using Hibernate Search ORM 5.9.2 and would like to achieve exact search results like the following:

If the user types:

John -> all data with John should display
John Murphy -> all data with John Murphy should display
John Murphy Columbia -> only data with John Murphy Columbia should display
John Murphy Columbia SC -> only data with John Murphy Columbia SC should display
John Murphy Columbia SC 29201 -> only data with John Murphy Columbia SC 29201 should display

29201 -> only data with 29201 as the zipcode should display
and so on...

Basically, we are trying to search for exact records across multiple fields of the index.

We have an entity containing this data in fields like name, address1, address2, city, zipcode and state.

We have tried bool() queries (with should/must), but we are not sure what the user will enter first: it could be a zipcode, a state or a city, anywhere in the search text.

Please share your knowledge/logic regarding the analyzers/strategy we can use to accomplish this with Hibernate Search/Lucene.

Below is the index structure:

> {
>         "_index" : "client_master_index_0300",
>         "_type" : "com.csc.pt.svc.data.to.Basclt0300TO",
>         "_id" : "518,1",
>         "_score" : 4.0615783,
>         "_source" : {
>           "id" : "518,1",
>           "cltseqnum" : 518,
>           "addrseqnum" : "1",
>           "addrln1" : "Dba",
>           "addrln2" : "Betsy Evans",
>           "city" : "SDA",
>           "state" : "SC",
>           "zipcode" : "89756-4531",
>           "country" : "USA",
>           "basclt0100to" : {
>             "cltseqnum" : 518,
>             "clientname" : "Betsy Evans",
>             "longname" : "Betsy Evans",
>             "id" : "518"
>           },
>           "basclt0900to" : {
>             "cltseqnum" : 518,
>             "id" : "518"
>           }
>         }
>       }

Below is the input

Akash Agrawal 29201

The response contains all records matching akash, agrawal, 29, 2, 1, 01, etc.

What we are trying to achieve is an exact search result: with respect to the above search input, the results should only contain data with Akash Agrawal 29201 and no other data.

We are basically searching on basclt0100to.longname, addrln1, addrln2, city, state, zipcode, country.

The index definition is below

> {
>   "client_master_index_0300" : {
>     "aliases" : { },
>     "mappings" : {
>       "com.csc.pt.svc.data.to.Basclt0300TO" : {
>         "dynamic" : "strict",
>         "properties" : {
>           "addrln1" : {
>             "type" : "text",
>             "store" : true
>           },
>           "addrln2" : {
>             "type" : "text",
>             "store" : true
>           },
>           "addrln3" : {
>             "type" : "text",
>             "store" : true
>           },
>           "addrseqnum" : {
>             "type" : "text",
>             "store" : true
>           },
>           "basclt0100to" : {
>             "properties" : {
>               "clientname" : {
>                 "type" : "text",
>                 "store" : true
>               },
>               "cltseqnum" : {
>                 "type" : "long",
>                 "store" : true
>               },
>               "firstname" : {
>                 "type" : "text",
>                 "store" : true
>               },
>               "id" : {
>                 "type" : "keyword",
>                 "store" : true,
>                 "norms" : true
>               },
>               "longname" : {
>                 "type" : "text",
>                 "store" : true
>               },
>               "midname" : {
>                 "type" : "text",
>                 "store" : true
>               }
>             }
>           },
>           "basclt0900to" : {
>             "properties" : {
>               "cltseqnum" : {
>                 "type" : "long",
>                 "store" : true
>               },
>               "email1" : {
>                 "type" : "text",
>                 "store" : true
>               },
>               "id" : {
>                 "type" : "keyword",
>                 "store" : true,
>                 "norms" : true
>               }
>             }
>           },
>           "city" : {
>             "type" : "text",
>             "store" : true
>           },
>           "cltseqnum" : {
>             "type" : "long",
>             "store" : true
>           },
>           "country" : {
>             "type" : "text",
>             "store" : true
>           },
>           "id" : {
>             "type" : "keyword",
>             "store" : true
>           },
>           "state" : {
>             "type" : "text",
>             "store" : true
>           },
>           "zipcode" : {
>             "type" : "text",
>             "store" : true
>           }
>         }
>       }
>     },
>     "settings" : {
>       "index" : {
>         "creation_date" : "1535607176216",
>         "number_of_shards" : "5",
>         "number_of_replicas" : "1",
>         "uuid" : "x4R71LNCTBSyO9Taf8siOw",
>         "version" : {
>           "created" : "6030299"
>         },
>         "provided_name" : "client_master_index_0300"
>       }
>     }
>   }
> }

So far I've tried an EdgeNGram analyzer and Lucene's standard analyzer. I've tried bool() queries, keyword queries and phrase queries; basically everything available in the documentation.

But I'm sure I'm missing the strategy/logic which we should use.

Below is the current query I'm using, which gives the results shown in the attached snapshots.

    Query finalQuery = queryBuilder.simpleQueryString()
            .onFields("basclt0100to.longname", "addrln1", "addrln2",
                    "city", "state", "zipcode", "country")
            .withAndAsDefaultOperator()
            .matching(lowerCasedSearchTerm)
            .createQuery();

    FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(finalQuery, Basclt0300TO.class);
    fullTextQuery.setMaxResults(this.data.getPageSize()).setFirstResult(this.data.getPageSize());

    List<String> projectedFields = new ArrayList<String>();
    for (String fieldName : projections) {
        projectedFields.add(fieldName);
    }

    @SuppressWarnings("unchecked")
    List<Cltj001ElasticSearchResponseTO> results = fullTextQuery
            .setProjection(projectedFields.toArray(new String[projectedFields.size()]))
            .setResultTransformer(new BasicTransformerAdapter() {
                @Override
                public Cltj001ElasticSearchResponseTO transformTuple(Object[] tuple, String[] aliases) {
                    return new Cltj001ElasticSearchResponseTO((String) tuple[0], (long) tuple[1],
                            (String) tuple[2], (String) tuple[3], (String) tuple[4],
                            (String) tuple[5], (String) tuple[6], (String) tuple[7], (String) tuple[8]);
                }
            })
            .getResultList();
    resultsClt0300MasterIndexList = results;

Searched for: "akash 29201", and searched for: "akash 1 main".

Here you can see we have all the data containing akash, sh, 29, 292, 29201.

Expected results:

Akash Agrawal - 29201
Akash Agrawal - 1 main street, SC, 29201

Basically, only the data exactly containing/matching the input string.

Analyzers used. At index time:

    @AnalyzerDef(name = "autocompleteEdgeAnalyzer",

//Split input into tokens according to tokenizer
                tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
                         filters = {
                                   @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                                    @TokenFilterDef(factory = StopFilterFactory.class),
                                    @TokenFilterDef(
                                            factory = EdgeNGramFilterFactory.class, // Generate prefix tokens
                                            params = {
                                                    @Parameter(name = "minGramSize", value = "3"),
                                                    @Parameter(name = "maxGramSize", value = "3")
                                            }
                                    )
                            })

Overriding at query time with:

    @AnalyzerDef(name = "withoutEdgeAnalyzerFactory",

// Split input into tokens according to tokenizer
                tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
                         filters = {
                                    @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
                                    @TokenFilterDef(factory = LowerCaseFilterFactory.class),


                            }
                /*filters = {
                        // Normalize token text to lowercase, as the user is unlikely to
                        // care about casing when searching for matches
                        @TokenFilterDef(factory = PatternReplaceFilterFactory.class, params = {
                                @Parameter(name = "pattern", value = "([^a-zA-Z0-9\\.])"),
                                @Parameter(name = "replacement", value = " "),
                                @Parameter(name = "replace", value = "all") }),
                        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
                        @TokenFilterDef(factory = StopFilterFactory.class) }*/)
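
For completeness, analyzer definitions like these are typically wired to the indexed fields along the following lines. This is illustrative only; the exact entity mapping isn't shown in this post, and the fields below are assumed from the index structure above:

    import org.hibernate.search.annotations.Analyzer;
    import org.hibernate.search.annotations.Field;
    import org.hibernate.search.annotations.Indexed;
    import org.hibernate.search.annotations.Store;

    @Indexed(index = "client_master_index_0300")
    public class Basclt0300TO {

        // Indexed with the edge-ngram analyzer; the plain "withoutEdgeAnalyzerFactory"
        // analyzer is intended to be applied only at query time.
        @Field(store = Store.YES, analyzer = @Analyzer(definition = "autocompleteEdgeAnalyzer"))
        private String addrln1;

        @Field(store = Store.YES, analyzer = @Analyzer(definition = "autocompleteEdgeAnalyzer"))
        private String city;

        // ... addrln2, state, zipcode, country mapped the same way
    }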

Hope these details help.

@Yoann Rodière, please share your inputs. (ronak)
It's a bit hard to understand what your problem is exactly. Please at least post your entity model and your current query code, and explain what's wrong with the current behavior, preferably with examples (one set of documents, the input from the user, the actual result, the expected result). That'll be a start. (yrodiere)
@Yoann Rodière, I've provided details in my question, and will take care of your suggestions about posting questions. (ronak)
I'll try again: please post your current query code, and explain what's wrong with the current behavior, preferably with examples (one set of documents, the input from the user, the actual result, the expected result). (yrodiere)
@Yoann Rodière, I've updated the question with the query and result snapshots from my application. (ronak)

1 Answer


The easiest solution, requiring only a small code change, would be to use Occur.MUST instead of Occur.SHOULD in your boolean query. Then you would only get documents that match every keyword, instead of documents matching at least one keyword, as is currently the case.
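
For instance, with the Hibernate Search 5 query DSL, each must() clause of a bool() junction ends up as an Occur.MUST clause. A minimal sketch, assuming the field list from your query and a "keywords" list standing for the user input split into words (the splitting here is purely illustrative; as explained below, the analyzer normally does that for you):

// Sketch: "keywords" stands for the user input split into words, e.g. ["akash", "agrawal", "29201"]
BooleanJunction<?> bool = queryBuilder.bool();
for (String keyword : keywords) {
    // Every keyword becomes a MUST clause, so all of them have to match somewhere in the listed fields
    bool.must(queryBuilder.keyword()
            .onFields("basclt0100to.longname", "addrln1", "addrln2",
                    "city", "state", "zipcode", "country")
            .matching(keyword)
            .createQuery());
}
Query finalQuery = bool.createQuery();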

However, it's not exactly the most correct solution. Try that, then see below if you want to understand what's going on.


First, you shouldn't need to split the input string yourself; that's Lucene's (and Elasticsearch's) job, during what is called "text analysis". You really should understand text analysis before you start using Hibernate Search.

Text analysis is, in short, the process of turning a single string into "tokens" (words) that can be used in a full-text index.

Text analysis is performed in two cases (I'm simplifying, but that's more or less what happens):

  • when indexing documents, the content of each field is analyzed and Elasticsearch stores the result of the analysis (a list of tokens) in the index.
  • when querying, the query string is analyzed and the full-text engine will look for every resulting token in the index.

Text analysis consists in three steps:

  • Character filtering, which I won't describe in detail because it's generally skipped, so you probably won't need it.
  • Tokenization, which splits a single string into multiple parts, called "tokens". In short it extracts words from a string.
  • Token filtering, which applies transformations to the tokens, such as turning them to lowercase, replacing diacritics with simpler equivalents ("é" => "e", "à" => "a", ...), splitting tokens further into ngrams ("word" => ["w", "wo", "wor", "word"]), and so on; see the sketch just after this list for what this produces in practice.
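
To make this concrete, here is a small, self-contained sketch (not part of the original answer) that prints the tokens an analyzer produces for a given string; with Lucene's StandardAnalyzer, "Akash Agrawal 29201" comes out as the tokens "akash", "agrawal" and "29201":

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class AnalysisDemo {

    // Prints each token the analyzer produces for the given text.
    static void printTokens(Analyzer analyzer, String text) throws IOException {
        try (TokenStream stream = analyzer.tokenStream("anyField", text)) {
            CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);
            stream.reset();
            while (stream.incrementToken()) {
                System.out.println(term.toString());
            }
            stream.end();
        }
    }

    public static void main(String[] args) throws IOException {
        // StandardTokenizer + lowercasing: prints "akash", "agrawal", "29201"
        printTokens(new StandardAnalyzer(), "Akash Agrawal 29201");
    }
}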

The purpose of this text analysis is to allow for more subtle matches than just "equal strings". In particular, it allows you to find words in a document and to perform case-insensitive searches, but also much more subtle searches such as "words starting with a given string" (using the EdgeNGramTokenFilter) or matches of seemingly unrelated terms such as "wifi" and "wi fi".

How the search behaves depends entirely on the analyzers applied at indexing time and at query time. Generally the same analysis is applied in both cases, but in some very specific, advanced use cases (such as when using the EdgeNGramTokenFilter) you will need to use a slightly different analyzer when querying.
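
With Hibernate Search 5, that is typically done by overriding the analyzer for specific fields when building the QueryBuilder. A minimal sketch, assuming the "withoutEdgeAnalyzerFactory" definition from the question (only some of the fields are shown):

QueryBuilder queryBuilder = fullTextSession.getSearchFactory()
        .buildQueryBuilder()
        .forEntity(Basclt0300TO.class)
        // The index was built with the edge-ngram analyzer; analyze the query input
        // with the plain analyzer so the search terms are not turned into ngrams again.
        .overridesForField("basclt0100to.longname", "withoutEdgeAnalyzerFactory")
        .overridesForField("addrln1", "withoutEdgeAnalyzerFactory")
        .overridesForField("addrln2", "withoutEdgeAnalyzerFactory")
        .get();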

As you can see, Lucene/Elasticsearch already does what you want, i.e. splitting the input string into multiple "words". Also, if you use the right analyzer, you won't need to lowercase the input string yourself, as a Token Filter will take care of that.

That's all for the basics, which you really need to understand before you start using Hibernate Search.

Now for the specifics: the problem is, when you just use the .keyword() query, the string will indeed be split into multiple words, but Hibernate Search will search for documents that match any of those words. Which is not what you want: you want to search for documents that match all of those words.

In order to do that, I would suggest using the "simple query string" query. You create it more or less like a keyword() query, but it has nice additional features that make it better suited to a web interface such as the one you're building. In particular, it allows you to require all the "words" in the query to match, by setting the default operator to "and".

For example:

Query finalQuery = queryBuilder.simpleQueryString()
        .onFields("basclt0100to.longname", "addrln1", "addrln2",
                "city", "state", "zipcode", "country")
        .withAndAsDefaultOperator()
        .matching(searchTerms)
        .createQuery();

FullTextQuery fullTextQuery = fullTextSession.createFullTextQuery(finalQuery, Basclt0300TO.class);