
I'm new to Hibernate Search and Lucene. For the past few days I have been working on searching for keywords that contain special characters. I am using MultiFieldQueryParser for exact phrase matching as well as Boolean search. However, I get no results for quoted search keywords such as "Having 1+ years of experience"; if I remove the quotes around the keyword, I do get results. Looking at the Lucene query that gets executed, I can see that the special symbol (+) is being stripped out. I am using StandardAnalyzer. I think WhitespaceAnalyzer would keep the special characters, but it might affect Boolean searches such as +java +php (i.e. java AND php), because the operators might be treated as normal text. Any suggestions would be appreciated.

The following is my snippet:

        Session session = getSession();
        FullTextSession fullTextSession = Search.getFullTextSession(session);

        MultiFieldQueryParser parser = new MultiFieldQueryParser(new String[] { "student.skills.skill",
                "studentProfileSummary.profileTitle", "studentProfileSummary.currentDesignation" },
                new StandardAnalyzer());
        parser.setDefaultOperator(Operator.OR);
        org.apache.lucene.search.Query luceneQuery = null;
        QueryBuilder qb = fullTextSession.getSearchFactory().buildQueryBuilder().forEntity(Student.class).get();
        BooleanQuery boolQuery = new BooleanQuery();
        if (StringUtils.isNotBlank(zipcode)) {
            boolQuery.add(
                    qb.keyword().onField("personal.locations.postalCode").matching(zipcode).createQuery(),
                    BooleanClause.Occur.MUST);
        }
        if (StringUtils.isNotBlank(query)) {
            try {
                luceneQuery = parser.parse(query.toUpperCase());
            } catch (ParseException e) {
                luceneQuery = parser.parse(parser.escape(query.toUpperCase()));
            }
            boolQuery.add(luceneQuery, BooleanClause.Occur.MUST);
        }
        boolQuery.add(qb.keyword().onField("vStatus").matching(1).createQuery(), BooleanClause.Occur.MUST);
        boolQuery.add(qb.keyword().onField("status").matching(1).createQuery(), BooleanClause.Occur.MUST);
        boolQuery.add(qb.range().onField("studentProfileSummary.profilePercentage").from(80).to(100).createQuery(),
                BooleanClause.Occur.MUST);
        FullTextQuery createFullTextQuery = fullTextSession.createFullTextQuery(boolQuery, Student.class);
        createFullTextQuery.setProjection("id", "studentProfileSummary.profileTitle", "firstName","lastName");

        if (isEmptyFilter == false) {
            createFullTextQuery.setFirstResult((int) pageNumber);
            createFullTextQuery.setMaxResults((int) end);
        }
        return createFullTextQuery.list();

1 Answer


The key to controlling such effects is indeed the Analyzers you choose to use. As you noticed, the StandardAnalyzer removes or ignores some symbols, since they are not commonly used in search terms.

The StandardAnalyzer works well for most English natural-language text, but you also want special symbols to be searchable. The typical solution is to index the text into multiple fields and assign a different Analyzer to each field. You can then generate queries that target both fields and combine the scores obtained from each. You can even customize the weight that each field should have, and experiment with different Similarity implementations to obtain various effects.
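To make the two-field idea concrete, here is a minimal plain-Java sketch of what happens to the same text under two different analysis strategies. It does not use Lucene at all: `symbolDroppingTokens` is only a rough stand-in for an analyzer that discards symbols (it is not the real StandardAnalyzer algorithm), `whitespaceTokens` is a stand-in for whitespace-only analysis, and the field names `profileTitle` and `profileTitle_exact` are hypothetical.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

final class TwoFieldSketch {

    // Rough stand-in for a symbol-dropping analyzer (NOT Lucene's real algorithm):
    // lowercases, then splits on anything that is not a letter or a digit.
    static List<String> symbolDroppingTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    // Stand-in for whitespace-only analysis: lowercases and keeps symbols such as '+'.
    static List<String> whitespaceTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String part : text.trim().toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!part.isEmpty()) tokens.add(part);
        }
        return tokens;
    }

    // Analyzes one text into two hypothetical fields, one per strategy.
    static Map<String, List<String>> analyze(String text) {
        Map<String, List<String>> fields = new LinkedHashMap<>();
        fields.put("profileTitle", symbolDroppingTokens(text));
        fields.put("profileTitle_exact", whitespaceTokens(text));
        return fields;
    }

    // True if the phrase tokens occur contiguously, in order, in the document tokens.
    static boolean containsPhrase(List<String> docTokens, List<String> phraseTokens) {
        return !phraseTokens.isEmpty()
                && Collections.indexOfSubList(docTokens, phraseTokens) >= 0;
    }

    public static void main(String[] args) {
        Map<String, List<String>> doc = analyze("Having 1+ years of experience");
        System.out.println(doc.get("profileTitle"));       // [having, 1, years, of, experience]
        System.out.println(doc.get("profileTitle_exact")); // [having, 1+, years, of, experience]
    }
}
```

In this sketch the phrase "1+ years" keeps its `+` only in the whitespace-style field, so a text containing "1 years" would match the phrase in the first field but not in the second. In a real setup each field would get its own Lucene Analyzer and the query would target both.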

But in your specific example of "1+ years", you might want to consider what you expect it to find. Should it match the string "6 years"? If so, you probably want to implement a custom analyzer which specifically looks for such patterns and generates multiple matching tokens, such as the sequence {"1 year", "2 years", "3 years", ...}. That will be effective, but it will only match that specific sequence of terms, so you may want to look into more advanced extensions from the Lucene community; Lucene lets you plug in many such extensions.
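As a rough illustration of the expansion step only, the following plain-Java sketch turns an "N+ years" pattern into the sequence of alternatives described above. It is not a Lucene analyzer or TokenFilter; the class name, the `expand` method and the `maxYears` cap are all hypothetical, and a real implementation would have to emit these as tokens inside a custom analysis chain.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class YearsExpansionSketch {

    // Matches patterns such as "1+ years" or "3+ year" (one or two digits, case-insensitive).
    private static final Pattern N_PLUS_YEARS =
            Pattern.compile("\\b(\\d{1,2})\\+\\s*years?\\b", Pattern.CASE_INSENSITIVE);

    // Expands every "N+ years" occurrence into "N years", "N+1 years", ... up to maxYears.
    // maxYears is an arbitrary cap chosen for this sketch.
    static List<String> expand(String text, int maxYears) {
        List<String> alternatives = new ArrayList<>();
        Matcher m = N_PLUS_YEARS.matcher(text);
        while (m.find()) {
            int from = Integer.parseInt(m.group(1));
            for (int n = from; n <= maxYears; n++) {
                alternatives.add(n + (n == 1 ? " year" : " years"));
            }
        }
        return alternatives;
    }

    public static void main(String[] args) {
        // prints [1 year, 2 years, 3 years]
        System.out.println(expand("Having 1+ years of experience", 3));
    }
}
```

Whether such an expansion belongs at index time or at query time is a design choice; either way, text that contains no "N+ years" pattern yields no extra alternatives in this sketch.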