
First of all, sorry for my bad English!

I am new to the Lucene library (since last Wednesday) and I'm trying to understand how to rank matching documents by relevance, based on the terms found.

I use Lucene 4.10.0 (no Solr).

I'm able to index and search English and Arabic text, and hit highlighting works for both.

Now I have a problem with the relevance of the search results.

If I search for "Mohammad Omar" in three docs:

doc1 = new Document();
doc1.add(new TextField("contents", "xyz abc, 123 Mohammad Abu Omar 123", Field.Store.YES));
indexWriter.addDocument(config.build(taxoWriter, doc1));

doc2 = new Document();
doc2.add(new TextField("contents", "xyz abc, 123 Omar bin Mohammad 123", Field.Store.YES));
indexWriter.addDocument(config.build(taxoWriter, doc2));

doc3 = new Document();
doc3.add(new TextField("contents", "xyz abc, 123 Abu Mohammad Omar 123", Field.Store.YES));
indexWriter.addDocument(config.build(taxoWriter, doc3));
...etc

I get the same score for all three docs.

It looks like Lucene ignores word order and scores only on the number of matching terms.

I expect the following ranking:

doc3 THEN doc1 THEN doc2

but I get:

doc1 THEN doc2 THEN doc3 (ALL HAVE THE SAME SCORE)

To search in lowercase and within substrings, I use a custom Analyzer like this:

   @Override
   protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
     Tokenizer source = new WhitespaceTokenizer(reader);
     TokenStream filter = new LowerCaseFilter(source);
     filter = new WordDelimiterFilter(filter, Integer.MAX_VALUE, null);
     return new TokenStreamComponents(source, filter);
   }

Any idea how to achieve this?

From here: http://lucene.apache.org/core/4_10_0/queryparser/org/apache/lucene/queryparser/classic/package-summary.html#Boosting_a_Term

I see that boosting query terms and/or using regular expressions could be an option, but that means I would have to process user input manually. Isn't there an out-of-the-box solution (such as a function, Filter or Analyzer)?

Many thanks!


1 Answer


What does your "Mohammad Omar" query look like in code? If you need just this exact phrase, feed the string into a PhraseQuery, or, if you use QueryParser, wrap the phrase in quotes so that it produces a PhraseQuery.

If you need both this phrase as well as documents containing both terms separately in the search results, you could include "Mohammad Omar" both as a phrase (as specified above) and as separate terms, something like this: some_field:"Mohammad Omar" some_field:Mohammad some_field:Omar. You can also add boosting for the phrase element so that phrase results rank higher.
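
To avoid handling the user input by hand each time, the combined query string described above can be assembled by a small helper. The following is only a sketch under some assumptions: `PhraseBoostQueryBuilder`, its `build` and `escape` methods, and the slop and boost values are hypothetical names and numbers chosen for illustration, not part of Lucene. It uses plain Java and produces a string in the classic query syntax, which you would then pass to your QueryParser (which applies your Analyzer, so no manual lowercasing is done here). In real code you may prefer Lucene's own `QueryParser.escape` over the hand-written `escape` below.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Lucene): turns raw user input into a
// classic query syntax string of the form
//   field:"t1 t2"~slop^boost field:t1 field:t2
final class PhraseBoostQueryBuilder {

    // Characters that the classic query syntax treats as special.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    // Backslash-escape special characters so user input is taken literally.
    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (char c : term.toCharArray()) {
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    // Builds a boosted (optionally sloppy) phrase clause followed by one
    // clause per term. Returns an empty string for blank input.
    static String build(String field, String userInput, int slop, float boost) {
        List<String> terms = new ArrayList<>();
        for (String t : userInput.trim().split("\\s+")) {
            if (!t.isEmpty()) {
                terms.add(escape(t));
            }
        }
        if (terms.isEmpty()) {
            return "";
        }
        StringBuilder q = new StringBuilder();
        if (terms.size() > 1) {
            // Phrase clause: only meaningful with at least two terms.
            q.append(field).append(":\"").append(String.join(" ", terms)).append('"');
            if (slop > 0) {
                q.append('~').append(slop);
            }
            q.append('^').append(boost);
        }
        for (String t : terms) {
            if (q.length() > 0) {
                q.append(' ');
            }
            q.append(field).append(':').append(t);
        }
        return q.toString();
    }

    public static void main(String[] args) {
        // Assumed values: slop 2 and boost 5 are arbitrary starting points.
        System.out.println(build("contents", "Mohammad Omar", 2, 5.0f));
    }
}
```

For the input "Mohammad Omar" this prints contents:"Mohammad Omar"~2^5.0 contents:Mohammad contents:Omar. With a slop on the phrase, documents where the terms sit closer together and in query order should generally score higher under the default similarity, which is the kind of ordering asked for in the question; the exact slop and boost need tuning against your own data.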