0
votes

I am trying to build an application that implements a search system over a Lucene index. The index is built, I can search for documents in it, and everything seems to work fine. However, when I search on a field value that is shared by many documents, the search only returns some of them. I ran the same search in Luke and it behaves the same way.

For example, my index has 2 fields:

Field A: An identifier that is unique. Field B: A String.

First Example:

We have 5 documents:

Doc 1: FieldA:1; FieldB:hello world

Doc 2: FieldA:2; FieldB:hello world!

Doc 3: FieldA:3; FieldB:hello world

Doc 4: FieldA:4; FieldB:anything

Doc 5: FieldA:5; FieldB:hello world

When I make a search like "B: hello world", it should return documents 1, 3 and 5, but it only returns 1 and 3.

When I make a search like "A: 5" it returns the document 5 and the field B value is "hello world".

Second Example: (one token)

Doc 6: FieldA:6; FieldB:token

Doc 7: FieldA:7; FieldB:token

Doc 8: FieldA:8; FieldB:TOKEN

Doc 9: FieldA:9; FieldB:token

When I search FieldB:"token" it only returns Doc 6 and Doc 9. The only way I can find Doc 7 is searching by its FieldA.

I am using WhitespaceAnalyzer and both Fields are NOT_ANALYZED.

IndexGenerator Main

...

IndexWriter writer = new IndexWriter(directory, new WhitespaceAnalyzer(), true, IndexWriter.MaxFieldLength.UNLIMITED);
writer.setRAMBufferSizeMB(200);

List<Work> works = getWorks(); //Retrieves the information from the DB

for (Work work: works) {

   Document luceneDocument = createLuceneDocument(work);
   writer.addDocument(luceneDocument);

}
writer.commit();

...

CreateLuceneDocument Method:

private static Document createLuceneDocument(Work work) {

 try {
   Document luceneDoc = new Document();

   ...

   Field id = new Field("ID", work.getId(),Field.Store.YES,Field.Index.NOT_ANALYZED);
   luceneDoc.add(id);

   Field name = new Field("NAME", work.getName(),Field.Store.YES,Field.Index.NOT_ANALYZED);
   luceneDoc.add(name);

   ...

   return luceneDoc;

   }
   catch (LuceneException e) {
       ...
   }
}

I have noticed that the documents that are not returned have a low score value. Assuming the problem lies in how the index is created, because Luke behaves the same way as the application, what am I doing wrong?

Thanks in advance!

2
I don't see anything in your example which would give rise to such a problem. I also don't understand how you can have a score for a document which is not found with your query. Perhaps some more information would be useful, such as your search code, and some further information on data where this issue is actually occurring? - femtoRgon
Thanks @femtoRgon! The example is the easiest way to explain what is happening. The real index has over 12 fields and is way more complex than the example. As I said in the first post, Luke doesn't show the documents even though the fields fulfill the search request. So, the problem should be in the index generation process. I am going to add more information about the index generation. - user2993510
Where do you see "that the Documents that are not returned have a low score value"? - groverboy
I have 2 fields with the same values: the first field is not analyzed and is the one with the problem; the second field is analyzed into tokens and works fine. When I search for "hello world" in the first field I get 2 results, while the same search in the second field returns 3 results (following the example). The same happens when I search for strings with only one token, so the problem is not related to tokenization. When I search by this second field I get the documents that are not returned when searching by the first field, and all of them have very low score values even though the field value is the same. - user2993510

2 Answers

1
votes

I'll just give you my suspicion here, I suppose. You say you are using WhitespaceAnalyzer, but since your fields are NOT_ANALYZED, that analyzer isn't doing anything to the indexed content. They are indexed precisely as they are, as a single token.

If you are indexing the value "hello there", searching with a TermQuery on "hello" won't find anything. Neither will it find anything if you have indexed "Hello", "hello!", or even "hello ". It will be sensitive to case, punctuation, whitespace, etc., and require a match on the entire input. So I suspect that your unfound document has a problem along these lines.
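
To make that concrete, here is a minimal plain-Java sketch of the matching behaviour described above. It is not Lucene code: `ExactTermIndex`, `add`, `search` and `reveal` are hypothetical names made up for illustration, and the trailing space on document 5 is an invented example of the kind of hidden difference that could explain a missing hit. The `reveal` helper prints each character as a code point, which may help when comparing stored values that look identical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of an un-analyzed field: each value is stored as ONE term,
// and a lookup succeeds only on an exact, character-for-character match.
// This is NOT Lucene code; it only mimics the matching behaviour.
class ExactTermIndex {
    private final Map<String, List<Integer>> postings = new LinkedHashMap<>();

    // Stores the whole value as a single term for the given document.
    void add(int docId, String value) {
        postings.computeIfAbsent(value, k -> new ArrayList<>()).add(docId);
    }

    // Returns the documents whose stored value equals the term exactly.
    List<Integer> search(String term) {
        return postings.getOrDefault(term, new ArrayList<>());
    }

    // Renders every character as a code point so invisible differences show up.
    static String reveal(String value) {
        StringBuilder sb = new StringBuilder();
        value.codePoints().forEach(cp -> sb.append(String.format("U+%04X ", cp)));
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        ExactTermIndex index = new ExactTermIndex();
        index.add(1, "hello world");
        index.add(2, "hello world!");
        index.add(3, "hello world");
        index.add(5, "hello world ");   // hypothetical trailing space (assumption)
        System.out.println(index.search("hello world"));
        System.out.println(reveal("world "));
    }
}
```

If the stored values of a found and an unfound document are dumped this way (for example after retrieving them by ID), any difference in whitespace, case or punctuation should become visible.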

1
votes

Lucene will resolve the search expression B:hello world to B:hello D:world, an expression of two terms. Here D is the default search field, probably "another Field" mentioned in your comment on @femtoRgon's answer.

I'm guessing the results include documents 1 and 3 because they match on token "world" in field D, but this token is absent from document 5 field D. But this is possible only if the default search operator is OR not AND, because B:hello cannot match these documents.

You may get the results you expect by using a phrase expression: B:"hello world". But you may not; WhitespaceAnalyzer will break this phrase into two tokens when it builds a Query object.
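
The splitting rule described above can be imitated with a small plain-Java sketch. This is a toy, not Lucene's QueryParser: `ToyQuerySplitter` and `clauses` are hypothetical names, and the sketch only models one rule, namely that a field prefix applies to the term (or quoted phrase) directly after it, while any other term falls back to the default field.

```java
import java.util.ArrayList;
import java.util.List;

// Toy illustration of how an unquoted query string is split into clauses.
// It is NOT Lucene's QueryParser, only a simplified imitation of one rule:
// a field prefix applies to the term directly after it and nothing else.
class ToyQuerySplitter {
    static List<String> clauses(String query, String defaultField) {
        List<String> result = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : query.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;          // quotes keep a phrase together
                current.append(c);
            } else if (Character.isWhitespace(c) && !inQuotes) {
                flush(current, defaultField, result);
            } else {
                current.append(c);
            }
        }
        flush(current, defaultField, result);
        return result;
    }

    private static void flush(StringBuilder current, String defaultField, List<String> out) {
        if (current.length() == 0) return;
        String clause = current.toString();
        // No explicit field prefix: the clause falls back to the default field.
        // (Simplification: a colon inside a quoted phrase is not handled.)
        out.add(clause.contains(":") ? clause : defaultField + ":" + clause);
        current.setLength(0);
    }

    public static void main(String[] args) {
        System.out.println(clauses("B:hello world", "D"));
        System.out.println(clauses("B:\"hello world\"", "D"));
    }
}
```

With `D` as the default field, the unquoted input yields the two clauses `B:hello` and `D:world`, while the quoted input stays a single clause on field B.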

You could get around the problem by using KeywordAnalyzer for field B, as described in my answer to another question.