1
votes

I have a database with a lot of books in it, with fields such as title, description, authors, etc.

I'm indexing title with a boost of 100f and description with a boost of 0.1f, both fields tokenized and stemmed.

I'm searching with a single input field that searches all available fields, using a BooleanQuery whose clauses are joined with BooleanClause.Occur.SHOULD and which contains a WildcardQuery for each field. I also remove all stop words from the query first.

The problem I'm having occurs when I search for the following string (without the quotes):

"de wetenschap van het leven". After removing the stop words, I get "wetenschap leven".

The title query becomes "*wetenschap* *leven*" and the description query is the same; both are wrapped in a BooleanQuery joined with BooleanClause.Occur.SHOULD.

The following books are in the database:

  • Wetenschappelijk denken. Een inleiding voor de medische en biomedische wetenschappen en voor de andere levenswetenschap.
  • De wetenschap van de aarde. Over een levende planeet
  • Atlas van de menselijke levensloop
  • De wetenschap van het leven. Over eenheid in biologische diversiteit

The book is returned within the first 4 results, which is good, but in this implementation we cut off at 3 and the rest is hidden behind a "read more" link. Just raising the cutoff is not an option.

To me, the book "De wetenschap van het leven. Over eenheid in biologische diversiteit" matches the query better than the others, but I'm unable to find the right index/search combination to reflect that. Does anyone have an idea?

3 Answers

2
votes

A few suggestions:

  1. Do not remove stop words - they seem to be an important part of your search query.
  2. Do not use wildcards - search just for the words you need. I believe the best option is to use a PhraseQuery, e.g. "de wetenschap van het leven".
  3. Do not search past sentence end. This is tougher - you may need to index each sentence separately.
  4. Read Debugging Relevance Issues in Search - you will probably get other ideas there.
1
votes

I think a SpanQuery (specifically a SpanNearQuery) might be what you need.

Given a document "a quick brown fox jumps over a lazy dog"

it can find a match for "brown fox" and "lazy dog". You can adjust the slop setting to control the allowed distance between the two query phrases/terms. In short, it gives you a lot of tools to tweak your search.
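
To make the slop idea concrete, here is a minimal sketch in plain Java. It is not Lucene code: `NearMatchSketch`, `tokenize` and `nearInOrder` are hypothetical names, and the matching rule (first term before second, with at most `slop` other tokens in between) is a simplified model of what an in-order SpanNearQuery does.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

/** Toy model of an ordered "near" match with slop; an illustration, not Lucene code. */
final class NearMatchSketch {

    /** Lowercases and splits on whitespace; a stand-in for a real analyzer. */
    static List<String> tokenize(String text) {
        return Arrays.asList(text.toLowerCase(Locale.ROOT).trim().split("\\s+"));
    }

    /**
     * Returns true if 'first' occurs before 'second' with at most
     * 'slop' other tokens between the two.
     */
    static boolean nearInOrder(List<String> tokens, String first, String second, int slop) {
        for (int i = 0; i < tokens.size(); i++) {
            if (!tokens.get(i).equals(first)) {
                continue;
            }
            for (int j = i + 1; j < tokens.size(); j++) {
                // j - i - 1 is the number of tokens sitting between the two terms
                if (tokens.get(j).equals(second) && j - i - 1 <= slop) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = tokenize("a quick brown fox jumps over a lazy dog");
        System.out.println(nearInOrder(doc, "brown", "fox", 0)); // adjacent terms
        System.out.println(nearInOrder(doc, "quick", "fox", 0)); // "brown" sits in between
        System.out.println(nearInOrder(doc, "quick", "fox", 1)); // slop 1 allows one gap
        System.out.println(nearInOrder(doc, "fox", "brown", 5)); // wrong order
    }
}
```

With slop 0 only adjacent terms match; raising the slop lets "quick ... fox" match as well, while the reversed pair never matches because this sketch requires the terms in order.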

Also, I'm unfamiliar with the Dutch(?) language, but you might want to stem your queries if possible and avoid leading wildcards - they are quite expensive and lead to lower precision and recall.

0
votes

I improved the relevance by adding a phrase search for the entire string as well. This way we still get the "search in everything" behavior, and the titles are a lot more relevant than the rest.
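
A rough sketch of why the extra phrase clause helps, in plain Java rather than Lucene code. Everything here is an assumption for illustration: `PhraseBonusSketch` is a hypothetical class, substring matching stands in for the `*term*` wildcard clauses, and the scoring (one point per matching term plus a made-up bonus for the whole phrase) is not Lucene's actual scoring formula. The titles are the four from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Locale;

/** Toy ranking: wildcard-style term hits plus a bonus for the full phrase. */
final class PhraseBonusSketch {

    static final List<String> TITLES = Arrays.asList(
        "Wetenschappelijk denken. Een inleiding voor de medische en biomedische "
            + "wetenschappen en voor de andere levenswetenschap.",
        "De wetenschap van de aarde. Over een levende planeet",
        "Atlas van de menselijke levensloop",
        "De wetenschap van het leven. Over eenheid in biologische diversiteit");

    /** One point per term found as a substring, plus phraseBonus if the whole phrase occurs. */
    static double score(String title, List<String> terms, String phrase, double phraseBonus) {
        String lower = title.toLowerCase(Locale.ROOT);
        double total = 0.0;
        for (String term : terms) {
            // terms are assumed to be lowercase already; mimics a *term* wildcard clause
            if (lower.contains(term)) {
                total += 1.0;
            }
        }
        if (lower.contains(phrase.toLowerCase(Locale.ROOT))) {
            total += phraseBonus; // mimics the additional phrase clause
        }
        return total;
    }

    /** Returns the titles sorted by descending score; ties keep their input order. */
    static List<String> rank(List<String> titles, List<String> terms,
                             String phrase, double phraseBonus) {
        List<String> ranked = new ArrayList<>(titles);
        ranked.sort(Comparator.comparingDouble(
            (String title) -> score(title, terms, phrase, phraseBonus)).reversed());
        return ranked;
    }

    public static void main(String[] args) {
        List<String> terms = Arrays.asList("wetenschap", "leven");
        String phrase = "de wetenschap van het leven";
        System.out.println("Without phrase bonus: " + rank(TITLES, terms, phrase, 0.0).get(0));
        System.out.println("With phrase bonus:    " + rank(TITLES, terms, phrase, 10.0).get(0));
    }
}
```

In this toy model three of the four titles tie on the wildcard terms alone, since "wetenschap" and "leven" also occur inside "Wetenschappelijk", "levenswetenschap" and "levende". Only the title containing the complete phrase gets the bonus, so it moves to the top.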