2
votes

Lucene NOOB alert!

I consider myself to be a human of at least reasonable intelligence, but I am having enormous problems grokking the query types within Lucene.

In my particular case I need to search a single string field in my document that is of only moderate length (around 50 characters on average).

I want the user to be able to type just the start of each word in the item they are searching for. I also don't want to dictate the order in which they provide the terms.

Example field : "generic brand strength"

Should match searches : "generic brand strength" "brand generic strength" ... "gen bran str" "bran generic str" ... etc.

It is possible for me to store my information (each word in the example) in separate fields if that would help, but I am not convinced that it would.

I am currently lost in a world of Fuzzy Wildcards and Multi-term Phrases.

Can anyone clarify this whole scenario for me? (And yes, I have looked extensively online for help but cannot find a decent resource).

BTW I am using Lucene 2.9 but I don't think that really matters.

1 Answer

4
votes

You do not need to store each term in a separate field. When the field is analyzed, Lucene splits its text into tokens (one per word, if you are using a whitespace tokenizer), and each token can be searched on its own, which allows for great flexibility of search.

To your question about:

Example field : "generic brand strength"

Should match searches : "generic brand strength" "brand generic strength"

Both of the above searches will return the document, the latter with a lower score because the terms are in a different order. However, "gen bran str", "bran generic str", etc. are trickier: the terms do not appear to be standard "stems" (if they were, you could use a stemming analyzer), so they need prefix matching instead.

The simplest approach would be to:

  1. Split your query phrase on whitespace, so you have a string[].
  2. Use a BooleanQuery and add one query per term, appending a wildcard to the end of each term.

Something like:

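// Lucene.Net 2.9-style; prefix queries are not analyzed, so lower-case
// the terms yourself if your analyzer lower-cases at index time.
string[] terms = query.Split(' ');
BooleanQuery bq = new BooleanQuery();

foreach (string term in terms)
    // PrefixQuery("gen") behaves like the wildcard "gen*"; MUST requires every term.
    bq.Add(new PrefixQuery(new Term("FieldName", term)), BooleanClause.Occur.MUST);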
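
To make it clearer what the per-term wildcard approach is meant to accept and reject, here is a minimal sketch in plain Java that models only the matching rule: every typed term must be the start of some word in the field, in any order. It does not use Lucene at all. The class name PrefixMatchSketch and the methods tokenize and matches are hypothetical, and the lower-casing is an assumption about how the field is analyzed.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;

// Plain-Java model of the matching rule only; this is not Lucene code.
class PrefixMatchSketch {

    // Split on whitespace and lower-case, roughly what a whitespace
    // tokenizer followed by a lower-case filter would produce (assumption).
    static List<String> tokenize(String text) {
        String trimmed = text.trim().toLowerCase(Locale.ROOT);
        if (trimmed.isEmpty()) {
            return Collections.emptyList();
        }
        return Arrays.asList(trimmed.split("\\s+"));
    }

    // True when every query term is a prefix of at least one field token,
    // regardless of the order in which the terms were typed.
    // Note: an empty query has no terms to fail, so it returns true.
    static boolean matches(String fieldValue, String userQuery) {
        List<String> tokens = tokenize(fieldValue);
        for (String term : tokenize(userQuery)) {
            boolean found = false;
            for (String token : tokens) {
                if (token.startsWith(term)) {
                    found = true;
                    break;
                }
            }
            if (!found) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String field = "generic brand strength";
        System.out.println(matches(field, "brand generic strength"));
        System.out.println(matches(field, "bran generic str"));
        System.out.println(matches(field, "gen xyz"));
    }
}
```

In this model the first two calls in main succeed and the third fails, because "xyz" is not the start of any word in the field. A real index does the same lookup against its term dictionary rather than scanning tokens one by one.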

There are better query types, such as SpanQuery, DisMax, etc., but since you mentioned a noob alert, I think the above is the simplest (although probably not the most elegant) approach.

HTH