24
votes

I am building a "Book search" API using Lucene. I need to index the Book Name, Author, and Book Category fields in a Lucene index.

A single book can fall under multiple distinct book categories, for example:

BookName1: fiction, humour, philosophy. BookName2: fiction, science. BookName3: humour, business. BookName4: humour, and so on.

A user should be able to search for all the books under a particular category, say "humour".

Given this situation, how do I index the above fields and build the query in Lucene?


3 Answers

32
votes

A field can occur multiple times in a Lucene document. Create the document, add the values for the name and author, then add the category field once for each category:

  • create new lucene document
  • add name field and value
  • add author field and value
  • for each category:
    • add category field and value
  • add document to index

When you search the index for a category, it will return all documents that have a category field with the value you're after. The category should be a 'Keyword' field, that is, indexed as a single untokenized term (StringField in Lucene 4 and later).
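To see why repeating the field works, here is a minimal sketch in plain Java. It is a toy model and not Lucene's API: the class ToyBookIndex, its methods, and the author names are made up for illustration. Each added value becomes its own exact-match entry pointing back to the document, which is roughly what an untokenized field does.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy stand-in for an index (hypothetical, NOT Lucene's API).
final class ToyBookIndex {
    // "field:value" -> ids of the documents that contain that exact value.
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    private final List<String> names = new ArrayList<>();

    // Adds one book; the category field is added once per category.
    int addBook(String name, String author, List<String> categories) {
        int docId = names.size();
        names.add(name);
        addField(docId, "name", name);
        addField(docId, "author", author);
        for (String category : categories) {
            // With real Lucene (4.x and later) this step would be roughly:
            //   doc.add(new StringField("category", category, Field.Store.YES));
            addField(docId, "category", category);
        }
        return docId;
    }

    private void addField(int docId, String field, String value) {
        postings.computeIfAbsent(field + ":" + value, k -> new TreeSet<>()).add(docId);
    }

    // Exact-match lookup on one field, similar in spirit to a term query.
    List<String> search(String field, String value) {
        List<String> hits = new ArrayList<>();
        for (int docId : postings.getOrDefault(field + ":" + value,
                                               Collections.<Integer>emptySet())) {
            hits.add(names.get(docId));
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyBookIndex index = new ToyBookIndex();
        // Author names are placeholders; the question does not give any.
        index.addBook("BookName1", "Author A", List.of("fiction", "humour", "philosophy"));
        index.addBook("BookName4", "Author B", List.of("humour"));
        System.out.println(index.search("category", "humour"));
    }
}
```

A book with three categories ends up in three posting lists, so a search on any one of them finds it. The exact Lucene classes for creating the document and running the query depend on the version you use.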

I've written it in English because the specific code is slightly different in each Lucene version.

5
votes

You can create a single "category" field in which you list all the categories for a book, separated by spaces.

Then you can search something like:

stock market AND category:(+"business")

Or, if you want to require more than one category (the + makes each term mandatory):

stock market AND category:(+"business" +"philosophy")
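If the categories come from user input, the query strings above can be assembled programmatically. The following is only a sketch: the class CategoryQueryStrings and its methods are hypothetical helpers, and it assumes every category is a single plain word with no spaces, quotes, or other query-syntax characters (the space-separated field needs single-word categories anyway).

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical helper for the space-separated "category" field approach.
final class CategoryQueryStrings {
    // Value to store in the single category field: categories joined by spaces.
    static String joinForIndexing(List<String> categories) {
        return String.join(" ", categories);
    }

    // Builds e.g.  stock market AND category:(+"business" +"philosophy")
    // Assumes each category is one plain word (no escaping is done here).
    static String build(String freeText, List<String> requiredCategories) {
        String clauses = requiredCategories.stream()
                .map(c -> "+\"" + c + "\"")
                .collect(Collectors.joining(" "));
        return freeText + " AND category:(" + clauses + ")";
    }
}
```

The resulting string would then be handed to whatever query parser your Lucene version provides, using an analyzer that splits the category field on whitespace.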
4
votes

I would use Solr instead - it's built on Lucene and managed by the ASF, but is much, much easier to use than Lucene, especially for newcomers.

It offers pretty much all the mainline features of Lucene (certainly everything you'll need for the project you describe), plus extras like snapshotting, replication, schemas, ...

In Solr, you would simply define the fields you want to index in schema.xml, something like this:

<field name="book_id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="book_name" type="text" indexed="true" stored="true" required="true" multiValued="false" />
<field name="book_authors" type="text" indexed="true" stored="true" required="true" multiValued="true" />
<field name="book_categories" type="textTight" indexed="true" stored="true" required="true" multiValued="true" />

Note that the multiValued='true' attribute lets you effectively pass an array or list of values to this field, which Solr splits up and indexes value by value.

Once you have this, start up Solr and you can ask queries like "book_authors:Hemingway" or "book_categories:Romance book_categories:Mills".

Several query handlers come pre-written and configured to parse complex queries (fuzzy matches, boolean operations, scoring boosts, ...). Because Solr's API is exposed over HTTP, a number of client libraries wrap all of this, so you don't need to handle the low-level details of crafting queries yourself.
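Because the API is plain HTTP, a category search is ultimately just a GET request against the select handler. Here is a rough sketch of building such a URL with the standard library only. The base URL (localhost on port 8983, a core named "books") and the class SolrSelectUrl are assumptions for illustration; adjust them to your installation. A client library would normally do this for you.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a select URL for one category.
final class SolrSelectUrl {
    // baseUrl is assumed to look like "http://localhost:8983/solr/books",
    // where "books" is a made-up core name.
    static String forCategory(String baseUrl, String category) {
        // Query in the same field:value form as the examples above,
        // quoted so that the category is treated as one phrase.
        String q = "book_categories:\"" + category + "\"";
        try {
            return baseUrl + "/select?q=" + URLEncoder.encode(q, "UTF-8") + "&wt=json";
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on the JVM, so this should not happen.
            throw new IllegalStateException(e);
        }
    }
}
```

Calling SolrSelectUrl.forCategory("http://localhost:8983/solr/books", "humour") yields a URL you can paste into a browser or pass to any HTTP client; the q parameter carries the URL-encoded query and wt=json asks for a JSON response.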

There is lots of great documentation on their website to get you started.