2
votes

I'm using Lucene.Net to create a website to search books, articles, etc., stored as PDFs. I need to be able to filter my search results by fields such as author name. Can this be done with just Lucene? Or do I need a DB to store the filter fields for each document?

Also, what's the best way to index my documents? I'll have about 50 documents to start with, and periodically I'll have to add a batch of documents to the index, maybe through a web form. Should I use a DB to store the document paths?

Thanks.


2 Answers

2
votes

Lucene has a couple of different Analyzers that can strip out noise words (stop words) and do stemming, which is helpful for full-text searching, but you still need to store the PDF itself somewhere. Lucene.Net is happy to build an index on the file system, and you could add a field, called something like "PATH", to each Document it builds, holding the path to the original file.
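To make the "noise" and stemming steps concrete, here is a rough, library-free Python sketch of what an analyzer does to text before it is indexed. It is not Lucene.Net code: `STOP_WORDS`, `crude_stem` and `analyze` are hypothetical names, the stop-word list is made up for illustration, and the suffix stripping is far cruder than the stemmers real analyzers use.

```python
import re

# Hypothetical stop-word list; real analyzers ship with their own.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def crude_stem(token):
    """Strip a few common English suffixes. Illustrative only, not Porter stemming."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stop words, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]
```

With this sketch, `analyze("Searching the books")` and `analyze("searched book")` both reduce to the same two terms, which is why a query can match a document that uses a different form of the word.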

2
votes

Here is a list of what you need to do IMO:

  1. Extract raw text from PDF - please see this question which recommends iTextSharp for this purpose.
  2. For each PDF document, create a Lucene.Net document with several fields: author, title, document text, and whatever else you want to search. It is recommended to also have a unique id field per document. I suggest you also store a field with the path to the original PDF document.
  3. After indexing all the documents, you will have a Lucene index you can search by fields.
  4. You can add new documents by repeating step 2. It is easier to do this offline; incremental updates are tough.
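
The shape of steps 2 to 4 can be sketched without any library. The Python below is a toy in-memory index, not Lucene.Net's API: `TinyIndex`, its methods, and the sample paths are all hypothetical, and it assumes the text has already been extracted from the PDF (step 1). It only illustrates that author, title, text and path can live in the same index, so filtering by author needs no separate DB.

```python
import re
from collections import defaultdict


class TinyIndex:
    """In-memory stand-in for a search index: one record per PDF."""

    def __init__(self):
        self.docs = {}                    # doc id -> stored fields
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids

    @staticmethod
    def _terms(value):
        # Very simple analysis: lowercase and split on non-alphanumerics.
        return re.findall(r"[a-z0-9]+", value.lower())

    def add(self, doc_id, author, title, text, path):
        """Step 2: index one document. Calling this again later is step 4.

        Re-adding an existing id is not handled; old postings would remain.
        """
        self.docs[doc_id] = {"author": author, "title": title, "path": path}
        for field, value in (("author", author), ("title", title), ("text", text)):
            for term in self._terms(value):
                self.postings[(field, term)].add(doc_id)

    def search(self, query, field="text", author=None):
        """Step 3: ids of documents containing every query term in `field`,
        optionally restricted to documents whose author field matches."""
        term_sets = [self.postings.get((field, t), set()) for t in self._terms(query)]
        hits = set.intersection(*term_sets) if term_sets else set()
        if author is not None:
            for t in self._terms(author):
                hits &= self.postings.get(("author", t), set())
        return sorted(hits)
```

For example, after `index.add("1", "Jane Doe", "Search Basics", "inverted index structures", "/pdfs/basics.pdf")` and `index.add("2", "John Roe", "Index Tuning", "tuning an inverted index", "/pdfs/tuning.pdf")`, calling `index.search("inverted index")` returns both ids, `index.search("inverted index", author="doe")` returns only `"1"`, and `index.docs["1"]["path"]` gives back the stored path to the original file.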