I would like to index a set of documents that contain semi-structured data, typically key-value pairs such as @author Joe Bloggs. These keywords should then be available as searchable attributes of the document, each of which can be queried individually.
I have been looking at Lucene, and I'm able to build an index over the documents I'm interested in, but I'm not sure how best to proceed with the next step of keyword extraction.
Is there a common approach for doing this in Lucene or another indexing system? I'd like to be able to search over the documents using a typical word search, as I can already, so I would like something more than a custom regex extraction.
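To make the goal concrete, here is a minimal sketch in Python (not Lucene code) of the split I have in mind. It assumes each pair sits on its own line in the form @key value; the name split_document and the sample text are made up for illustration. This is only the plain regex step; what I am after is the usual way to get such pairs into the index as separately queryable fields alongside the full text.

```python
import re

# Assumed tag syntax: a whole line of the form "@key value",
# e.g. "@author Joe Bloggs". Real documents may differ.
TAG_LINE = re.compile(r"^@(\w+)[ \t]+(.+?)[ \t]*$")


def split_document(text):
    """Separate '@key value' lines from the free text of a document.

    Returns (fields, body): fields maps each lower-cased key to the list
    of values seen for it, and body is the remaining text.
    """
    fields = {}
    body_lines = []
    for line in text.splitlines():
        match = TAG_LINE.match(line.strip())
        if match:
            key, value = match.groups()
            # A key may occur more than once, so collect values in a list.
            fields.setdefault(key.lower(), []).append(value)
        else:
            body_lines.append(line)
    return fields, "\n".join(body_lines).strip()


# Hypothetical sample document.
sample = """@author Joe Bloggs
@status draft
Notes on indexing semi-structured documents."""

fields, body = split_document(sample)
print(fields)
print(body)
```

Ideally each entry in fields would then be searchable on its own (for example, an author search), while body stays available to the normal word search.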
Any help would be greatly appreciated.
Niall