I would like to index a set of documents that contain semi-structured data, typically key-value pairs such as @author Joe Bloggs. These keys should then be available as searchable attributes of the document, each of which can be queried individually.

I have been looking at Lucene and I'm able to build an index over the documents I'm interested in but I'm not sure how best to proceed with the next step of keyword extraction.

Is there a common approach for doing this in Lucene or another indexing system? I'd like to keep searching the documents with an ordinary word search, as I can already, so I'm looking for something more than a custom regex extraction.

Any help would be greatly appreciated.

Niall

1 Answer

I wrote a source code search engine using Lucene as part of my bachelor thesis. One of its key features was that the source code was treated as structured information and was therefore searchable as such, i.e. searchable by attribute, as you describe above.

Here you can find more information about this project. If that is too extensive for you, I can sum up some things:

  • I created a separate search field for each attribute that should be searchable. In my case those were, for example, 'method name', 'commentary' or 'class name'.
  • It can be advantageous to have the content of these fields overlap, for example by also keeping the attribute values in a general full-text field. This will increase the size of your index, but only linearly with the amount of redundant data in the searchable fields.
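
To make the two points above concrete, here is a minimal sketch in plain Python rather than Lucene itself. It assumes annotations appear one per line in the form "@key value"; the names ANNOTATION, extract_fields and FieldIndex are made up for this illustration and are not part of any library. Each extracted key becomes its own field, and the full text is also kept in a catch-all "body" field, which is the overlap mentioned above.

```python
import re
from collections import defaultdict

# Assumed annotation syntax: "@key value", one annotation per line.
ANNOTATION = re.compile(r"^[ \t]*@(\w+)[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def extract_fields(text):
    """Split a document into named fields plus a catch-all 'body' field."""
    fields = defaultdict(list)
    for key, value in ANNOTATION.findall(text):
        fields[key.lower()].append(value)
    # The body keeps the full text, so annotation values overlap with it.
    fields["body"].append(text)
    return dict(fields)


def tokenize(value):
    """Lowercase a string and split it into word tokens."""
    return re.findall(r"\w+", value.lower())


class FieldIndex:
    """Toy inverted index keyed by (field, term)."""

    def __init__(self):
        self.postings = defaultdict(set)  # (field, term) -> set of doc ids

    def add(self, doc_id, text):
        for field, values in extract_fields(text).items():
            for value in values:
                for term in tokenize(value):
                    self.postings[(field, term)].add(doc_id)

    def search(self, field, term):
        return sorted(self.postings.get((field, term.lower()), set()))


index = FieldIndex()
index.add(1, "@author Joe Bloggs\n@title Indexing notes\nNotes by a colleague of Jane Doe.\n")
index.add(2, "@author Jane Doe\nJoe reviewed this draft.\n")

print(index.search("author", "joe"))  # only documents whose author field contains "joe"
print(index.search("body", "joe"))    # ordinary word search over the full text
```

In Lucene the same idea would map to adding one field per extracted key to each document alongside a full-text field. A query such as author:joe in the standard query parser syntax then targets just that field, while an unqualified term goes to the default field.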