I’ve found multiple questions on SO and elsewhere along the lines of “How can I index and then search relational data in Lucene?”. Quite rightly, these questions are met with the standard response that Lucene is not designed to model data like this. This quote I found sums it up…
A Lucene Index is a Document Store. In a Document Store, a single document represents a single concept with all necessary data stored to represent that concept (compared to that same concept being spread across multiple tables in an RDBMS requiring several joins to re-create).
So I will not ask that question and instead provide my high level requirements and see if any Lucene gurus out there can help me.
- We have data on People (Name, Gender, DOB, Nationality, etc)
- And data on Companies (Name, Country, City, etc).
- We also have data about how these two types of entity relate to each other, i.e. where a person worked at a company (Person, Company, Role, Date Started, Date Ended, etc).
We have two entities – Person and Company – that have their own properties and then properties exist for the many-to-many link between them.
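To make the shape of the data concrete, here is a minimal sketch in plain Python. It is not Lucene code and says nothing about how the index should look; the class names, field names and sample records are all hypothetical and only mirror the properties listed above.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Person:
    person_id: int          # hypothetical surrogate key
    name: str
    gender: str
    dob: date
    nationality: str


@dataclass(frozen=True)
class Company:
    company_id: int         # hypothetical surrogate key
    name: str
    country: str
    city: str


@dataclass(frozen=True)
class Employment:
    """The many-to-many link, which carries its own properties."""
    person_id: int
    company_id: int
    role: str
    date_started: date
    date_ended: Optional[date]  # None means the person still works there


# Made-up sample records, for illustration only.
people = [
    Person(1, "Alice Example", "F", date(1980, 5, 17), "Australian"),
    Person(2, "Bob Example", "M", date(1975, 1, 3), "British"),
]
companies = [
    Company(10, "Acme Ltd", "United Kingdom", "London"),
    Company(11, "Outback Pty", "Australia", "Sydney"),
]
employments = [
    Employment(1, 11, ".Net Developer", date(2008, 1, 1), date(2010, 6, 30)),
    Employment(2, 10, ".Net Developer", date(2007, 3, 1), date(2009, 12, 31)),
    Employment(2, 11, "Team Lead", date(2010, 1, 1), None),
]
```

Note that one person can appear in several `Employment` records, and one company likewise, which is what makes a single flat document per entity awkward.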
Some example searches could be as follows…
- Find all Companies in Australia
- Find all People born between two dates
- Find all People who have worked as a .Net Developer
- Find all males who have worked as a .Net Developer in London
- Find all People who have worked as a .Net Developer between 2008 and 2010
The criteria span all three sets of data. Our requirement is to provide a Faceted Search over the data that accepts any combination of the various properties, of which I have given some examples.
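As a rough illustration of what "spanning all three sets" means, the sketch below expresses three of the example searches as plain Python filters over in-memory records. This is not Lucene and not a proposed index design; the dictionaries, field names, sample data and helper functions are hypothetical. I have also assumed that "between 2008 and 2010" means the employment period overlaps those years.

```python
from datetime import date

# Hypothetical sample data keyed by made-up ids.
people = {
    1: {"name": "Alice Example", "gender": "F", "dob": date(1980, 5, 17)},
    2: {"name": "Bob Example", "gender": "M", "dob": date(1975, 1, 3)},
}
companies = {
    10: {"name": "Acme Ltd", "country": "United Kingdom", "city": "London"},
    11: {"name": "Outback Pty", "country": "Australia", "city": "Sydney"},
}
employments = [
    {"person": 1, "company": 11, "role": ".Net Developer",
     "started": date(2008, 1, 1), "ended": date(2010, 6, 30)},
    {"person": 2, "company": 10, "role": ".Net Developer",
     "started": date(2007, 3, 1), "ended": date(2009, 12, 31)},
    {"person": 2, "company": 11, "role": "Team Lead",
     "started": date(2010, 1, 1), "ended": None},  # None = still employed
]


def companies_in(country):
    """Touches only the Company data."""
    return sorted(c["name"] for c in companies.values()
                  if c["country"] == country)


def males_who_worked_as(role, city):
    """Touches Person (gender), Employment (role) and Company (city)."""
    return sorted({
        people[e["person"]]["name"]
        for e in employments
        if e["role"] == role
        and people[e["person"]]["gender"] == "M"
        and companies[e["company"]]["city"] == city
    })


def worked_as_between(role, first_year, last_year):
    """People whose time in the role overlaps the given years (assumption)."""
    window_start = date(first_year, 1, 1)
    window_end = date(last_year, 12, 31)
    return sorted({
        people[e["person"]]["name"]
        for e in employments
        if e["role"] == role
        and e["started"] <= window_end
        and (e["ended"] is None or e["ended"] >= window_start)
    })


print(companies_in("Australia"))
print(males_who_worked_as(".Net Developer", "London"))
print(worked_as_between(".Net Developer", 2008, 2010))
```

The second and third functions are the awkward ones: the role, the city and the dates must all match on the same employment record, not merely somewhere in a person's history.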
I would like to use Lucene.Net for this. We are a .Net software house and so feel slightly intimidated by Java. However, all suggestions are welcome.
I am aware of the idea that the Index should be constructed with the search in mind. But I can’t seem to come up with a sensible index that would meet all the combinations of search criteria.
- What classes native to Lucene, or what extension points, can we make use of?
- Are there established techniques for doing this kind of thing?
- Are there any third-party open source contributions that I have missed that will help us here?
For now I won’t describe the scenarios we have considered because I don’t want to bloat out this question and make it too intimidating. Please ask me to elaborate where necessary.
embedded databases
– L.B