
I’ve found multiple questions on SO and elsewhere along the lines of “How can I index and then search relational data in Lucene?”. Quite rightly, these questions are met with the standard response that Lucene is not designed to model data like this. This quote I found sums it up…

A Lucene Index is a Document Store. In a Document Store, a single document represents a single concept with all necessary data stored to represent that concept (compared to that same concept being spread across multiple tables in an RDBMS requiring several joins to re-create).

So I will not ask that question. Instead, I will set out my high-level requirements and see if any Lucene gurus out there can help me.

  • We have data on People (Name, Gender, DOB, Nationality, etc)
  • And data on Companies (Name, Country, City, etc).
  • We also have data about how these two types of entity relate to each other where a person worked at the company (Person, Company, Role, Date Started, Date Ended, etc).

We have two entities – Person and Company – that have their own properties and then properties exist for the many-to-many link between them.

Some example searches could be as follows…

  • Find all Companies in Australia
  • Find all People born between two dates
  • Find all People who have worked as a .Net Developer
  • Find all males who have worked as a .Net Developer in London
  • Find all People who have worked as a .Net Developer between 2008 and 2010

The criteria span all three sets of data. Our requirement is to provide a Faceted Search over the data that accepts any combination of the various properties, of which the searches above are only examples.

I would like to use Lucene.Net for this. We are a .Net software house and so feel slightly intimidated by Java. However, all suggestions are welcome.

I am aware of the idea that the Index should be constructed with the search in mind. But I can’t seem to come up with a sensible index that would meet all the combinations of search criteria.

  • What classes native to Lucene, or what extension points, can we make use of?
  • Are there any established techniques for doing this kind of thing?
  • Are there any third-party open source contributions that I have missed that will help us here?

For now I won’t describe the scenarios we have considered because I don’t want to bloat out this question and make it too intimidating. Please ask me to elaborate where necessary.

I don't think that Lucene.Net (or any other text search engine) is very suitable for your needs. Maybe you should go with embedded databases – L.B
Consider asking this on the [email protected] mailing list – Prescott
I second @Prescott's suggestion. It's a friendly list and they're willing to help out if you provide good enough info (which you did here). One suggestion (I don't have much time now): you state "But I can’t seem to come up with a sensible index that would meet all the combinations of search criteria". That really isn't necessary. If you can't get it to work with one conceptual document type (e.g. people with flattened companies), use two (adding companies with flattened people), and so on (overly simplified, by the way). I have zero knowledge of the .Net port. If I were you, I'd omit that in the question to the list – Geert-Jan
Continued: just to get the best possible solutions going. Afterwards you can always check if it's supported in the .Net variant (or you might end up running the Java variant as a standalone server, communicating with it over HTTP from .Net, if that's within spec). – Geert-Jan
Many thanks for the mailing list suggestion. I have asked on there too. – Andy McCluggage

1 Answer


To store both companies and people in a single index, you could create documents with a type field that identifies the type of entities they describe.
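As a rough illustration of the type-field idea, here is a toy in-memory sketch in Python. It does not use Lucene at all; the list stands in for the index, and the field names (`type`, `country`, and so on) and the `search` helper are hypothetical, chosen only for this example.

```python
# Toy stand-in for a single index that holds both kinds of entity.
# Every document carries a "type" field saying what it describes.
# All field names and sample values here are made up for illustration.
index = [
    {"type": "company", "name": "Acme", "country": "Australia", "city": "Sydney"},
    {"type": "company", "name": "Globex", "country": "United Kingdom", "city": "London"},
    {"type": "person", "name": "Jane Doe", "gender": "female", "nationality": "Australian"},
]


def search(docs, **criteria):
    """Return the documents whose fields equal every given criterion."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# "Find all Companies in Australia": restrict by type, then by country.
australian_companies = search(index, type="company", country="Australia")
```

In a real index the same restriction would presumably be one more required clause on the `type` field, combined with whatever other criteria the user picked.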

Birthdays can be stored as date fields.
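One common way to make dates range-searchable in a text index is to encode them as fixed-width strings whose alphabetical order matches chronological order. The sketch below shows that idea in plain Python; the `encode_date` and `born_between` helpers and the `dob` field name are assumptions for this example, not Lucene API.

```python
from datetime import date


def encode_date(d):
    """Encode a date as YYYYMMDD so that string order equals date order."""
    return d.strftime("%Y%m%d")


# Hypothetical person documents with an encoded date-of-birth field.
people = [
    {"type": "person", "name": "Jane Doe", "dob": encode_date(date(1980, 1, 15))},
    {"type": "person", "name": "John Roe", "dob": encode_date(date(1992, 7, 3))},
]


def born_between(docs, start, end):
    """Return person documents whose dob lies in [start, end], inclusive."""
    lo, hi = encode_date(start), encode_date(end)
    return [d for d in docs if d.get("type") == "person" and lo <= d["dob"] <= hi]


# "Find all People born between two dates"
matches = born_between(people, date(1975, 1, 1), date(1985, 12, 31))
```

Comparing the encoded strings is what lets a range query work without parsing dates at search time.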

You could give each person a simple text field containing the names of companies that they worked for. Note that you won't get an error if you enter a company that is not represented by a document in your index. Lucene is not a relational DB tool, but you knew that.
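The following plain-Python sketch illustrates both points: a person document can list company names in a field, and nothing stops it from naming a company that has no document of its own. The `companies` field and both helper functions are hypothetical names used only here.

```python
# Person documents carry the names of the companies they worked for.
# "Initech" is deliberately listed without a matching company document.
people = [
    {"type": "person", "name": "Jane Doe", "companies": ["Acme", "Globex"]},
    {"type": "person", "name": "John Roe", "companies": ["Initech"]},
]
companies = [
    {"type": "company", "name": "Acme"},
    {"type": "company", "name": "Globex"},
]
index = people + companies


def people_who_worked_at(docs, company_name):
    """Return names of person documents that list the given company."""
    return [
        d["name"]
        for d in docs
        if d.get("type") == "person" and company_name in d.get("companies", [])
    ]


def dangling_company_names(docs):
    """Return company names referenced by people but lacking a company document."""
    known = {d["name"] for d in docs if d.get("type") == "company"}
    referenced = {
        c for d in docs if d.get("type") == "person" for c in d.get("companies", [])
    }
    return sorted(referenced - known)
```

Searching for "Initech" still finds John Roe, and only an explicit check such as `dangling_company_names` reveals the missing company document. Any such consistency check would have to live in your own indexing code.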

(Sorry that I've not posted any links to the API; I'm familiar with Lucene Core but not Lucene.NET.)