I'll try to reduce my case to the necessary: I'm building a Webapp (with Spring
) with a search interface that lets you search a corpus of annotated/tagged texts. In my DB (MongoDB
) one document represents one page of a book collection (totaling ~8000 pages).
Here is an example of the Document structure in JSON (I removed a lot of meta data for brevity. Also, and this is important, the "tokens"-array contains up to 700 objects in most cases.):
{
"_id" : ObjectId("5622c29eef86d3c2f23fd62c"),
"scanId" : "592ea208b6d108ee5ae63f79",
"volume" : "Volume I",
"chapters" : [
"Some Chapter Name"
],
"languages" : [
"English",
"German"
],
"tokens" : [
{
"form" : "The",
"index" : 0,
"tags" : [
"ART"
]
},
{
"form" : "house",
"index" : 1,
"tags" : [
"NN",
"NN_P"
]
},
{
"form" : "is",
"index" : 2,
"tags" : [
"V",
"CONJ_C"
]
}
]
}
So you see i don't have a plain text, here. I now want to build an index with Lucene to quickly search this DB. The problem is that i want to be able to search certain words, their tags AND the context around it. Like "give me all documents containing the word 'House' tagged as 'NN' followed by a word tagged with 'V'.". I couldn't find a way to index these sub-structures with native Lucene functionality.
What i tried to do to at least be able to search for words and their tags is the following: In my Lucene index, a document doesn't represent a whole page, but only a word/token with it's tags. So one index document looks like this (expressed in JSON syntax for readability):
{
"token" : "house",
"tag" : "NN",
"tag" : "NN_P",
"index" : 1,
"pageId" : "5622c29eef86d3c2f23fd62c"
}
... Yes, Lucene allows me to use one field multiple times. So now i can search for a word and it's tags and get a reference to the page object in my DB via it's ID. But this is pretty ugly for two reasons: I now have two completely different document representations (DB and Lucene index) and to process a complex query like the one i mentioned above i'd have to query for the word and it's tag and then further check the context of the hits in the retrieved documents manually.
So my question is: Is there a way to index documents in Lucene containing fields/properties whose values are nested objects that in turn have certain properties?