3
votes

Given a search engine like Lucene and a set of XML documents that must be fully preserved, what are the advantages and disadvantages of using the search engine as a key-value store that returns an XML document given the unique primary key each document contains?

3
I can't make out what you mean. Is the idea that you do conventional full-text indexing, or that you use some sort of schema mapping to turn data items from XML into many fields in Lucene? - bmargulies
I want to be able to fetch (or reconstruct, I suppose, but preferably fetch) the original document I put in by a simple query for an unambiguous unique key for that given document. Basically I want to treat the search engine as a SQL database with a primary key field and a CLOB field for each row. - Jherico
Whoever voted to close as not a real question: like hell. My boss specifically wants me to come up with the pros and cons of doing this as an implementation of a KV store for documents (as opposed to using the filesystem or something like CouchDB). - Jherico
Perhaps you might edit the question to include the material in your comment? - bmargulies
You need ACID (or near-ACID) semantics, which Lucene doesn't guarantee. To put it simply, a DB can recover from a failure to the last consistent state. A crash while writing to a Lucene index might render it useless. - Shashikant Kore
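
To make the "last consistent state" point concrete, here is a minimal sketch of the primary-key-plus-CLOB layout described in the comments, using Python's built-in `sqlite3` as a stand-in for "a DB". The table name `docs`, the column names, and the helper functions are made up for illustration. It only demonstrates rollback of a failed batch write, not recovery from an actual process crash, and it says nothing about how Lucene itself behaves.

```python
import sqlite3

def open_store(path=":memory:"):
    """Open (or create) a store with one row per document: key + full XML text."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (id TEXT PRIMARY KEY, xml TEXT NOT NULL)"
    )
    conn.commit()
    return conn

def put_many(conn, docs):
    """Insert all (id, xml) pairs atomically: every row lands, or none does."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.executemany("INSERT INTO docs (id, xml) VALUES (?, ?)", docs)

def get(conn, doc_id):
    """Return the stored XML for doc_id, or None if the key is unknown."""
    row = conn.execute("SELECT xml FROM docs WHERE id = ?", (doc_id,)).fetchone()
    return row[0] if row else None

conn = open_store()
put_many(conn, [("a1", "<doc id='a1'/>")])

try:
    # The second pair reuses key "a1", so the batch fails part-way through.
    put_many(conn, [("b2", "<doc id='b2'/>"), ("a1", "<doc id='a1'>dup</doc>")])
except sqlite3.IntegrityError:
    pass  # the whole batch was rolled back, including the "b2" row
```

After the failed batch, `get(conn, "b2")` returns `None` and `get(conn, "a1")` still returns the original document, which is the kind of guarantee the comment is asking for.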

3 Answers

2
votes

If you use something like Compass and its XML-to-Lucene mapping engine, it's a great solution for storing and querying XML documents without going all the way to an XML database.

One downside is that the XML documents can only be retrieved via the Lucene API (the underlying data store is pretty impenetrable), but I can live with that.

2
votes

Read Search Engine versus DBMS. IMO, your application falls in the DBMS realm, and will probably be best served by a key-value database such as CouchDB. This is because you take no advantage of textual operations such as tokenization or stemming.
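
A rough way to see the difference, as a toy sketch in plain Python: fetching by key is a single dictionary lookup, while text search needs tokenization, stemming, and an inverted index. Everything here (the sample documents, `crude_stem`, `search`) is invented for illustration; the suffix stripping is deliberately naive and is not how Lucene's analyzers work.

```python
import re
from collections import defaultdict

# Hypothetical sample documents keyed by their primary key.
docs = {
    "a1": "<doc id='a1'><title>Running engines</title></doc>",
    "b2": "<doc id='b2'><title>Engine maintenance</title></doc>",
}

def fetch(doc_id):
    """Key-value access: exact key in, original document out."""
    return docs.get(doc_id)

def crude_stem(word):
    """Toy suffix stripping, standing in for a real stemmer."""
    for suffix in ("ing", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def tokens(xml):
    """Strip tags, lowercase, split into words, and stem each word."""
    text = re.sub(r"<[^>]+>", " ", xml)
    return [crude_stem(w) for w in re.findall(r"[a-z]+", text.lower())]

# Inverted index: stemmed term -> set of document keys containing it.
index = defaultdict(set)
for doc_id, xml in docs.items():
    for term in tokens(xml):
        index[term].add(doc_id)

def search(word):
    """Full-text access: a word in, sorted keys of matching documents out."""
    return sorted(index.get(crude_stem(word.lower()), set()))
```

If the application only ever calls something like `fetch`, none of the machinery below it is being used, which is the argument for a plain key-value store.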

0
votes

If all you are going to do is test for key equality and retrieve a blob, Lucene has no visible advantage over, say, bdb. And you have no transactions until you layer something else on top. And concurrency has certain complexities to it. And the API is, well, a bit baroque for the simple thing you are doing.

I've implemented something like what you describe, but actual full text search on the data was a critical requirement that justified the rest.
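
For comparison, the "test for key equality and retrieve a blob" case is only a few lines against a dbm-style store. This sketch uses Python's standard `dbm` module as a stand-in for bdb (it is not Berkeley DB itself, and which backend `dbm.open` picks depends on the platform). The `put`/`get` helpers and the temp-file path are assumptions for illustration.

```python
import dbm
import os
import tempfile

def put(path, doc_id, xml):
    """Store the document bytes under its primary key."""
    with dbm.open(path, "c") as db:  # "c": create the file if it is missing
        db[doc_id.encode("utf-8")] = xml.encode("utf-8")

def get(path, doc_id):
    """Return the document exactly as stored, or None for an unknown key."""
    with dbm.open(path, "c") as db:
        raw = db.get(doc_id.encode("utf-8"))
    return None if raw is None else raw.decode("utf-8")

path = os.path.join(tempfile.mkdtemp(), "docs")
original = "<doc id='a1'><title>Caf\u00e9</title></doc>"
put(path, "a1", original)
```

Note that this gives you no transactions across keys either, so that caveat applies to this style of store as well as to Lucene.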