283
votes

With the NoSQL movement growing around document-based databases, I've been looking at MongoDB lately. I have noticed a striking similarity in how it treats items as "Documents", just like Lucene (and Solr) does.

So, the question: Why would you want to use NoSQL (MongoDB, Cassandra, CouchDB, etc) over Lucene (or Solr) as your "database"?

What I am (and I am sure others are) looking for in an answer is a deep-dive comparison of them. Let's skip over relational database discussions altogether, as they serve a different purpose.

Lucene gives some serious advantages, such as powerful searching and weighting systems, not to mention facets in Solr (and Solr is being merged into Lucene soon, yay!). You can store IDs in Lucene documents and access the documents by ID, just like in MongoDB. Mix it with Solr, and you get a web-service-based, load-balanced solution.

You can even throw in a comparison of out-of-proc cache providers such as Velocity or MemCached when talking about similar data storing and scalability of MongoDB.

The restrictions around MongoDB remind me of using MemCached, but I can use Microsoft's Velocity and have more grouping and list collection power than with MongoDB (I think). You can't get any faster or more scalable than caching data in memory. Even Lucene has a memory provider.

MongoDB (and others) do have some advantages, such as the ease of use of their API. New up a document, create an id, and store it. Done. Nice and easy.

10
Thank you, but that does not answer my question, which is: why would I use MongoDB instead of Lucene for my database? They both handle documents, but Lucene has some very powerful search options. +1 though for actually finding a related question. I searched several times on Stack Overflow and did not come up with a close comparison. - eduncan911
How are you using Lucene that it provides functionality similar to MongoDB? Are you tying it to a relational DB for storage? - Philip Tinney
@Philip: It's a hypothetical question. Why not use Lucene as your document storage? You get a lot more searching power and scalability (when mixed with Solr, making Lucene even easier to use). - eduncan911

10 Answers

257
votes

This is a great question, something I have pondered over quite a bit. I will summarize my lessons learned:

  1. You can easily use Lucene/Solr in lieu of MongoDB for pretty much all situations, but not vice versa. Grant Ingersoll's post sums it up here.

  2. MongoDB etc. seem to serve a purpose where there is no requirement for searching and/or faceting. It appears to be a simpler and arguably easier transition for programmers detoxing from the RDBMS world. Unless one is used to them, Lucene and Solr have a steeper learning curve.

  3. There aren't many examples of using Lucene/Solr as a datastore. The Guardian has made some headway and summarizes it in an excellent slide deck, but they too are non-committal about jumping fully onto the Solr bandwagon and are "investigating" combining Solr with CouchDB.

  4. Finally, I will offer our experience, though unfortunately I cannot reveal much about the business case. We work at the scale of several TB of data in a near-real-time application. After investigating various combinations, we decided to stick with Solr. No regrets thus far (6 months and counting), and we see no reason to switch to something else.

Summary: if you do not have a search requirement, Mongo offers a simple and powerful approach. However, if search is key to your offering, you are likely better off sticking to one tech (Solr/Lucene) and optimizing the heck out of it: fewer moving parts.

My 2 cents, hope that helped.

36
votes

You can't partially update a document in Solr. You have to re-post all of the fields in order to update a document.

And performance matters. If you do not commit, your change to Solr does not take effect; if you commit every time, performance suffers.

There are no transactions in Solr.

Because Solr has these disadvantages, sometimes NoSQL is a better choice.

UPDATE: Solr 4+ supports soft commits alongside regular (hard) commits, as well as atomic (partial) document updates. Refer to the latest documentation: https://lucene.apache.org/solr/guide/8_5/

29
votes

We use MongoDB and Solr together and they perform well. You can find my blog post here, where I describe how we use these technologies together. Here's an excerpt:

[...] However we observe that query performance of Solr decreases when index size increases. We realized that the best solution is to use both Solr and Mongo DB together. Then, we integrate Solr with MongoDB by storing contents into the MongoDB and creating index using Solr for full-text search. We only store the unique id for each document in Solr index and retrieve actual content from MongoDB after searching on Solr. Getting documents from MongoDB is faster than Solr because there is no analyzers, scoring etc. [...]
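To make the pattern in the excerpt concrete, here is a minimal sketch of "ids in the search index, content in the document store". It is not the blog author's code: `content_store` and `search_index` are hypothetical in-memory stand-ins for MongoDB and Solr, and the whitespace tokenizer is a deliberate simplification of what a real analyzer does.

```python
from collections import defaultdict

# Hypothetical stand-in for MongoDB: full documents keyed by unique id.
content_store = {}
# Hypothetical stand-in for the Solr index: term -> set of document ids.
# Note that no document content is kept here, only ids.
search_index = defaultdict(set)


def save(doc_id, text):
    """Store the full content once and index only its id under each term."""
    content_store[doc_id] = {"_id": doc_id, "text": text}
    for term in text.lower().split():
        search_index[term].add(doc_id)


def search(term):
    """Step 1: ask the index for matching ids. Step 2: fetch content by id."""
    ids = sorted(search_index.get(term.lower(), set()))
    return [content_store[i] for i in ids]


save(1, "Solr handles full-text search")
save(2, "MongoDB stores the content")
save(3, "Search with Solr then fetch from MongoDB")

results = search("solr")
```

In a real deployment the two steps would be a Solr query returning only the id field, followed by a lookup by `_id` in MongoDB.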

24
votes

Also please note that some people have integrated Solr/Lucene into Mongo by having all indexes be stored in Solr and also monitoring oplog operations and cascading relevant updates into Solr.

With this hybrid approach you can really have the best of both worlds with capabilities such as full text search and fast reads with a reliable datastore that can also have blazing write speed.

It's a bit technical to set up, but there are lots of oplog tailers that can integrate with Solr. Check out what Rangespan did in this article.

http://denormalised.com/home/mongodb-pub-sub-using-the-replication-oplog.html
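As a rough illustration of the oplog-tailing idea, the sketch below replays a list of change entries into a search index. Everything here is an assumption made for the example: the entries are a simplified shape loosely modelled on the oplog's `i`/`u`/`d` op codes, updates are treated as carrying the changed fields directly, and `solr_index` is a plain dict standing in for Solr. It is not the code from the linked article.

```python
def apply_oplog_entry(index, entry):
    """Mirror one simplified change entry into the search index."""
    op = entry["op"]
    if op == "i":  # insert: index the new document
        index[entry["o"]["_id"]] = dict(entry["o"])
    elif op == "u":  # update: merge changed fields into the indexed copy
        doc_id = entry["o2"]["_id"]
        index[doc_id] = {**index.get(doc_id, {"_id": doc_id}), **entry["o"]}
    elif op == "d":  # delete: drop the document from the index
        index.pop(entry["o"]["_id"], None)
    # Any other op type is ignored in this sketch.


def tail(oplog, index, last_ts=0):
    """Apply every entry newer than last_ts and return the new checkpoint."""
    for entry in oplog:
        if entry["ts"] <= last_ts:
            continue  # already processed on an earlier pass
        apply_oplog_entry(index, entry)
        last_ts = entry["ts"]
    return last_ts


oplog = [
    {"ts": 1, "op": "i", "o": {"_id": 1, "title": "first"}},
    {"ts": 2, "op": "i", "o": {"_id": 2, "title": "second"}},
    {"ts": 3, "op": "u", "o2": {"_id": 1}, "o": {"title": "first, edited"}},
    {"ts": 4, "op": "d", "o": {"_id": 2}},
]

solr_index = {}  # hypothetical stand-in for the Solr index
checkpoint = tail(oplog, solr_index)
```

Keeping the checkpoint is what lets the tailer restart without re-applying old changes.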

12
votes

From my experience with both, Mongo is great for simple, straightforward usage. The main Mongo disadvantage we've suffered is poor performance on unanticipated queries (you cannot create Mongo indexes for all the possible filter/sort combinations, you simply can't).

And this is where Lucene/Solr prevails big time. Especially with filter query caching, performance is outstanding.
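The reason filter caching helps with ad hoc combinations can be sketched in a few lines: each individual filter's matching id set is cached once, and any combination is just an intersection of cached sets. This is a toy model of the idea only, not Solr's actual implementation; `docs`, `filter_cache`, and the `scans` counter are all made up for the example.

```python
# Hypothetical document collection: id -> fields.
docs = {
    1: {"color": "red", "size": "L"},
    2: {"color": "red", "size": "M"},
    3: {"color": "blue", "size": "L"},
}

filter_cache = {}  # (field, value) -> frozenset of matching ids
scans = 0          # how many times we had to scan the whole collection


def matching_ids(field, value):
    """Return the cached id set for one filter, scanning only on a miss."""
    global scans
    key = (field, value)
    if key not in filter_cache:
        scans += 1
        filter_cache[key] = frozenset(
            doc_id for doc_id, doc in docs.items() if doc.get(field) == value
        )
    return filter_cache[key]


def query(*filters):
    """Combine any number of filters by intersecting their cached id sets."""
    result = set(docs)
    for field, value in filters:
        result &= matching_ids(field, value)
    return sorted(result)


red_and_large = query(("color", "red"), ("size", "L"))
large_only = query(("size", "L"))    # served from the cache
red_only = query(("color", "red"))   # served from the cache
```

No per-combination index is needed: two scans cover all three queries above.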

11
votes

Since no one else mentioned it, let me add that MongoDB is schema-less, whereas Solr enforces a schema. So, if the fields of your documents are likely to change, that's one reason to choose MongoDB over Solr.

5
votes

@mauricio-scheffer mentioned Solr 4 - for those interested in that, LucidWorks is describing Solr 4 as "the NoSQL Search Server" and there's a video at http://www.lucidworks.com/webinar-solr-4-the-nosql-search-server/ where they go into detail on the NoSQL(ish) features. (The -ish is for their version of schemaless actually being a dynamic schema.)

1
votes

If you just want to store data in key-value format, Lucene is not recommended, because its inverted index will waste too much disk space. And because the data is saved on disk, its performance is much slower than NoSQL databases such as Redis, which keeps data in RAM. The biggest advantage of Lucene is that it supports many kinds of queries, so fuzzy queries, for example, are supported.
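A small sketch of that trade-off, under obvious simplifications: the key-value store needs one entry per record, while an inverted index needs one posting per (term, record) pair, but the index is what makes term and fuzzy lookups possible. The data is invented for the example, and `difflib.get_close_matches` from the standard library stands in for a real fuzzy query; Lucene's own fuzzy matching works differently.

```python
import difflib
from collections import defaultdict

# Plain key-value storage: one entry per record, lookup by key only.
records = {
    "k1": "lucene builds an inverted index",
    "k2": "redis keeps data in memory",
    "k3": "lucene supports fuzzy queries",
}

# Inverted index over the same data: term -> set of record keys.
inverted = defaultdict(set)
for key, text in records.items():
    for term in text.split():
        inverted[term].add(key)

kv_entries = len(records)
postings = sum(len(keys) for keys in inverted.values())


def fuzzy_lookup(word, cutoff=0.8):
    """Find records containing a term that is close to the given word."""
    close_terms = difflib.get_close_matches(word, list(inverted), n=3, cutoff=cutoff)
    keys = set()
    for term in close_terms:
        keys |= inverted[term]
    return sorted(keys)


hits = fuzzy_lookup("lucen")  # misspelled on purpose
```

Even on three short records the index holds several times as many entries as the key-value store, which is the overhead being described.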

1
votes

Third-party solutions, like a Mongo oplog tail, are attractive. Some questions remain about whether the two could be tightly integrated from a development/architecture perspective. I don't expect to see a tightly integrated solution for these features, for a few reasons (somewhat speculative, subject to clarification, and not up to date with development efforts):

  • mongo is c++, lucene/solr are java
  • lucene supports various doc formats
    • mongo is focused on JSON (BSON)
  • lucene uses immutable documents
    • single field updates are an issue, if they are available
  • lucene indexes are immutable with complex merge ops
  • mongo queries are javascript
  • mongo has no text analyzers / tokenizers (AFAIK)
  • mongo doc sizes are limited, that might go against the grain for lucene
  • mongo aggregation ops may have no place in lucene
    • lucene has options to store fields across docs, but that's not the same thing
    • solr somehow provides aggregation/stats and SQL/graph queries

1
votes

MongoDB Atlas will have a lucene-based search engine soon. The big announcement was made at this week's MongoDB World 2019 conference. This is a great way to encourage more usage of their high revenue MongoDB Atlas product.

I was hoping to see it rolled into the MongoDB Enterprise version 4.2 but there's been no news of bringing it to their on-prem product line.

More info here: https://www.mongodb.com/atlas/full-text-search