Combining Solr 3x-style Master/Slave "Repeater" to feed remote 4x SolrCloud instances?

Question

Solr 3x "Repeaters" and Multiple Data Centers:

Solr 3x let a node behave as both a slave and master, pull from one master, and then feed copies downstream to its own slaves. This was so common/useful it even had a name, a "Repeater".

This was useful if you wanted to span multiple data centers. You could have the real master in data center A (DCA), and a "repeater" in data center B (DCB). That repeater would then grab content from DCA and feed all of the other nodes in DCB, saving on bandwidth.

Suppose you want to upgrade this setup to Solr 4x and SolrCloud. (Note that Solr 4x still supports Solr 3x-style legacy replication.)

It's said that you should NOT have a single SolrCloud cluster span disparate data centers. So data center B should have its own SolrCloud.

One idea is to have the DCA -> DCB link still use Solr 3x-style Master/Slave replication. And then the "repeater" in DCB, being also a SolrCloud node, would automatically propagate the content to the other nodes.

Main question:

Can a Solr node participate in both Solr 3x-style master/slave mode (as a slave) and also be part of a SolrCloud cluster? And if so, how is this configured?

Complications:

In the simple case, if it's just 1 shard with replicas, it's easy to see how that might work in terms of data. It's a little less clear if you have multiple shards in DCB: how do I tell each shard to replicate only its own share of the data? Note that SolrCloud normally replicates via transactions, whereas 3x copies binary index files.

Another complexity is if you're doing replication. How do you tell just the master node for each shard to pull from the remote DCA node?

Alternatives:

One solution is to upgrade to 4x but continue using 3x-style replication in DCB, so just don't use SolrCloud.

I realize that another solution would be to have the data feed send its updates to both data centers, or use something like RabbitMQ. For the sake of this question, let's assume that's not an option (long story...)

Maybe there's some other way I haven't thought of?

Has anybody actually tried having SolrCloud span data centers? How horrible is it?

Somebody must have asked this question before!

But I've looked on Google and, although it finds tons of pages with the keywords, I haven't seen this specific "hybrid" mode fleshed out. I found one thread from 2013 but it didn't really talk about the configuration and complexity.

One of my cohorts has suggested that just not using SolrCloud mode is the way to go, though he admits you lose some of the benefits. — Mark Bennett
How big is the document set you are indexing (number of docs and rough average document size)? What's the normal insertion/update frequency per day? Will you process many deletes? — John Petrone
Thanks @JohnPetrone for the questions; I don't have all the answers yet, it's still a very preliminary project. I think it's under 20 million docs, so I was suspecting that a single shard (replicated) might be able to handle it. — Mark Bennett

John Petrone · Accepted Answer · 2014-07-13T21:44:59

To answer your first question, a Solr slave in 3.X style cannot be a node in a Solr Cloud. The reason is the slave in a master/slave 3.X Solr config simply replicates, byte for byte, all the index files on the master. That's all it does. It can, in the repeater config, then also be a master for others to replicate from, or be a dedicated query slave or both. But that's it.

A node in a Solr Cloud config is a full participant in a distributed computing cluster where indexing is generally intended to be distributed across all nodes, and all nodes participate in queries. It's a very powerful feature which automatically handles failed nodes and significantly eases the work load of scaling up that was very manual in 3.X style.

However, part of what you pay for that is increased complexity (Zookeeper), requirements for lower latency inter-node communications (because all the nodes now talk to each other and to Zookeeper) and the loss of the simplicity of Master/Slave replication.

At 20M docs you are well within the constraints of a single node master index with an effectively unlimited number of slaves and therefore very high query capacity. I do this today with a production environment where each master has on the order of 60M docs in it with no significant problems.

The question is whether you need NRT (near-real-time search), multi-node indexing, automated failover, or the ability to autoscale well past 100M docs. If so, then Master/Slave is probably not going to work for you.

You could take a look at writing the same data to two different Solr Cloud clusters, one in each data center. You could do that directly, or use something like Apache Flume to do it for you. Either way there are some issues with doing this, so the real question is whether dealing with those issues is worth it to get the added benefit of Solr Cloud.
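
As a rough illustration of the "do that directly" option, here is a minimal Python sketch of a feeder that posts the same batch of documents to one cluster per data center and reports which ones failed. Everything specific in it is an assumption, not part of the setup described above: the host names, the `collection1` collection, and the helper names `http_post` and `dual_write` are hypothetical, and it assumes each cluster accepts a JSON array of documents on its `/update` handler. It does not commit, retry, or reconcile, which is exactly where the "issues" come in: a failure on one side leaves the two clusters out of sync until the batch is replayed.

```python
import json
from urllib import request

# Hypothetical update endpoints, one SolrCloud cluster per data center.
CLUSTERS = {
    "DCA": "http://solr-dca.example.com:8983/solr/collection1/update",
    "DCB": "http://solr-dcb.example.com:8983/solr/collection1/update",
}


def http_post(url, body):
    """POST a JSON body to a Solr update handler and return the HTTP status."""
    req = request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with request.urlopen(req, timeout=10) as resp:
        return resp.status


def dual_write(docs, clusters=CLUSTERS, post=http_post):
    """Send the same docs to every cluster.

    Returns a dict mapping cluster name to an error message for each
    cluster that did not accept the batch (empty dict means all succeeded).
    """
    body = json.dumps(docs).encode("utf-8")
    failed = {}
    for name, url in clusters.items():
        try:
            status = post(url, body)
            if status != 200:
                failed[name] = "HTTP %s" % status
        except OSError as exc:  # covers URLError, HTTPError, timeouts
            failed[name] = str(exc)
    return failed
```

A caller would do something like `failed = dual_write([{"id": "1", "title": "hello"}])` and queue the batch for replay against whichever data centers appear in `failed`. The `post` parameter exists only so the transport can be swapped out, for example for a queue-backed sender or a test double.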
