In a SolrCloud setup, there are 8 Solr nodes and 3 ZooKeeper nodes. A single load balancer receives all indexing and search requests and distributes them across the 8 Solr nodes. Before sending a request to a particular Solr node, it checks that the node's service endpoint is active, and forwards the request only if it is. ZooKeeper handles leader election for the shards, but in this setup it plays no part in query distribution. Is this setup bad for distributed queries? What other SolrCloud functionality is lost because the load balancer is doing the query distribution?

Please note that the load balancer is necessary because different clients (Java, Ruby, JavaScript) access the Solr service, and only SolrJ can communicate with ZooKeeper (using the CloudSolrServer class). It also lets us scale the ZooKeeper nodes without changing any client-side settings.

1 Answer

The SolrJ CloudSolrClient has a couple of advantages:

  1. Node autodiscovery: It always knows what nodes are in the cluster, using the same ZK mechanism that the SolrCloud cluster itself uses.

  2. Query-specific routing: Although any request can go to any node in the SolrCloud cluster, many requests will simply be proxied to the node that should actually handle them.

    2a: Indexing requests are routed directly to the leader of the shard handling that document's id. For a bulk-insert request, this can mean several sub-requests, farming out batches of documents directly to each appropriate shard.

    2b: Queries to a collection are routed to a node that has a shard from that collection.

    The CloudSolrClient already knows this stuff and routes directly, avoiding the proxy request within the cluster.
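To make point 2a concrete, here is a rough sketch of what "farming out batches" means: one bulk request is split into one batch per shard, so each batch can go straight to that shard's leader. This is only an illustration, not SolrJ code. The class and method names (`ShardRoutingSketch`, `shardFor`, `groupByShard`) are made up, and `String.hashCode` is a stand-in for the hash that Solr's router actually applies to document ids.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: hypothetical names, not part of SolrJ.
class ShardRoutingSketch {

    // Map a document id to a shard index.
    // ASSUMPTION: String.hashCode is a placeholder; Solr's router uses its own hashing.
    static int shardFor(String docId, int numShards) {
        return Math.floorMod(docId.hashCode(), numShards);
    }

    // Split one bulk request into per-shard batches, keyed by shard index.
    // A routing-aware client would send each batch directly to that shard's leader.
    static Map<Integer, List<String>> groupByShard(List<String> docIds, int numShards) {
        Map<Integer, List<String>> batches = new TreeMap<>();
        for (String id : docIds) {
            batches.computeIfAbsent(shardFor(id, numShards), k -> new ArrayList<>()).add(id);
        }
        return batches;
    }

    public static void main(String[] args) {
        List<String> ids = List.of("doc-1", "doc-2", "doc-3", "doc-4", "doc-5", "doc-6");
        groupByShard(ids, 4).forEach((shard, batch) ->
                System.out.println("shard " + shard + " leader gets " + batch));
    }
}
```

Behind a plain load balancer, the whole bulk request lands on one arbitrary node, and that node does this splitting and forwarding for you inside the cluster.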

All that said, the internal routing requests are pretty lightweight. Without client-side routing, you'll add some latency to the requests, increase internal network traffic, and add the tiniest bit of CPU usage to the SolrCloud cluster.

So what I'm saying is that if it's too difficult to reproduce these advantages, Solr will handle things, and you'll probably get by just fine without them.