I am trying to evaluate Solr for one of my projects, for which I need to check scalability in terms of TPS (transactions per second) for my application. I have configured Solr on one AWS server as a standalone application, which gives me a search query TPS of ~8000 for my query. To test scalability, I sharded the same data across two AWS servers with 2.5 million records each. When I query the cluster with the same query as before, it gives a TPS of ~2500. My understanding is that the TPS should have increased in a cluster, as these are two different machines that will perform separate I/O operations. I am using the query REST endpoint provided by Solr. I have not configured a separate load balancer, as the Solr documentation says that by default SolrCloud performs load balancing in a round-robin fashion. I'd appreciate any help validating my understanding.
1 Answer
What you're seeing is the overhead of Solr having to route (and possibly merge) the request across your cluster, compared to just answering the query locally from the server you're querying.
When querying a single Solr server, that server has all of its data available locally on disk, so your query can be processed within the server without it having to ask any other servers for data. Querying collection1 on the server gives a request flow close to:
client -> request -> server (disk) -> response -> client
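For reference, a minimal SolrJ sketch of that single-node flow (the URL and collection name below are placeholders for your own setup):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SingleNodeQuery {
    public static void main(String[] args) throws Exception {
        // Talks to one specific Solr node; everything is served from its local disk.
        // "http://server1:8983/solr/collection1" is a placeholder URL.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://server1:8983/solr/collection1").build()) {
            QueryResponse response = client.query(new SolrQuery("*:*"));
            System.out.println("Found " + response.getResults().getNumFound() + " documents");
        }
    }
}
```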
When you introduce another server into the cluster but still keep querying the first server, that server might have to ask the other server for data as well - unless the server you're talking to (the first one) has all the documents in the collection you're querying.
So let's say that all the documents for collection2 are located on the second server, but you're still talking to the first server:
client -> request -> server1
(doesn't have the documents, so it'll ask the node that does)
server1 -> request -> server2
server1 <- response <- server2
client <- response <- server1
As you can see, the request now depends on a completely new request between server1 and server2, and your throughput goes down as the complexity of the retrieval goes up. The API is still the same, but what's happening inside Solr has introduced another layer of requests behind your request.
So why does using a cluster-aware client (for example SolrJ with CloudSolrClient) help? It starts off by retrieving the cluster configuration from ZooKeeper, and then queries the second server directly - as it can see that collection2 lives on server2.
client -> request -> zookeeper -> response -> client
client -> request -> server2 -> response -> client
When you're making many queries one after another (as you do when you're benchmarking, indexing or querying under heavy load in an application), only the second row repeats, so the next query can be resolved without querying ZooKeeper first unless an error occurs:
client -> request -> server2 -> response -> client
That's why you're seeing great throughput again when you're using a cloud aware client.
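A minimal sketch of that cluster-aware approach with SolrJ (the ZooKeeper address and collection name are placeholders, and the builder shown is the SolrJ 7+ style):

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQuery {
    public static void main(String[] args) throws Exception {
        // Reads the cluster state from ZooKeeper once, then routes requests
        // straight to the node(s) that actually host the collection.
        // "zk1:2181" and "collection2" are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            client.setDefaultCollection("collection2");
            QueryResponse response = client.query(new SolrQuery("*:*"));
            System.out.println("Found " + response.getResults().getNumFound() + " documents");
        }
    }
}
```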
But there is one more case, which will again affect your throughput: when the collection doesn't exist on just one single server. Say collection3 exists on server2, server3 and server4, and is sharded across the servers (meaning each server only has a section of the index, for example 33.3% of the documents on each server):
client -> server1
          server1 -> server2 |
          server1 <- server2 |
          server1 -> server3 | in parallel
          server1 <- server3 |
          server1 -> server4 |
          server1 <- server4 |
(server1 merges the responses from server2, server3 and server4
 and returns the new, sorted response, cut at the number of rows
 to return)
client <- server1
A cloud-aware client can skip querying server1 and ask server2 directly, saving one request (or, if it really wanted to, it could query the nodes in parallel itself and merge the responses, but I don't think any of the clients do this at the moment).
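If you want to see how much of the drop comes from the fan-out and merge itself, one experiment you could try (my suggestion, not part of your current setup) is to compare a normal query against a shard-local one using Solr's standard distrib parameter:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DistribComparison {
    public static void main(String[] args) throws Exception {
        // "http://server2:8983/solr/collection3" is a placeholder URL.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://server2:8983/solr/collection3").build()) {

            // Normal query: server2 fans the request out to the other shards
            // and merges their responses before answering.
            SolrQuery distributed = new SolrQuery("*:*");

            // Shard-local query: only the shard hosted on server2 answers,
            // so no cross-server requests or merging are involved.
            SolrQuery local = new SolrQuery("*:*");
            local.set("distrib", "false");

            long distributedHits = client.query(distributed).getResults().getNumFound();
            long localHits = client.query(local).getResults().getNumFound();
            System.out.println("distributed=" + distributedHits + " local=" + localHits);
        }
    }
}
```

The difference in latency between the two queries is roughly the cost of the cross-server requests plus the merge.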
The last question, about whether the client needs to be cloud aware for you to scale - it doesn't. A cloud-aware client saves a round trip, but your numbers change so much because of the huge difference between reading from the local disk and having to query across the network. If you add another server to the mix and still query the first Solr node, the throughput will remain the same - it will not go further down. The first node will just query server3 instead of server2 when fetching your collection results.
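And if you'd rather keep the client simple, SolrJ also ships a round-robin load-balancing client that doesn't talk to ZooKeeper; a sketch, with placeholder URLs:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RoundRobinQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs; point these at the nodes hosting your collection.
        try (SolrClient client = new LBHttpSolrClient.Builder()
                .withBaseSolrUrls(
                        "http://server1:8983/solr/collection2",
                        "http://server2:8983/solr/collection2")
                .build()) {
            // Requests are spread across the listed nodes in round-robin fashion,
            // and a node that stops responding is skipped until it recovers.
            QueryResponse response = client.query(new SolrQuery("*:*"));
            System.out.println("Found " + response.getResults().getNumFound() + " documents");
        }
    }
}
```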
Hope that explains the numbers you're seeing!