I am trying to evaluate Solr for one of my projects, for which I need to check scalability in terms of TPS (transactions per second) for my application. I have configured Solr on one AWS server as a standalone application, which gives me a search query TPS of ~8000 for my query. To test scalability, I sharded the same data across two AWS servers with 2.5 million records each. When I query the cluster with the same query as before, it gives a TPS of ~2500. My understanding is that the TPS should have increased in a cluster, as these are two different machines that will perform separate I/O operations. I am using the query REST endpoint provided by Solr. I have not configured a separate load balancer, as the Solr documentation says that by default SolrCloud performs load balancing in a round-robin fashion. I'd appreciate any help validating my understanding.
1 Answer
What you're seeing is the overhead of Solr having to route (and possibly merge) the request across your cluster, compared to just answering the query locally from the server you're querying.
When querying a single Solr server, that server has all of its data available locally on disk, so your query can be processed within the server without it having to ask any other servers for data. Querying collection1 on the server gives a request flow close to:
client -> request -> server (disk) -> response -> client
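For reference, a minimal SolrJ sketch of that single-node flow (the URL and collection name below are placeholders for your own setup):

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class SingleNodeQuery {
    public static void main(String[] args) throws Exception {
        // Talks to one specific Solr node; everything is served from its local disk.
        // "http://server1:8983/solr/collection1" is a placeholder URL.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://server1:8983/solr/collection1").build()) {
            QueryResponse response = client.query(new SolrQuery("*:*"));
            System.out.println("Found " + response.getResults().getNumFound() + " documents");
        }
    }
}
```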
When you introduce another server into the cluster but still keep querying the first server, that server might have to ask the other server for data as well - unless the server you're talking to (the first one) has all the documents in the collection you're querying.
So let's say that all the documents for collection2 are located on the second server, but you're still talking to the first server:
client -> request -> server1
(doesn't have the documents, so it'll ask the node that does)
server1 -> request -> server2
server1 <- response <- server2
client <- response <- server1
As you can see, the request now depends on a completely new request between server1 and server2, and your throughput goes down as the complexity of the retrieval goes up. The API is still the same, but what's happening inside Solr has introduced another layer of requests behind your request.
So why does using a cluster-aware client (for example SolrJ with CloudSolrClient) help? It starts off by retrieving the cluster configuration from ZooKeeper, and then queries the second server directly - as it can see that collection2 lives on server2.
client -> request -> zookeeper -> response -> client
client -> request -> server2 -> response -> client
When you're making many queries one after another (as you do when you're benchmarking, indexing or querying under heavy load in an application), only the second row repeats, so the next query can be resolved without querying ZooKeeper first unless an error occurs:
client -> request -> server2 -> response -> client
That's why you're seeing great throughput again when you're using a cloud aware client.
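A minimal sketch of that cluster-aware approach with SolrJ (the ZooKeeper address and collection name are placeholders, and the builder shown is the SolrJ 7+ style):

```java
import java.util.Collections;
import java.util.Optional;

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class CloudQuery {
    public static void main(String[] args) throws Exception {
        // Reads the cluster state from ZooKeeper once, then routes requests
        // straight to the node(s) that actually host the collection.
        // "zk1:2181" and "collection2" are placeholders.
        try (CloudSolrClient client = new CloudSolrClient.Builder(
                Collections.singletonList("zk1:2181"), Optional.empty()).build()) {
            client.setDefaultCollection("collection2");
            QueryResponse response = client.query(new SolrQuery("*:*"));
            System.out.println("Found " + response.getResults().getNumFound() + " documents");
        }
    }
}
```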
But there is one more case, which will again affect your throughput: when the collection doesn't exist on just one single server. Say collection3 exists on server2, server3 and server4, and is sharded across the servers (meaning each server only has a section of the index, for example 33.3% of the documents on each server):
client -> server1
          server1 -> server2 |
          server1 <- server2 |
          server1 -> server3 | in parallel
          server1 <- server3 |
          server1 -> server4 |
          server1 <- server4 |
(server1 merges the responses from server2, server3 and server4
 and returns the new, sorted response, cut at the number of rows
 to return)
client <- server1
A cloud-aware client can skip querying server1 and ask server2 directly, saving one request (or, if it really wanted to, it could query the nodes in parallel itself and merge the responses, but I don't think any of the clients do this at the moment).
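If you want to see how much of the drop comes from the fan-out and merge itself, one experiment you could try (my suggestion, not part of your current setup) is to compare a normal query against a shard-local one using Solr's standard distrib parameter:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;

public class DistribComparison {
    public static void main(String[] args) throws Exception {
        // "http://server2:8983/solr/collection3" is a placeholder URL.
        try (SolrClient client = new HttpSolrClient.Builder(
                "http://server2:8983/solr/collection3").build()) {

            // Normal query: server2 fans the request out to the other shards
            // and merges their responses before answering.
            SolrQuery distributed = new SolrQuery("*:*");

            // Shard-local query: only the shard hosted on server2 answers,
            // so no cross-server requests or merging are involved.
            SolrQuery local = new SolrQuery("*:*");
            local.set("distrib", "false");

            long distributedHits = client.query(distributed).getResults().getNumFound();
            long localHits = client.query(local).getResults().getNumFound();
            System.out.println("distributed=" + distributedHits + " local=" + localHits);
        }
    }
}
```

The difference in latency between the two queries is roughly the cost of the cross-server requests plus the merge.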
The last question, about whether the client needs to be cloud aware for you to scale - it doesn't. A cloud-aware client saves a round trip, but your numbers change so much because of the huge difference between reading from the local disk and having to query across the network. If you add another server to the mix and still query the first Solr node, the throughput will remain the same - it will not go further down. The first node will just query server3 instead of server2 when fetching your collection results.
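And if you'd rather keep the client simple, SolrJ also ships a round-robin load-balancing client that doesn't talk to ZooKeeper; a sketch, with placeholder URLs:

```java
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.LBHttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class RoundRobinQuery {
    public static void main(String[] args) throws Exception {
        // Placeholder URLs; point these at the nodes hosting your collection.
        try (SolrClient client = new LBHttpSolrClient.Builder()
                .withBaseSolrUrls(
                        "http://server1:8983/solr/collection2",
                        "http://server2:8983/solr/collection2")
                .build()) {
            // Requests are spread across the listed nodes in round-robin fashion,
            // and a node that stops responding is skipped until it recovers.
            QueryResponse response = client.query(new SolrQuery("*:*"));
            System.out.println("Found " + response.getResults().getNumFound() + " documents");
        }
    }
}
```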
Hope that explains the numbers you're seeing!