
I am running Solr 7.6 with nine replicas and one shard.

When we run our full indexing, a few of our nodes go into recovery mode and stay stuck in the recovery state indefinitely.

We have a total of 90k parent docs, and each parent doc has 300 children.

parent doc size: 15 kB
child doc size: 500 B
total time of full indexing: 36-39 mins
batch size: max 1000 parent docs (each with 300 children) = 1000 * 300 docs per update request (see the payload estimate below)
number of threads used for full indexing: 10
average indexing rate: 2400 parent docs/second (each with 300 children)
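
For context, one batch of 1000 parent docs works out to roughly 1000 × 15 kB + 1000 × 300 × 500 B ≈ 15 MB + 150 MB ≈ 165 MB of raw document data per update request, before any JSON overhead.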

commit settings (sketched as a solrconfig.xml snippet below):

autoSoftCommit maxTime: 30 s
autoCommit maxTime: 1 min
numRecordsToKeep: 100
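
For reference, these settings live in solrconfig.xml and would look roughly like the snippet below (a sketch matching the values above; the surrounding config will differ per setup):

```xml
<updateHandler class="solr.DirectUpdateHandler2">

  <!-- hard commit: flushes to stable storage without opening a new searcher -->
  <autoCommit>
    <maxTime>60000</maxTime>            <!-- 1 min -->
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- soft commit: makes newly indexed documents visible to searches -->
  <autoSoftCommit>
    <maxTime>30000</maxTime>            <!-- 30 s -->
  </autoSoftCommit>

  <!-- transaction log; numRecordsToKeep influences whether a lagging replica
       can catch up via peer sync or has to fall back to full replication -->
  <updateLog>
    <str name="dir">${solr.ulog.dir:}</str>
    <int name="numRecordsToKeep">100</int>
  </updateLog>

</updateHandler>
```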

Each of the ten threads fetches data from Cassandra and builds the Solr documents. Once a thread has 1000 parent docs (each with 300 children) ready in its buffer, it pushes the batch to Solr via the update API.
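
For illustration, a minimal SolrJ sketch of what each worker thread does, assuming nested (parent/child) documents and a shared CloudSolrClient; fetchParentsFromCassandra(), ParentRecord, ChildRecord, and the field names are placeholders, not code from the actual job:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.common.SolrInputDocument;

public class IndexerWorker implements Runnable {
    private static final int BATCH_SIZE = 1000;   // parent docs per update request

    private final CloudSolrClient solr;

    public IndexerWorker(CloudSolrClient solr) {
        this.solr = solr;
    }

    @Override
    public void run() {
        List<SolrInputDocument> buffer = new ArrayList<>();
        // fetchParentsFromCassandra() stands in for the real Cassandra read logic
        for (ParentRecord parent : fetchParentsFromCassandra()) {
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", parent.id);
            doc.addField("type", "parent");
            for (ChildRecord child : parent.children) {    // ~300 children per parent
                SolrInputDocument childDoc = new SolrInputDocument();
                childDoc.addField("id", child.id);
                childDoc.addField("type", "child");
                doc.addChildDocument(childDoc);            // nested / block-join indexing
            }
            buffer.add(doc);

            if (buffer.size() >= BATCH_SIZE) {             // ~165 MB of raw doc data
                flush(buffer);
            }
        }
        if (!buffer.isEmpty()) {
            flush(buffer);
        }
    }

    private void flush(List<SolrInputDocument> batch) {
        try {
            solr.add(batch);   // no explicit commit; autoCommit/autoSoftCommit handle visibility
        } catch (Exception e) {
            throw new RuntimeException("update request failed", e);
        } finally {
            batch.clear();
        }
    }

    // placeholders standing in for the real Cassandra access layer
    private Iterable<ParentRecord> fetchParentsFromCassandra() {
        return Collections.emptyList();
    }

    static class ParentRecord {
        String id;
        List<ChildRecord> children;
    }

    static class ChildRecord {
        String id;
    }
}
```

Splitting flush() out this way means dropping BATCH_SIZE (or the thread count) only needs a one-line change if the large update requests turn out to be what pushes the followers into recovery.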

With the above settings, 2-3 nodes go into the recovery state whenever I run the full indexing job.

I have a few questions:

  1. What indexing rate (records per second) can I expect from a single-shard Solr cluster with documents of this size?
  2. Do I need to reduce the number of threads, or the batch size?
1000 parent docs * 20 kB + 300,000 child docs * 4 kB makes 1.2 GB of (JSON?) data that you dump in one shot onto a node? Have you had a look at the Solr logs? I would not be surprised to see OOMs all over the place. Though the speed is absolutely impressive: 2400 * 300 * 4 kB/s = 2.8 GB/s. What kind of storage are you using? None at all, do you keep everything in memory? – Harald
@Harald Sorry, my calculation of the child documents was way off. I have updated the post with the correct values: child doc size is 500 B and parent doc size is 15 kB. – Bikas Katwal
This sounds more reasonable. Still, I think 165 MB in one API call is quite big. Just imagine: the whole JSON is shipped into Solr, where it may be converted to a JSON DOM tree (not sure about this, though), so it may easily grow to two or three times that size in RAM. Multiply that by 10 threads and you'll blow a 5 GB heap easily. Going for smaller chunks may actually help. What do the Solr logs on the failing machines say at the time they start recovery? Any hints about CPU/IO/network overload? I am used to much bigger documents (office mix) and we usually have indexing rates in the low 10-100 dps range. – Harald

1 Answer


We had a similar issue in a couple of our Solr projects where bulk updates were submitted from several threads running at the same time. We were able to resolve it by stopping all Solr instances in the SolrCloud cluster and restarting the updates with a single thread. For some reason, Solr sometimes becomes unable to keep its follower replicas up to date with the leader when more than one process submits updates simultaneously.
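
A minimal sketch of the single-writer variant, reusing the IndexerWorker sketch from the question above; the Solr URL and collection name are placeholders, not values from this answer:

```java
import java.util.Collections;

import org.apache.solr.client.solrj.impl.CloudSolrClient;

public class SingleThreadedReindex {
    public static void main(String[] args) throws Exception {
        // placeholder base URL and collection name; point these at the real cluster
        try (CloudSolrClient solr = new CloudSolrClient.Builder(
                Collections.singletonList("http://solr-host:8983/solr")).build()) {
            solr.setDefaultCollection("mycollection");
            // one writer only: a single update stream puts far less pressure on
            // leader-to-follower replication than 10 concurrent bulk writers
            new IndexerWorker(solr).run();
        }
    }
}
```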