4 votes

We have a cluster of SolrCloud servers with 10 shards and 4 replicas per shard in our stress environment. In our prod environment we will have 10 shards and 15 replicas per shard. Our current commit settings are as follows:

    <autoSoftCommit>
        <maxDocs>500000</maxDocs>
        <maxTime>180000</maxTime>
    </autoSoftCommit>
    <autoCommit>
        <maxDocs>2000000</maxDocs>
        <maxTime>180000</maxTime>
        <openSearcher>false</openSearcher>
    </autoCommit>

We indexed roughly 90 million docs. We have two different ways to index documents:

a) Full indexing. It takes 4 hours to index 90 million docs, and the rate of docs coming to the searchers is around 6,000 per second.
b) Incremental indexing. It takes an hour to index the delta changes. There are roughly 3 million changes, and the rate of docs coming to the searchers is 2,500 per second.

We have two collections, search1 and search2. When we do full indexing, we do it in the search2 collection while search1 is serving live traffic. After it finishes we swap the collections using aliases, so that search2 serves live traffic while search1 becomes available for the next full indexing run. When we do incremental indexing, we do it in the search1 collection, which is serving live traffic.

All our searchers have 12 GB of RAM available and a quad-core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. We have observed the following issue when we trigger indexing: about 10 minutes after we trigger indexing on 14 parallel hosts, the replicas go into recovery mode. This happens to all the shards. In about 20 minutes, more and more replicas go into recovery mode, and after about half an hour all replicas except the leaders are in recovery mode. We cannot throttle the indexing load, as that would increase our overall indexing time. So to overcome this issue, we remove all the replicas before we trigger the indexing and then add them back after the indexing finishes.
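For reference, the remove/re-add workflow above can be scripted against the Collections API (DELETEREPLICA before indexing, ADDREPLICA afterwards). A minimal sketch that just builds the API URLs; the host, collection, shard, and replica names below are placeholders, and in practice the replica list would come from a CLUSTERSTATUS call:

```python
# Sketch: build SolrCloud Collections API URLs for the remove/re-add replica
# workflow. Host, collection, shard, and replica names are placeholders.
from urllib.parse import urlencode

def collections_api_url(base, action, **params):
    """Build a Collections API URL, e.g. for DELETEREPLICA or ADDREPLICA."""
    query = urlencode({"action": action, **params})
    return f"{base}/admin/collections?{query}"

# Drop a follower replica before heavy indexing starts...
delete_url = collections_api_url(
    "http://solr-host:8983/solr", "DELETEREPLICA",
    collection="search2", shard="shard1", replica="core_node4")

# ...and add it back on a target node once indexing finishes.
add_url = collections_api_url(
    "http://solr-host:8983/solr", "ADDREPLICA",
    collection="search2", shard="shard1", node="solr-host2:8983_solr")
```

Each URL can then be fetched with any HTTP client (curl, urllib, etc.) against the live cluster.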

We observe the same behavior of replicas going into recovery when we do incremental indexing. We cannot remove replicas during incremental indexing because the collection is also serving live traffic. We tried to throttle our indexing speed, but the cluster still goes into recovery.

If we leave the cluster as is, it eventually recovers a while after the indexing finishes. But since it is serving live traffic, we cannot have these replicas go into recovery mode, because, as our tests have shown, it degrades search performance.

We have tried different commit settings, like the following:

a) No auto soft commit, no auto hard commit, and a commit triggered at the end of indexing
b) No auto soft commit, auto hard commit, and a commit at the end of indexing
c) Auto soft commit, no auto hard commit
d) Auto soft commit, auto hard commit
e) Different frequency settings for the commits above

Unfortunately, all of the above yield the same behavior: the replicas still go into recovery. We have also increased the ZooKeeper timeout from 30 seconds to 5 minutes, and the problem persists. Is there any setting that would fix this issue?
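For reference, the ZooKeeper timeout mentioned above is the one configured via `zkClientTimeout` in solr.xml (depending on Solr version it can also be passed as the `zkClientTimeout` system property). The value below mirrors the 5-minute setting we tried; treat it as illustrative:

```xml
<solr>
  <solrcloud>
    <!-- ZooKeeper session timeout in ms; default is 30000 (30 s) -->
    <int name="zkClientTimeout">${zkClientTimeout:300000}</int>
  </solrcloud>
</solr>
```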

All our searchers have 12 GB of RAM available and a quad-core Intel(R) Xeon(R) CPU X5570 @ 2.93GHz. There is only one Java process running, i.e., JBoss with Solr in it. All 12 GB is available as heap for the Java process. We have observed that the heap usage of the Java process averages around 8-10 GB. Each searcher has a final index size of 9 GB, so in total there is 9 x 10 (shards) = 90 GB worth of index files. Please NOTE that we have tried a 15-minute soft commit with a 30-minute hard commit, the same time settings for both, and a 30-minute soft commit with a one-hour hard commit. – Vijay Sekhri
I am sorry, I posted wrong information: that was our DEV env configuration by mistake. After double-checking our stress and prod beta envs, where we found the original issue, I found that all the searchers have around 50 GB of RAM available and two JVM instances running (on 2 different ports). Both instances have 12 GB allocated; the remaining 26 GB is available for the OS. The 1st instance on a host has the search1 collection (live collection) and the 2nd instance on the same host has the search2 collection (for full indexing). – Vijay Sekhri
Have you had any luck solving this, @vijay? We are seeing similar problems with our SolrCloud cluster, except our replicas go into recovery every time we optimize our collections. – Simon Tower
@VijaySekhri were you able to solve this issue? We are facing a similar problem at our end. – Bikas Katwal

1 Answer

1 vote

Garbage collection pauses can exceed the ZooKeeper client timeout (zkClientTimeout), causing the ZooKeeper session to be broken, which in turn causes an endless cycle of recovery.
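One mitigation is to tune the JVM for shorter pauses rather than raw throughput. A sketch of flags for a 12 GB heap; exact values depend on heap size and JVM/Solr version, so treat these as a starting point, not a recommendation:

```
# Illustrative HotSpot flags aimed at shorter GC pauses on a 12 GB heap
-Xms12g -Xmx12g               # fixed heap avoids resize pauses
-XX:+UseG1GC                  # pause-target-oriented collector
-XX:MaxGCPauseMillis=250      # keep pauses well under zkClientTimeout
-XX:+ParallelRefProcEnabled   # speed up reference processing
```

The key point is simply that the worst-case pause must stay comfortably below the ZooKeeper session timeout.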

Frequent optimizes, commits, or updates, combined with a poorly tuned segment merge configuration, can result in excessive overhead during recovery. This overhead can cause a recovery loop.
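The merge configuration lives in the `<indexConfig>` section of solrconfig.xml. A sketch using TieredMergePolicy with its default-ish knobs (the element name differs by Solr version; newer releases use `mergePolicyFactory` as shown, older ones use `mergePolicy` directly):

```xml
<indexConfig>
  <!-- Illustrative merge tuning; higher segmentsPerTier defers merge work,
       lower values keep segment counts (and recovery copy sizes) smaller -->
  <mergePolicyFactory class="org.apache.solr.index.TieredMergePolicyFactory">
    <int name="maxMergeAtOnce">10</int>
    <int name="segmentsPerTier">10</int>
  </mergePolicyFactory>
</indexConfig>
```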

Lastly, there seems to be some type of bug that can be encountered during recovery, which our organization has experienced. It is rare, but it seems to happen when network connections are flapping or unreliable. ZooKeeper disconnects trigger a recovery, and the recovery spikes memory usage; sometimes this can even cause an out-of-memory condition.

Update: BEWARE GRAPH QUERIES

The organization I work at experienced pauses from graph queries within Solr. The graph queries were part of a type-ahead plugin/component. When someone submitted long strings to the type-ahead, the graph query grew complex and caused huge memory usage and GC pauses.
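A cheap guard for this failure mode is to cap the user input at the application layer before it ever reaches the graph query, so pathological long strings cannot blow up query complexity. A minimal sketch; `MAX_TYPEAHEAD_CHARS` is an assumed application-level limit, not a Solr setting:

```python
# Sketch: bound raw type-ahead input before building the Solr query.
# MAX_TYPEAHEAD_CHARS is a hypothetical app-side limit, not a Solr option.
MAX_TYPEAHEAD_CHARS = 30

def sanitize_typeahead(text: str) -> str:
    """Trim whitespace and truncate overly long type-ahead input."""
    return text.strip()[:MAX_TYPEAHEAD_CHARS]

print(sanitize_typeahead("  " + "a" * 100))  # prints 30 'a' characters
```

Whatever the exact limit, bounding input length bounds the graph query's complexity, which is what caused the memory spikes above.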