We've been having a number of problems with our Solr search engine in our test environments. We have a SolrCloud setup on version 4.6: single shard, 4 nodes. The failure pattern is: the CPU flatlines at 100% on the leader node for several hours, then the server starts throwing OutOfMemory errors, 'PERFORMANCE WARNING: Overlapping onDeckSearchers' starts appearing in the logs, the filter cache and query result cache warmup times hit around 60 seconds (normally under 2 seconds), the leader node goes down, and we suffer an outage for the whole cluster for a few minutes while it recovers and elects a new leader. We think we're hitting a number of Solr bugs in the 4.6 release and the 4.x branch generally, so we're looking to move to 5.3. We also recently dropped our soft commit interval from 10 minutes to 2 minutes. I am seeing regular CPU spikes every 2 minutes on all nodes, but the spikes are low, 20-50% (of a 100% maximum), on a 2-minute cycle; when the CPU is maxed out I obviously can't see those spikes. Hard commits happen every 15 seconds, with openSearcher set to false. This is a heavy query and heavy indexing load scenario.
I am wondering whether the frequent soft commits are a significant contributor to this problem, or whether the long autowarm times on the caches are caused by the other issues we are experiencing (i.e. are they a cause or a symptom)? We recently increased the indexing load on the server, and we need to address these issues in the test environment before we can promote to production.
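For reference, the commit intervals described above correspond to roughly the following in solrconfig.xml (a sketch reconstructed from the numbers quoted; the actual element layout in our config may differ slightly):

```xml
<!-- Hard commit every 15 seconds, without opening a new searcher -->
<autoCommit>
  <maxTime>15000</maxTime>
  <openSearcher>false</openSearcher>
</autoCommit>

<!-- Soft commit every 2 minutes (recently lowered from 10 minutes);
     each soft commit opens a new searcher and triggers cache autowarming -->
<autoSoftCommit>
  <maxTime>120000</maxTime>
</autoSoftCommit>
```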
Cache settings:
<filterCache class="solr.FastLRUCache"
             size="5000"
             initialSize="5000"
             autowarmCount="1000"/>

<queryResultCache class="solr.LRUCache"
                  size="20000"
                  initialSize="20000"
                  autowarmCount="5000"/>
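One possibly relevant knob: the 'Overlapping onDeckSearchers' warning fires when a new searcher is opened while earlier ones are still autowarming, and the ceiling on concurrent warming searchers is set by maxWarmingSearchers in solrconfig.xml. We haven't changed it, so I assume we're on the default:

```xml
<!-- Default is 2; each warming searcher re-runs the autowarm queries for
     the caches above, so overlapping warmers compound the CPU load -->
<maxWarmingSearchers>2</maxWarmingSearchers>
```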