We've been having a number of problems with our Solr search engine in our test environments. We have a SolrCloud setup on version 4.6: single shard, 4 nodes. CPU flatlines at 100% on the leader node for several hours, then the server starts throwing OutOfMemory errors, 'PERFORMANCE WARNING: Overlapping onDeckSearchers' starts appearing in the logs, the leaders enter recovery mode, and the filter cache and query cache warmup times reach around 60 seconds (normally less than 2 seconds). Finally the leader node goes down, and the whole cluster suffers an outage for a few minutes while it recovers and elects a new leader.

We think we're hitting a number of Solr bugs in 4.6 and the 4.x branch, so we are looking to move to 5.3.

We also recently dropped our soft commit interval from 10 minutes to 2 minutes. I am seeing regular CPU spikes every 2 minutes on all nodes, but the spikes are low, from 20-50% (max 100%). When the CPU is maxed out, I obviously can't see those spikes. Hard commits happen every 15 seconds, with openSearcher set to false. We have both a heavy query load and a heavy indexing load.
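For reference, the commit settings described above would look roughly like this in solrconfig.xml. This is a sketch reconstructed from the intervals stated in the text (15 seconds hard, 2 minutes soft), not a copy of the actual file, so any other options in the real configuration are not shown.

```xml
<!-- solrconfig.xml (sketch): commit settings as described in the text above -->
<updateHandler class="solr.DirectUpdateHandler2">
  <!-- Hard commit every 15 seconds: flushes to disk but does not open a new searcher -->
  <autoCommit>
    <maxTime>15000</maxTime>
    <openSearcher>false</openSearcher>
  </autoCommit>

  <!-- Soft commit every 2 minutes: opens a new searcher, which triggers cache autowarming -->
  <autoSoftCommit>
    <maxTime>120000</maxTime>
  </autoSoftCommit>
</updateHandler>
```

With this layout only the soft commit opens a new searcher, so the 2 minute cycle is the one that would line up with the periodic CPU spikes.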

I am wondering whether the frequent soft commits are contributing significantly to this issue, or whether the long autowarm times on the caches are caused by the other issues we are experiencing (that is, are they a cause or a symptom?). We recently increased the indexing load on the server, but we need to address these issues in the test environment before we can promote to production.

Cache settings:

<filterCache class="solr.FastLRUCache"
                 size="5000"
                 initialSize="5000"
                 autowarmCount="1000"/>

<queryResultCache class="solr.LRUCache"
                      size="20000"
                      initialSize="20000"
                      autowarmCount="5000"/>
This question would get a much better answer on the mailing list, as it is too specific for Stack Overflow. But yes, it does look like your soft commits cause warmups that do not finish by the time the next commit happens, though I would expect 2 minutes to be enough. Do you have a document count threshold as well? Maybe you are triggering that one instead. - Alexandre Rafalovitch
@AlexandreRafalovitch Thanks, I will post there. What doc count threshold are you referring to? The autowarmCount? I'll post the cache settings. - Simon
I have seen this type of error with 4.x, and they were all gone after upgrading to 5.2.1. Basically, with 4.x the way to get rid of it was to reduce the traffic. - Calin Grecu
Have you been using autoCommit? For example, in PHP: SolrClient::addDocument ( SolrInputDocument $doc [, bool $overwrite = true [, int $commitWithin = 0 ]] ); with $commitWithin you can have the document committed automatically within $commitWithin milliseconds. - Dave
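
The document count threshold raised in the comments presumably refers to the maxDocs option, which can sit next to maxTime inside autoCommit or autoSoftCommit. A minimal sketch follows; the value 10000 is hypothetical and does not come from the asker's configuration.

```xml
<!-- solrconfig.xml (sketch): soft commit with both a time and a document count trigger -->
<autoSoftCommit>
  <!-- Commit at most 2 minutes after the first uncommitted update -->
  <maxTime>120000</maxTime>
  <!-- Hypothetical value: also commit once this many documents are pending -->
  <maxDocs>10000</maxDocs>
</autoSoftCommit>
```

When both are set, whichever limit is reached first triggers the commit, so under heavy indexing a maxDocs limit could open new searchers more often than every 2 minutes. The commitWithin parameter from the last comment is a separate mechanism that is passed with each update request rather than configured here.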

1 Answer

We had this problem with Solr 4.10 (and, very rarely, 5.1). In our case, we were indexing quite frequently and commits were starting to become too close together. Sometimes our optimize command would run a bit longer than expected.

We solved it by making sure no indexing or commits occurred for at least ten minutes after the optimize operation started. We also auto warmed fewer queries for our caches. The following links will probably be useful to you if you haven't found them already:
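
As a rough illustration of warming fewer entries, the cache definitions from the question could be kept at the same size with much smaller autowarmCount values. The numbers below are placeholders chosen for the sketch, not the values we actually used, and the maxWarmingSearchers line is an assumption about what the rest of the query section might contain.

```xml
<!-- solrconfig.xml (sketch): same cache sizes as in the question, smaller autowarm counts -->
<query>
  <!-- Placeholder: re-populate only 128 filter entries when a new searcher opens -->
  <filterCache class="solr.FastLRUCache"
               size="5000"
               initialSize="5000"
               autowarmCount="128"/>

  <!-- Placeholder: re-run only 256 cached queries when a new searcher opens -->
  <queryResultCache class="solr.LRUCache"
                    size="20000"
                    initialSize="20000"
                    autowarmCount="256"/>

  <!-- Assumed setting: upper limit on searchers that may be warming concurrently -->
  <maxWarmingSearchers>2</maxWarmingSearchers>
</query>
```

The aim is for warmup to finish well inside the interval between searcher-opening commits, so that a new searcher is not still warming when the next one is requested.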

Overlapping onDeckSearchers--Solr mailing list

The Solr Wiki