Solr irregularly having QTime > 50000ms

Question

I have an installation of solr-4.8.1 under development. I have a script that runs the query id:* once a minute and extracts QTime and numFound. The result looks like follows:

2014-07-07:09:30:12 QTime=10 numFound="6095776"
2014-07-07:09:32:02 QTime=50022 numFound="6095776"
2014-07-07:09:33:02 QTime=30 numFound="6095776"
2014-07-07:09:34:02 QTime=19 numFound="6095776"
2014-07-07:09:35:02 QTime=10 numFound="6095776"
2014-07-07:09:36:52 QTime=50029 numFound="6095776"
2014-07-07:09:37:52 QTime=26 numFound="6095776"
2014-07-07:09:38:52 QTime=12 numFound="6095776"
2014-07-07:09:39:52 QTime=11 numFound="6095776"
2014-07-07:09:40:52 QTime=12 numFound="6095776"

As you can see, once in a while the QTime is >50 seconds.

Further observations:

First I thought it might correlate with the autocommit interval of 5 Minutes, but the times between the long QTimes are irregular, not regularly every 5 minutes. Further, I am not feeding anything.
Apart from the query once a minute (or rather, with a delta of one minute between queries), the machine does absolutely nothing. There is no query load.
The behavior started already when we had far less documents indexed.

Any ideas where to look?

EDIT:

I took a closer look at the logs and find strange things. All 10 core (all on the same machine), report a QTime of 0 or 1, but one of them logs this 50 seconds after the other 9. (loglines cut for brevity)

10:53:55.452 [core000] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core000/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=610796 status=0 QTime=1 
10:53:55.452 [core005] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core005/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=609562 status=0 QTime=0 
10:53:55.453 [core002] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core002/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=610392 status=0 QTime=1 
10:53:55.453 [core004] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core004/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=609186 status=0 QTime=0 
10:53:55.452 [core001] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core001/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=608294 status=0 QTime=0 
10:53:55.454 [core003] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core003/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=612155 status=0 QTime=1 
10:53:55.455 [core006] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core006/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=608606 status=0 QTime=1 
10:53:55.456 [core009] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core009/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=608956 status=0 QTime=1 
10:53:55.458 [core008] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core008/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=608847 status=0 QTime=1 
10:54:45.469 [core007] params={NOW=1404723235442&shard.url=http://x.x.x.x/solr/core007/&df=content&fl=id,score&start=0&q=id:*&distrib=false&wt=javabin&isShard=true&fsv=true&version=2&rows=0} hits=608982 status=0 QTime=1 
10:54:45.470 [core000] params={start=0&q=id:*&wt=xml&rows=0} hits=6095776 status=0 QTime=50028

And it seems to always be exactly 50 seconds (plus some minimal overhead), as if there is some timeout involved, which is exactly 50s. This makes it hard to believe it is the garbage collector (as proposed in one answer and which I will nevertheless lookk after), in particular because it happens independent of the index content.

EDIT: I seem to have found a complete crazy pseudo-solution: trying to figure out what is going on I attached strace to the solr-process. Since then the 50s hangs did not appear anymore.

MatsLindh MatsLindh · Accepted Answer · 2014-07-07T08:43:59

Large spikes in query time usually originates from GC pauses in the container VM for Solr, or (more seldom) imminent hardware failure (where a read from a disk fails). The latter should be indicated in your syslog.

The former is however a broad question, where the amount of memory available, memory dedicated to Solr, load etc. all are contributing factors. A good place to start would be Solr GC Pause Problems on the wiki. The original author of that page also has a guide for his own settings, although your results will vary depending on query profile and available RAM. You can ask the VM to log GC issues to a file, where the type of GC and the total time it runs is shown. If you correlate any extensive GC with when you're seeing the delays in query time, it might give a hint if GC is in fact an issue.

Solr irregularly having QTime > 50000ms

1 Answers