Here is the situation I am facing.
I am migrating from SOLR 4 to SOLR 7. SOLR 4 is running on Tomcat 8, SOLR 7 runs with built in Jetty 9. The largest core contains about 1,800,000 documents (about 3 GB).
The migration went through smoothly. But something's bothering me.
I have a PostFilter to collect only some documents according to a pre-selected list. Here is the code for the org.apache.solr.search.DelegatingCollector:
@Override
protected void doSetNextReader(LeafReaderContext context) throws IOException {
    this.reader = context.reader();
    super.doSetNextReader(context);
}

@Override
public void collect(int docNumber) throws IOException {
    if (null != this.reader && isValid(this.reader.document(docNumber).get("customid"))) {
        super.collect(docNumber);
    }
}

private boolean isValid(String customId) {
    boolean valid = false;
    // customMap is a HashMap<String, String> holding the custom IDs to keep
    // (about 2,000 entries on average)
    if (null != customMap) {
        valid = customMap.get(customId) != null;
    }
    return valid;
}
And here is an example of a query sent to SOLR:
/select?fq=%7B!MyPostFilter%20sessionid%3DWST0DEV-QS-5BEEB1CC28B45580F92CCCEA32727083&q=system%20upgrade
So, the problem is:
It runs pretty fast on SOLR 4, with an average QTime of 30.
But on SOLR 7, it is awfully slow, with an average QTime around 25000!
And I am wondering what the source of such poor performance can be.
With a very simplified (or should I say transparent) collect function (see below), there is no degradation. This test was just to rule out the server/platform as the cause.
@Override
public void collect(int docNumber) throws IOException {
    super.collect(docNumber);
}
My guess is that since LUCENE 7, there have been drastic changes in the way the API accesses documents, but I am not sure I have understood everything. I got that impression from this post: How to get DocValue by document ID in Lucene 7+?
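For illustration, here is a sketch of that API change based on the Lucene javadocs (`leafReader` stands for the per-segment LeafReader; this is an assumption-laden outline, not code from my filter):

// In Lucene 4.x, DocValues supported random access by document ID,
// roughly: BytesRef value = docValues.get(docNumber);
//
// Since Lucene 7, DocValues are forward-only iterators (DocIdSetIterator):
// you must advance to the target document before reading its value.
SortedDocValues values = leafReader.getSortedDocValues("customid");
if (values != null && values.advanceExact(docNumber)) {
    String customId = values.binaryValue().utf8ToString();
    // ... use customId ...
}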
I suppose this has something to do with the issues I am facing, but I have no idea how to upgrade/change my PostFilter and/or DelegatingCollector to get the performance back.
If any LUCENE/SOLR experts could provide some hints or leads, it would be much appreciated. Thanks in advance.
PS: In the core schema:
<field name="customid" type="string" indexed="true" stored="true" required="true" multiValued="false" />
This field is a string type because its values look like "100034_001".
In the solrconfig.xml:
<queryParser name="MyPostFilter" class="solrpostfilter.MyQueryPaser"/>
I can share the full schema and solrconfig.xml files if needed, but so far there is no other particular configuration in there.
EDIT
After some digging in the API, I changed the collect function to the following:
@Override
public void collect(int docNumber) throws IOException {
    if (null != reader) {
        SortedDocValues sortedDocValues = reader.getSortedDocValues("customid");
        // Guard against segments where the field has no DocValues
        if (null != sortedDocValues
                && sortedDocValues.advanceExact(docNumber)
                && isValid(sortedDocValues.binaryValue().utf8ToString())) {
            super.collect(docNumber);
        }
    }
}
Now QTime is down to an average of 1100, which is much, much better, but still far from the 30 I had with SOLR 4.
Not sure it is possible to improve this further, but any other advice/comment is still very welcome. /cheers
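One further avenue, since DocValues are per-segment forward-only iterators, could be to fetch the iterator once per segment in doSetNextReader rather than on every collect call. This is a sketch, not tested against this setup; field name and structure follow the code above:

private SortedDocValues sortedDocValues;

@Override
protected void doSetNextReader(LeafReaderContext context) throws IOException {
    // Fetch the per-segment DocValues iterator once per segment,
    // instead of re-creating it on every collect() call.
    this.sortedDocValues = context.reader().getSortedDocValues("customid");
    super.doSetNextReader(context);
}

@Override
public void collect(int docNumber) throws IOException {
    // docNumber is segment-local here, and collect() is called in increasing
    // doc order within a segment, so a single forward-only iterator suffices.
    if (null != sortedDocValues
            && sortedDocValues.advanceExact(docNumber)
            && isValid(sortedDocValues.binaryValue().utf8ToString())) {
        super.collect(docNumber);
    }
}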
It is the
this.reader.document(docNumber).get("customid")
part that consumes most of the processing time. And when running the query in debug mode, all the time is spent in the process.query part. – Lucas