4 votes

Here is the situation I am facing.

I am migrating from SOLR 4 to SOLR 7. SOLR 4 is running on Tomcat 8; SOLR 7 runs with the built-in Jetty 9. The largest core contains about 1,800,000 documents (about 3 GB).

The migration went through smoothly. But something's bothering me.

I have a PostFilter that collects only the documents belonging to a pre-selected list. Here is the code of the org.apache.solr.search.DelegatingCollector:

@Override
protected void doSetNextReader(LeafReaderContext context) throws IOException {
    // keep a reference to the current segment's reader
    this.reader = context.reader();
    super.doSetNextReader(context);
}

@Override
public void collect(int docNumber) throws IOException {
    // note: this loads the whole stored document for every candidate hit
    if (null != this.reader && isValid(this.reader.document(docNumber).get("customid")))
    {
        super.collect(docNumber);
    }
}

private boolean isValid(String customId) {
    boolean valid = false;
    if (null != customMap) // HashMap<String, String>, contains the custom IDs to keep. Contains an average of 2k items
    {
        valid = customMap.get(customId) != null;
    }

    return valid;
}

And here is an example of a query sent to SOLR:

/select?fq=%7B!MyPostFilter%20sessionid%3DWST0DEV-QS-5BEEB1CC28B45580F92CCCEA32727083&q=system%20upgrade

So, the problem is:

It runs pretty fast on SOLR 4, with an average QTime of 30 ms.

But now, on SOLR 7, it is awfully slow, with an average QTime around 25,000 ms!

And I am wondering what could be the source of such poor performance...

With a very simplified (or should I say transparent) collect function (see below), there is no degradation. This test was just to exclude the server/platform from the equation.

@Override
public void collect(int docNumber) throws IOException {
    super.collect(docNumber);
}

My guess is that since LUCENE 7 there have been drastic changes in the way the API accesses documents, but I am not sure I have understood everything. I got that from this post: How to get DocValue by document ID in Lucene 7+?

I suppose this has something to do with the issues I am facing, but I have no idea how to upgrade/change my PostFilter and/or DelegatingCollector to get back to good performance.
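For context, here is a minimal sketch of that API change, using the reader and field from the code above (the old-API part is indicative only, since the exact signatures differed between Lucene 4.x and 6.x):

// Lucene 4–6 style: doc values offered random access by document id, e.g.
//   SortedDocValues values = reader.getSortedDocValues("customid");
//   BytesRef value = values.get(docNumber); // any doc id, any order (5.x/6.x form)

// Lucene 7 style: doc values are forward-only iterators
SortedDocValues values = reader.getSortedDocValues("customid");
if (values.advanceExact(docNumber)) {      // doc ids must only ever increase
    BytesRef value = values.binaryValue(); // value for the current document
}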

If any LUCENE/SOLR experts could provide some hints or leads, it would be very much appreciated. Thanks in advance.

PS: In the core schema:

<field name="customid" type="string" indexed="true" stored="true" required="true" multiValued="false" />

This field is of type string because its values look like "100034_001".

In the solrconfig.xml:

<queryParser name="MyPostFilter" class="solrpostfilter.MyQueryParser"/>

I can share the full schema and solrconfig.xml files if needed but so far, there is no other particular configuration in there.
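For completeness, the query object produced by this parser has to report a cost of 100 or more and opt out of caching, otherwise Solr will not run it as a post filter. A minimal sketch of such a query class (the class name and bodies here are simplified placeholders, not my exact code):

import org.apache.lucene.search.IndexSearcher;
import org.apache.solr.search.DelegatingCollector;
import org.apache.solr.search.ExtendedQueryBase;
import org.apache.solr.search.PostFilter;

public class MyPostFilterQuery extends ExtendedQueryBase implements PostFilter {

    @Override
    public boolean getCache() {
        return false; // post filters must not be cached
    }

    @Override
    public int getCost() {
        // a cost >= 100 tells Solr to execute this query as a post filter
        return Math.max(super.getCost(), 100);
    }

    @Override
    public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
        return new DelegatingCollector() {
            // the doSetNextReader/collect logic shown above goes here
        };
    }
}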

EDIT

After some digging in the API, I changed the collect function to the following:

@Override
public void collect(int docNumber) throws IOException {
    if (null != reader)
    {
        // note: this fetches a fresh doc-values iterator for every collected document
        SortedDocValues sortedDocValues = reader.getSortedDocValues("customid");
        if (sortedDocValues.advanceExact(docNumber) && isValid(sortedDocValues.binaryValue().utf8ToString()))
        {
            super.collect(docNumber);
        }
    }
}

Now QTime is down to an average of 1,100 ms, which is much, much better, but still far from the 30 ms I had with SOLR 4.

Not sure it is possible to improve this even more, but any other advice/comment is still very welcome. /cheers

Have you tried attaching a profiler and looking at where the time is spent? It should give you a decent idea of what the root cause of the issue is. – MatsLindh
Thanks for the suggestion. I am not fully able to profile the VM, so I tried to hunt things down with logs... And it is really the this.reader.document(docNumber).get("customid") part that consumes most of the processing time. And when running the query in debug mode, all the time is spent in the process.query part. – Lucas
I have edited the question with a new piece of code for the collect function. Things go faster, but it's still slower than with the previous version of SOLR/Lucene. – Lucas
I have tried to reproduce your issue, without success. Please check my sources: github.com/cheffe/solr-postfilter-sample. Alternatively, post a reduced sample that exhibits the issue you have. – cheffe
Thanks for having a look into this. Your sources look pretty much the same as mine, actually. I see in your test that you're adding 180,000 documents. Can you retry with 1,800,000 documents? That's about how many I have in my Solr core; the whole index folder weighs about 3 GB. – Lucas

2 Answers

2 votes

Use a filter query instead of a post filter.

This answer does not attempt to increase the performance of the post filter, but uses a different approach. Nevertheless, I got far better results (a factor of 10) than with any improvement made to the post filter.

Check out my code here: https://github.com/cheffe/solr-postfilter-sample

increase maxBooleanClauses

Visit your solrconfig.xml. There, add or adjust the <query> ... </query> element so that it contains a child element maxBooleanClauses with a value of 10024.

<query>
  <!-- other content left out -->
  <maxBooleanClauses>10024</maxBooleanClauses>
</query>
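(For context: Lucene's default limit is 1024 boolean clauses, and a query exceeding it fails with a TooManyClauses exception, hence the raised limit before sending a filter query with thousands of terms.)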

This will allow you to add a large filter query instead of a post filter.

add all customids as filter query

This query got huge, but the performance was just way better.

fq=customid:(0_001 1_001 2_001 3_001 4_001 5_001 ... 4999_001 5000_001)
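For illustration, such a filter query could be assembled on the client side roughly like this (a minimal SolrJ sketch; the ids list is a hypothetical stand-in for your pre-selected list):

import java.util.Arrays;
import java.util.List;

import org.apache.solr.client.solrj.SolrQuery;

// hypothetical pre-selected ids; in practice there would be a few thousand
List<String> ids = Arrays.asList("0_001", "1_001", "2_001");

// build the large filter query and attach it to the main query
SolrQuery query = new SolrQuery("system upgrade");
query.addFilterQuery("customid:(" + String.join(" ", ids) + ")");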

comparison of execution time: filter query vs. post filter

For 5,000 ids, the post filter took 320 ms; the filter query took 22 ms for the same set of ids.

2 votes

Following the advice of Toke Eskildsen on Solr's user mailing list, in a thread that is quite similar to your question, I got the response time down from 300 ms to 100 ms. Feel free to bring my GitHub repository up on the mailing list; maybe they have further advice.

These measures were the most effective:

  • store the reference to the SortedDocValues during doSetNextReader
  • use org.apache.lucene.index.DocValues to get the above
  • preprocess the given String objects to org.apache.lucene.util.BytesRef during the parsing of the query

The resulting collector looks like this:
public DelegatingCollector getFilterCollector(IndexSearcher searcher) {
  return new DelegatingCollector() {

    private SortedDocValues sortedDocValues; 

    @Override
    protected void doSetNextReader(LeafReaderContext context) throws IOException {
      super.doSetNextReader(context);
      // store the reference to the SortedDocValues 
      // use org.apache.lucene.index.DocValues to do so
      sortedDocValues = DocValues.getSorted(context.reader(), "customid");
    }

    @Override
    public void collect(int docNumber) throws IOException {
      if (sortedDocValues.advanceExact(docNumber) && isValid(sortedDocValues.binaryValue())) {
        super.collect(docNumber);
      }
    }

    private boolean isValid(BytesRef customId) {
      return customSet.contains(customId);
    }

  };
}

Within the extension of the QParserPlugin, I convert the given Strings to org.apache.lucene.util.BytesRef (the ids here are generated synthetically, as in my sample repository):

@Override
public QParser createParser(String qstr, SolrParams localParams, SolrParams params, SolrQueryRequest req) {
  return new QParser(qstr, localParams, params, req) {

    @Override
    public Query parse() throws SyntaxError {
      int idCount = localParams.getInt("count", 2000);
      HashSet<BytesRef> customSet = new HashSet<>(idCount);
      for (int id = 0; id < idCount; id++) {
        // generate test ids of the form 0_001, 1_001, ... and store them as BytesRef
        String customid = id % 200000 + "_" + String.format("%03d", 1 + id / 200000);
        customSet.add(new BytesRef(customid));
      }

      return new IdFilter(customSet);
    }
  };
}
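A note on why the BytesRef preprocessing helps: BytesRef implements equals and hashCode over its byte content, so the HashSet<BytesRef> lookup in isValid operates directly on the value returned by binaryValue(), avoiding a String allocation per collected document. Combined with resolving the doc-values iterator once per segment (in doSetNextReader) rather than once per document, this removes most of the per-hit overhead.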