
I'm dealing with an existing web platform that uses SOLR to generate query-based datasets. We have a requirement for near-real-time (< 1 minute) publishing of new content. There is a caching mechanism in place to reduce load on the SOLR servers, but this caching introduces a lag before new content appears in the SOLR-query-based datasets.

I'd like to be able to invalidate the cache based on the SOLR query that generated a cached item, but I've run into a stumbling block: with 1000+ SOLR queries, it's difficult to know which (if any) of them apply to a given document. The approaches we've identified so far include:

  1. Instantiate a SOLR instance, push a single document in at a time, and run the queries to see which hit.
  2. Build an in-memory Lucene index, and do the same thing.
  3. Use some other technique (hand-rolled parsing of the SOLR query) to get a rough estimate of which queries are affected.

None of these is really ideal, but without some way to "turn the process around" and run the document through the queries, CEP-style (complex event processing), I'm not sure there's a better way.

Has anyone dealt with a similar situation?


2 Answers

1 vote

Solr emits ETags for its query responses and honors the standard HTTP cache request headers such as If-None-Match, If-Match, and so on; see the Solr And HTTP Caches wiki page.

So it's a matter of coordinating your cache system around this.
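
For example, here's a minimal sketch using Java's built-in HttpClient. The host, core name, and query are placeholders, and this assumes HTTP caching hasn't been disabled in solrconfig.xml (the default configs often set never304="true", which turns these headers off):

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class SolrEtagCheck {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();

        // Hypothetical Solr endpoint; adjust host, core, and query to your setup.
        URI queryUri = URI.create(
            "http://localhost:8983/solr/mycore/select?q=category:news&wt=json");

        // First request: capture the ETag Solr returns with the response.
        HttpResponse<String> first = client.send(
            HttpRequest.newBuilder(queryUri).GET().build(),
            HttpResponse.BodyHandlers.ofString());
        String etag = first.headers().firstValue("ETag").orElse(null);

        if (etag != null) {
            // Revalidation: send the ETag back via If-None-Match.
            // A 304 Not Modified means the cached dataset can be kept;
            // a 200 means it should be refreshed.
            HttpResponse<String> second = client.send(
                HttpRequest.newBuilder(queryUri)
                    .header("If-None-Match", etag)
                    .GET().build(),
                HttpResponse.BodyHandlers.ofString());
            System.out.println("Revalidation status: " + second.statusCode());
        }
    }
}
```

One caveat: as I understand it, Solr derives the ETag from the index version, so a 304 tells you that nothing in the index has changed since the last request, not that the results of that particular query are unchanged.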

0 votes

I think the standard way is to build an "index" out of the single changed document (using a memory index). You then run your thousands of queries against this index, and for each query that matches, you invalidate its cached result. Since the index is tiny and lives entirely in memory, this is very fast.
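
Here's a rough sketch of that idea using Lucene's MemoryIndex (from the lucene-memory module). The field names, analyzer, and default field are placeholders, and plain Lucene query syntax won't cover every Solr feature (filter queries, function queries, etc.), so treat it as an approximation rather than an exact answer to "which SOLR queries hit":

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.memory.MemoryIndex;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.Query;

public class QueryMatcher {

    // Returns the query strings (out of your 1000+ cached queries) that match
    // the given document, i.e. the cache entries that need invalidating.
    public static List<String> matchingQueries(Map<String, String> docFields,
                                               List<String> queryStrings) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();

        // Build a tiny, throwaway in-memory index holding just this one document.
        MemoryIndex index = new MemoryIndex();
        docFields.forEach((field, text) -> index.addField(field, text, analyzer));

        // Run every cached query against the single-document index;
        // a score > 0 means the document would appear in that query's results.
        List<String> hits = new ArrayList<>();
        QueryParser parser = new QueryParser("text", analyzer); // "text" = default field (placeholder)
        for (String q : queryStrings) {
            Query parsed = parser.parse(q);
            if (index.search(parsed) > 0.0f) {
                hits.add(q);
            }
        }
        return hits;
    }
}
```

Called right after a document is published, this gives you the subset of cached queries to invalidate without putting any extra load on the real SOLR servers.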