I'm dealing with an existing web platform that uses SOLR to generate query-based datasets. We have an issue with near real-time (< 1 minute) publishing of new content. There is a caching mechanism in place to help reduce resource load on the SOLR servers, but this caching introduces a lag before new content appears in the SOLR-query-based datasets.
I'd like to be able to invalidate the cache based on the SOLR query that generated a cached item, but I've run into a stumbling block: with 1000+ SOLR queries, it's difficult to know which (if any) of them apply to a given document. The approaches we've identified so far include:
- Instantiate a separate SOLR instance, push each new document into it one at a time, and run the queries to see which ones hit.
- Build an in-memory Lucene index, and do the same thing.
- Use some other technique (hand-rolled parsing of the SOLR query) to get a rough estimate of which queries are affected.
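To illustrate the third option, here is a minimal sketch of a rough-estimate matcher. It assumes the cached queries are simple conjunctions of `field:value` clauses (real SOLR syntax is far richer, so this over-approximates); the function names and document shape are hypothetical. For cache invalidation, false positives are tolerable — an unnecessary invalidation just costs one extra refresh.

```python
def parse_clauses(query):
    """Split a query like 'type:article AND category:news' into (field, value) pairs.

    This deliberately ignores OR, NOT, ranges, boosts, etc. -- anything it can't
    parse simply produces no clause, which errs on the side of matching.
    """
    clauses = []
    for token in query.replace(" AND ", " ").split():
        if ":" in token:
            field, value = token.split(":", 1)
            clauses.append((field, value.lower()))
    return clauses


def query_may_match(query, doc):
    """True if every recognized clause is satisfied by the document's field values."""
    return all(
        value in str(doc.get(field, "")).lower()
        for field, value in parse_clauses(query)
    )


def affected_queries(queries, doc):
    """Return the cached queries whose result sets the new document could affect."""
    return [q for q in queries if query_may_match(q, doc)]
```

Usage: given `queries = ["type:article AND category:news", "type:video"]` and a new document `{"type": "article", "category": "news"}`, `affected_queries` returns only the first query, and only that cache entry needs invalidating.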
None of these is really ideal, but without some way to "turn around" the process and run the document through the queries, CEP (complex event processing) style, I'm not sure there's a better way.
Has anyone dealt with a similar situation?