
I am using the Solr Admin UI to build this query:

http://localhost:8983/solr/gencat.imagemetadata/select?q=id:"TH-1961-46483-10968-9"&wt=json&indent=true&facet=true&facet.field=externalid

It returns:

{
  "response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "id": "TH-1961-46483-10968-9",
        "externalid": "100700000_00024"
      }
    ]
  },
  "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "externalid": [
        "100700000_00024",
        1,
        "005471837_00001",
        0,
        "005471837_00002",
        0,
        "005471837_00003",
        0,
        "005471837_00099",
        0,
        ....
      ]
    }
  }
}
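For readers unfamiliar with the response shape: `facet_fields` returns each field as a flat list that alternates value, count, value, count. A minimal Python sketch of pairing these up on the client is shown here; the helper name `facet_pairs` and the `min_count` argument are hypothetical, and the sample list is copied from the response above. This only tidies the output and does nothing to reduce the work Solr does on the server.

```python
from typing import Dict, List, Union


def facet_pairs(flat: List[Union[str, int]], min_count: int = 1) -> Dict[str, int]:
    """Turn Solr's flat [value, count, value, count, ...] list into a dict.

    Entries whose count is below min_count are dropped (client-side only).
    """
    values = flat[0::2]  # even positions hold the facet values
    counts = flat[1::2]  # odd positions hold the matching counts
    return {v: c for v, c in zip(values, counts) if c >= min_count}


# Sample taken from the first entries of the response shown above.
sample = ["100700000_00024", 1, "005471837_00001", 0, "005471837_00002", 0]
nonzero = facet_pairs(sample)
print(nonzero)
```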

My assumption was that it would only return facet counts for the one document it found (since I'm specifying the id I want). Instead, it returns a facet_counts structure with every externalid value indexed by Solr. Granted, all but one entry is 0, and the externalid count for the document matching the query is 1, as it ought to be. But I only want Solr facet counts for the documents in the search results, not for everything. Counting over everything slows down the query significantly.

Yes, I can set facet.mincount=1 so that it only returns facet values with a nonzero count, but under the covers it still appears to look at all of the documents, not just the queried result set. The query above currently takes 2 minutes to execute against our 2+ billion items.

When I turn tracing on in cqlsh, I can see that it is processing all 2+ billion items. If it only counted over the result set, this query would be much, much faster.

externalid is defined like this in the schema file:

<field docValues="true" indexed="true" multiValued="false" name="externalid" stored="true" type="StrField"/>

What am I misunderstanding? The query is slowed down by having to go out and find all of the externalid values just to report that they have a count of 0.

Is there a way to tell Solr faceting to only look at the docs found from the query?

I am on Solr 6 under DSE 6.0

Faceting is done after the query is done (as Mats writes, "iterating over documents that match the query"), so your assumption is correct. Could it be that your query itself is taking very long and that the poor performance has nothing to do with faceting? - Jack Miller

1 Answer


You can set the facet method with the facet.method parameter. fc is the default, and its documented behavior is what you're looking for. Are you sure that DSE is actually using fc as the method by default? By definition, fc should only iterate over documents matching the query:

fc

Calculates facet counts by iterating over documents that match the query and summing the terms that appear in each document.

This is currently implemented using an UnInvertedField cache if the field either is multi-valued or is tokenized (according to FieldType.isTokened()). Each document is looked up in the cache to see what terms/values it contains, and a tally is incremented for each value.

This method is excellent for situations where the number of indexed values for the field is high, but the number of values per document is low. For multi-valued fields, a hybrid approach is used that uses term filters from the filterCache for terms that match many documents. The letters fc stand for field cache.
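
One way to test this is to pass facet.method explicitly and compare timings. Below is a minimal Python sketch that builds the question's query URL with facet.method=fc and facet.mincount=1 added; the helper name `build_facet_url` is hypothetical, and the host, core, and id are copied from the question. Whether DSE honors facet.method the same way stock Solr does is an assumption to verify, not something this snippet confirms.

```python
from urllib.parse import urlencode

# Endpoint copied from the question; adjust for your own host and core.
BASE = "http://localhost:8983/solr/gencat.imagemetadata/select"


def build_facet_url(doc_id: str, field: str = "externalid",
                    method: str = "fc", mincount: int = 1) -> str:
    """Build a select URL that sets the facet method explicitly."""
    params = [
        ("q", f'id:"{doc_id}"'),
        ("wt", "json"),
        ("indent", "true"),
        ("facet", "true"),
        ("facet.field", field),
        ("facet.method", method),          # e.g. "fc" or "enum"
        ("facet.mincount", str(mincount)),  # hide zero-count values
    ]
    return BASE + "?" + urlencode(params)


url = build_facet_url("TH-1961-46483-10968-9")
print(url)
```

Running the same request with a different method value (for example "enum") and comparing the response times would show whether the parameter changes anything in this setup.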