
In Solr 4.10, I have 170,000,000 documents spread across 11 sharded cores. Each document represents an access to my website since 2008, and each of the 11 cores represents one year.

I need to find the accesses for a list of items, so I make a query like the one below:

using facet.field, "QTime": 10557

(after cleaning cache by core reloads)

q=(owningItem:178350+OR+owningItem:51760+OR+owningItem:71585)+AND+statistics_type:view&shards=localhost:8080/solr//statistics-2014,localhost:8080/solr//statistics-2017,localhost:8080/solr//statistics-2016,localhost:8080/solr//statistics-2008,localhost:8080/solr//statistics-2011,localhost:8080/solr//statistics-2012,localhost:8080/solr//statistics-2010,localhost:8080/solr//statistics-2013,localhost:8080/solr//statistics-2009,localhost:8080/solr//statistics-2015,localhost:8080/solr//statistics&facet.limit=4&facet.field=owningItem&facet.mincount=1

The result:

 "facet_counts": {
    "facet_queries": {},
    "facet_fields": {
      "owningItem": [
        "51760",
        3502,
        "71585",
        1860
      ]
    },
    "facet_dates": {},
    "facet_ranges": {},
    "facet_intervals": {}
  },

When I debug this query, I can see that each core returns facet.field values that don't belong to the query results:

response={numFound=953,start=0,maxScore=1.9732983,docs=[]},sort_values={},facet_counts={facet_queries={},facet_fields={owningItem={51760=556,71585=397,**1=0,10=0,100=0,1000=0,10000=0,100000=0,100001=0,100002=0,100003=0,100004=0,100005=0,100007=0,100008=0,10001=0**}},facet_dates={},facet_ranges={},facet_intervals={}}

So I tried using facet.query instead of facet.field:

using facet.query, "QTime": 1346

q=(owningItem:178350+OR+owningItem:51760+OR+owningItem:71585)+AND+statistics_type:view&shards=localhost:8080/solr//statistics-2014,localhost:8080/solr//statistics-2017,localhost:8080/solr//statistics-2016,localhost:8080/solr//statistics-2008,localhost:8080/solr//statistics-2011,localhost:8080/solr//statistics-2012,localhost:8080/solr//statistics-2010,localhost:8080/solr//statistics-2013,localhost:8080/solr//statistics-2009,localhost:8080/solr//statistics-2015,localhost:8080/solr//statistics&facet.limit=4&facet.query=owningItem:178350&facet.query=owningItem:51760&facet.query=owningItem:71585&facet.mincount=1

 "facet_counts": {
    "facet_queries": {
      "owningItem:178350": 0,
      "owningItem:51760": 3502,
      "owningItem:71585": 1860
    },
    "facet_fields": {},
    "facet_dates": {},
    "facet_ranges": {},
    "facet_intervals": {}
  },

And the debug output contains only items that belong to the results:

response={numFound=953,start=0,maxScore=1.9732983,docs=[]},sort_values={},facet_counts={facet_queries={owningItem:178350=0,owningItem:51760=556,owningItem:71585=397},facet_fields={},facet_dates={},facet_ranges={},facet_intervals={}}

I concluded that facet.field is being calculated over more than the results of the Solr query. However, I suspect that this conclusion is not right.

My questions:

  • Why is facet.query faster than facet.field?

  • Is Solr really calculating facet.field over documents that don't belong to the query results?

1 Answer


Since you're running in a sharded environment, each shard has to return more items than the current facet.limit asks for. The reason is that these facet values may have a higher count in one of the other shards. They're not being calculated over documents that do not belong to the query set (otherwise their counts wouldn't have been 0). Faceting also uses the list of indexed terms in the background, since facets can be used to return terms even if the count is 0.

For example, shards 1 and 2 both have foo as the second most popular term, with 30 hits on each shard. Shard 1 has baz as its most popular term with 31 hits but no documents with bar, while shard 2 has bar as its most popular term with 31 hits but no documents with baz. If facet.limit were set to 1 and each shard returned only that number of facets, foo would never be returned, even though it's the most popular term overall, because it isn't the most popular in any single shard.
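The foo/bar/baz example can be replayed with a small simulation. This is only a simplified sketch of the merge step, not Solr's actual implementation; the function names `top_terms` and `merge` are made up for illustration.

```python
from collections import Counter

# Per-shard term counts taken from the foo/bar/baz example.
shard1 = {"baz": 31, "foo": 30}
shard2 = {"bar": 31, "foo": 30}


def top_terms(counts, limit):
    """Return the `limit` highest-count terms of one shard as a dict."""
    return dict(Counter(counts).most_common(limit))


def merge(shard_results, limit):
    """Sum the per-shard counts and keep the overall top `limit` terms."""
    total = Counter()
    for result in shard_results:
        total.update(result)
    return total.most_common(limit)


# Each shard returns exactly facet.limit = 1 term: foo is never seen.
naive = merge([top_terms(s, 1) for s in (shard1, shard2)], 1)

# Each shard returns one extra term (overrequesting): foo wins with 60.
overrequested = merge([top_terms(s, 2) for s in (shard1, shard2)], 1)

print("without overrequest:", naive)
print("with overrequest:   ", overrequested)
```

Without overrequesting, the merged list can only contain baz or bar, because foo was cut off on both shards before the counts were added up.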

This also explains why each server includes values whose count is below the requested mincount. In our previous example, if mincount were set to 31 and that parameter were propagated to each shard, foo would never be returned from the shards. That's why mincount is evaluated only after the final list of facets has been merged. In your case these facets are just the terms that sort first among those with 0 hits, which is a special case: a count of 0 contributes nothing to the end result, but returning those terms does no harm either, since the terms have already been retrieved and their counts calculated.

You can control how Solr performs overrequesting for facets by adjusting facet.overrequest.count (default 10) and facet.overrequest.ratio (default 1.5).

In these situations, each shard is by default asked for the top "facet.overrequest.count + (facet.overrequest.ratio * facet.limit)" constraints.
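As a worked instance of that formula, the sketch below computes the per-shard limit for the facet.limit=4 used in the question. The helper name `shard_facet_limit` is hypothetical, and truncating the product to an integer is an assumption about how the rounding is done (it makes no difference for facet.limit=4).

```python
def shard_facet_limit(facet_limit, overrequest_count=10, overrequest_ratio=1.5):
    """Number of constraints requested from each shard.

    Follows the quoted formula:
    facet.overrequest.count + (facet.overrequest.ratio * facet.limit).
    Truncation to int is an assumption, not confirmed by the text above.
    """
    return overrequest_count + int(overrequest_ratio * facet_limit)


# facet.limit=4 as in the question: 10 + 1.5 * 4
print(shard_facet_limit(4))
```

With the defaults this gives 16, which matches the 16 entries in the per-core debug output of the question: the 2 matching items plus 14 terms with a count of 0.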

When you're using facet queries, none of this has to take place. Each query is run on each server, and the counts for those queries are merged before being returned to the user. There is no risk that a facet.query could return hits that weren't returned by other nodes. In our example above, the queries would return 30 + 30, 31 + 0 and 31 + 0. But you only get statistics about the terms you already know about, not for terms that could be relevant but that you're not querying for. That's the difference.
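For contrast, merging facet.query results is a plain per-query sum, because every shard answers exactly the same fixed set of queries. The sketch below is illustrative only; the `field:` prefix in the query strings and the function name `merge_facet_queries` are hypothetical.

```python
from collections import Counter

# Counts each shard reports for the same three facet queries
# (numbers from the foo/bar/baz example; "field" is a made-up field name).
shard1_counts = {"field:foo": 30, "field:baz": 31, "field:bar": 0}
shard2_counts = {"field:foo": 30, "field:baz": 0, "field:bar": 31}


def merge_facet_queries(*shard_counts):
    """Add up the count of every facet query across all shards."""
    merged = Counter()
    for counts in shard_counts:
        merged.update(counts)
    return dict(merged)


merged = merge_facet_queries(shard1_counts, shard2_counts)
print(merged)
```

No term can be lost here, since nothing is truncated per shard before the sum is taken.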