Solr: Find "Significant Terms" on Subset of Documents

Question

I'm trying to get "significant terms" for a subset of documents in Solr. This may or may not be the best way, but I'm currently attempting to use Solr's TF-IDF functionality since we have the data stored in Solr and it's lightning fast. I want to restrict the "DF" count to a subset of my documents, through a search or a filter. I tried this, where I'm searching for "apple" in the name field:

http://localhost:8983/solr/techproducts/tvrh?q=name:apple&tv.tf=true&tv.df=true&tv.tf_idf=true&indent=on&wt=json&rows=1000

That, of course, returns only documents that have "apple" in the name, but the document frequency is still counted over the entire dataset, which isn't what I want. I would think Solr can do this, but maybe not. I'm open to suggestions.

Thanks, Adrian

Alessandro Benedetti · Accepted Answer · 2017-07-26T10:36:00

This is one of the items I have in my backlog [1].

What you need is the document frequency in your foreground set (your subset of documents) and the document frequency in your background set (your corpus). Solr won't do that out of the box, but you can build it yourself. Elasticsearch has a significant terms aggregation that you can take inspiration from [2].
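To make the foreground/background idea concrete, here is a minimal Python sketch of how the two document frequencies could be combined into a ranking once you have them. It is not a Solr API: the function name `significant_terms`, the input dictionaries, and the sample counts are all hypothetical, and how you collect the counts from Solr is left out. The score (absolute change in rate times relative change in rate) is one simple heuristic in the spirit of the Elasticsearch aggregation; other scoring functions would work as well.

```python
from typing import Dict, List, Tuple


def significant_terms(
    fg_df: Dict[str, int],
    fg_size: int,
    bg_df: Dict[str, int],
    bg_size: int,
    top_n: int = 10,
) -> List[Tuple[str, float]]:
    """Rank terms by how over-represented they are in the foreground set.

    fg_df / bg_df map each term to the number of documents containing it
    in the foreground subset and in the whole corpus, respectively.
    """
    if fg_size <= 0 or bg_size <= 0:
        raise ValueError("set sizes must be positive")

    scored = []
    for term, fg_count in fg_df.items():
        bg_count = bg_df.get(term, 0)
        if fg_count <= 0 or bg_count <= 0:
            continue  # cannot compute a ratio without both counts
        fg_rate = fg_count / fg_size
        bg_rate = bg_count / bg_size
        if fg_rate <= bg_rate:
            continue  # not more common in the subset than in the corpus
        # Heuristic score (an assumption): absolute change * relative change.
        score = (fg_rate - bg_rate) * (fg_rate / bg_rate)
        scored.append((term, score))

    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Made-up counts: 10 documents match name:apple out of a 1000-document corpus.
foreground = {"apple": 10, "ipod": 8, "the": 10}
background = {"apple": 12, "ipod": 9, "the": 1000}
ranked = significant_terms(foreground, 10, background, 1000)
print([term for term, _ in ranked])
```

With these sample numbers, a term like "the" appears in every document of both sets, so its rate does not change and it is filtered out, while terms concentrated in the subset rise to the top.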

[1] https://issues.apache.org/jira/browse/SOLR-9851

[2] https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-significantterms-aggregation.html
