1
votes

I have a few near-duplicate documents stored in Solr. The schema uses an autogenerated UUID as the unique key, so duplicates can get into the index. I need to get the count of duplicated documents based on one or more fields in the schema.

I am trying to get quick numbers without writing a client program that walks the full result set, ideally from the Solr console itself. I tried facets but could not get a total count. The query below returns the number of documents for each duplicated value of 'idfield', but the facet results would have to be paged through to the last page and summed (over a couple of million entries).

q=*:*&facet=true&facet.mincount=2&facet.field=idfield


2 Answers

1
votes

A JSON Facet query can be used to count the unique values of a field, as explained in this blog post: http://yonik.com/solr-count-distinct/
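As a rough illustration of the distinct-count idea, the sketch below builds the request parameters for a `unique(...)` aggregation and derives the number of surplus documents from the reply. The function names and the `distinct` label are made up for this example, and the response shape (`facets.count` next to the named statistic) is assumed to match what the JSON Facet API returns. Documents that lack the field are included in `count` but not in the distinct value, and on sharded collections the distinct figure may be an estimate.

```python
import json
from urllib.parse import urlencode


def distinct_count_params(field):
    """Build select parameters asking for the number of distinct values of `field`."""
    facet = {"distinct": f"unique({field})"}  # "distinct" is an arbitrary label
    return {"q": "*:*", "rows": 0, "json.facet": json.dumps(facet)}


def surplus_duplicates(response, key="distinct"):
    """Total matching documents minus distinct values = documents beyond the first copy."""
    facets = response["facets"]
    return facets["count"] - facets[key]


params = distinct_count_params("idfield")
# Append this to /solr/<core>/select? in a browser or the admin console.
query_string = urlencode(params)
```

The query string can be pasted into the console, so the subtraction is the only step left to do by hand.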

Alternatively, it can be done with the collapse filter by taking a difference: run q=*:*&fq={!collapse field=idfield}, take its numFound, and subtract it from the numFound of the plain match-all query (*:*).
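The subtraction described above can be sketched as follows. The helper names are hypothetical, and the code assumes the standard JSON response layout with `response.numFound`. Note that the collapse filter ignores documents without a value in the field by default, so such documents end up counted in the difference.

```python
def match_all_params():
    """Parameters for the plain match-all query."""
    return {"q": "*:*"}


def collapse_params(field):
    """Parameters for the same query collapsed to one document per value of `field`."""
    return {"q": "*:*", "fq": f"{{!collapse field={field}}}"}


def duplicates_from_numfound(all_response, collapsed_response):
    """numFound(all) - numFound(collapsed) = documents beyond the first per value."""
    total = all_response["response"]["numFound"]
    collapsed = collapsed_response["response"]["numFound"]
    return total - collapsed
```

For example, if the match-all query reports 1000 documents and the collapsed query reports 940, then 60 documents are extra copies.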

0
votes

You can also use facet.mincount=2 and facet on the field that is supposed to be unique to find the duplicated documents. Ex: /solr/core/select?q=*:*&facet=on&facet.field=uniqueidfield&facet.mincount=2&facet.missing=true You can also add facet.limit=-1&rows=0 so that every duplicated value is returned in a single response, without the document list.
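If the facet output does end up being post-processed, summing it is short. This is only a sketch: `summarize_duplicates` is a hypothetical helper, and it assumes the default flat facet layout in which `facet_counts.facet_fields.<field>` is a list alternating value and count, with the `facet.missing` bucket appearing as a null value.

```python
def summarize_duplicates(flat_counts):
    """Summarize a flat [value, count, value, count, ...] facet list."""
    values = flat_counts[0::2]
    counts = flat_counts[1::2]
    # Skip the facet.missing bucket (null value) and anything below two copies.
    groups = [(v, c) for v, c in zip(values, counts) if v is not None and c >= 2]
    docs_in_groups = sum(c for _, c in groups)
    return {
        "duplicated_values": len(groups),          # values that occur more than once
        "docs_in_groups": docs_in_groups,          # all documents sharing such a value
        "surplus": docs_in_groups - len(groups),   # documents beyond the first copy
    }


summary = summarize_duplicates(["a", 3, "b", 2, None, 5])
```

With facet.limit=-1 the whole list arrives in one response, so no paging is needed, though the response can be large for a couple of million values.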