Trying to get distinct field values from search in Solr

Question

I am using Solr 4.10.0 and have indexed some books. Each document in the index is one page of a book, so every document has fields such as PageID, BookID, PageNum, Content, etc. The field definitions in schema.xml are as follows:

<field name="PageID" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="Content" type="text_ar" indexed="true" stored="true" required="true" termVectors="true" />
<field name="PageNum" type="int" indexed="false" stored="true" required="false" multiValued="false" />
<field name="Part" type="int" indexed="false" stored="true" required="false" multiValued="false" />

<field name="BookID" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="BookTitle" type="text_ar" indexed="true" stored="true" required="true" />
<field name="BookInfo" type="text_ar" indexed="true" stored="true" required="true" />
<field name="BookCat" type="int" indexed="false" stored="true" required="false" multiValued="false" />

The problem

When I search the Content field, which contains the page text, I get multiple results from the same book. That is expected, because a given word may appear on many pages of a book. I tried to write queries that behave like SQL DISTINCT, as follows:

Using facet

http://localhost:8080/solr/books/select/?q=Content:WordOfSearch&sort=PageID%20desc&version=2.2&start=0&rows=10&indent=on&wt=json&facet=on&facet.field=BookID&facet.limit=1&hl=true&hl.q=Content:WordOfSearch

In the query above I set facet.field=BookID, hoping to get only one result per book. However, this does not work as expected: the results are the same as when faceting is not used, i.e. adding the facet parameters changes nothing in the result list.

Using group

I used it with and without the group.main parameter, as follows:

http://localhost:8080/solr/books/select/?q=Content:WordOfSearch&sort=PageID%20desc&version=2.2&start=0&rows=10&indent=on&wt=json&group=true&group.field=BookID&group.main=true&hl=true&hl.fl=*&hl.simple.pre=&hl.simple.post=<%2Fspan>

Grouping partially solved the problem: for each book whose pages contain WordOfSearch, it returns one result. However, it breaks the pagination in my application, which relies on numFound in the response for the total number of records. With grouping, numFound still equals the number found by the same query without grouping, i.e. it counts all matching documents, including those with repeated BookID values, so the last pages of the pagination are empty. How can I get the exact number of documents returned when grouping? Or is there another solution to the problem of repeated BookID values?
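To make the pagination mismatch concrete, here is a small sketch with made-up numbers (the counts 95 and 12 are hypothetical, not taken from my index). It only shows the arithmetic: the page count derived from numFound is larger than the number of pages that actually contain grouped results.

```python
import math


def page_count(total, rows):
    """Number of pages needed to show `total` items at `rows` per page."""
    return math.ceil(total / rows)


# Hypothetical numbers, for illustration only.
matching_pages = 95   # what numFound reports: every matching page document
distinct_books = 12   # results actually returned when grouping by BookID
rows = 10

pages_from_num_found = page_count(matching_pages, rows)  # what the pager shows
pages_with_results = page_count(distinct_books, rows)    # pages that have content

print(pages_from_num_found, pages_with_results)
```

With these numbers the pager would offer 10 pages, but only the first 2 would contain results.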

Alexandre Rafalovitch · Accepted Answer · 2014-12-19T16:55:52

It sounds like you are trying to find the list of books that contain pages with the keywords you want, and that you don't care about the specific pages.

In that case, you may want to have a separate set of documents representing books (as opposed to just pages) and use the Join Query Parser to do the search.
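A minimal sketch of what such a query could look like, under some assumptions: the book-level documents live in the same core as the pages, they share the BookID field with their pages, and they carry a hypothetical DocType field (not part of the schema in the question) set to book. The `{!join from=... to=...}` prefix is Solr's join query parser syntax; the host, core, and search term are the ones from the question, and the helper name is made up for illustration.

```python
from urllib.parse import urlencode

# Base URL taken from the question; adjust host, port, and core as needed.
SOLR_SELECT = "http://localhost:8080/solr/books/select"


def build_book_search_url(word, start=0, rows=10):
    """Build a select URL that returns book documents whose pages match `word`."""
    params = {
        # Find pages matching the word, then join from their BookID values
        # to all documents that have the same BookID.
        "q": "{!join from=BookID to=BookID}Content:" + word,
        # Assumption: book-level documents are marked with DocType:book.
        # Without this filter, the page documents of those books match too.
        "fq": "DocType:book",
        "start": start,
        "rows": rows,
        "wt": "json",
    }
    return SOLR_SELECT + "?" + urlencode(params)


url = build_book_search_url("WordOfSearch")
print(url)
```

Because each book is then a single document, numFound counts books rather than pages, so pagination based on numFound should line up. Note that the results are the book documents, so page-level highlighting of Content would need a separate query.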
