I have an index of people based on text documents that they have authored. This is the field type:
<fieldtype name="TField" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.StopFilterFactory" />
    <filter class="solr.PorterStemFilterFactory" />
    <filter class="solr.PositionFilterFactory" />
  </analyzer>
</fieldtype>
And the field declaration itself:
<field name="Publication" type="TField" indexed="true" stored="true" multiValued="true" />
And the request handler config:
<requestHandler name="/select/" class="solr.StandardRequestHandler" default="true" >
  <lst name="defaults" >
    <str name="defType">edismax</str>
    <str name="qf">Publication</str>
    <str name="fl">ID,score</str>
    <str name="q.alt">*:*</str>
    <str name="rows">10</str>
  </lst>
</requestHandler>
Ideally, I want to process a text query and return people scored by how many of their Publication entries match that query. For example:
Person A has three documents with content "cat dog mouse", "cat dog house", "banana"
Person B has three documents with content "cat dog mouse", "cat", "dog"
Person C has three documents with content "cat", "dog", "banana"
If the text query is "cat dog", I would like Person A to be top with score 2 (matching "cat dog mouse" and "cat dog house"), Person B to be second with score 1 (matching "cat dog mouse"), and Person C not to be returned at all.
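To make the target behavior concrete, here is a minimal sketch in plain Python (not Solr code). The names `people`, `per_entry_score`, and `rank` are made up for illustration, and matching is simplified to whitespace-separated terms with no stemming or stop words:

```python
# Hypothetical in-memory copy of the example data above; not Solr code.
people = {
    "A": ["cat dog mouse", "cat dog house", "banana"],
    "B": ["cat dog mouse", "cat", "dog"],
    "C": ["cat", "dog", "banana"],
}


def per_entry_score(publications, query):
    """Count the Publication entries that contain every query term."""
    terms = set(query.split())
    return sum(1 for text in publications if terms <= set(text.split()))


def rank(people, query):
    """Return (person, score) pairs with score > 0, highest score first."""
    scores = {name: per_entry_score(pubs, query) for name, pubs in people.items()}
    hits = [(name, score) for name, score in scores.items() if score > 0]
    return sorted(hits, key=lambda item: item[1], reverse=True)


print(rank(people, "cat dog"))  # [('A', 2), ('B', 1)]
```

The key point is that the match is evaluated per entry, and a person is dropped when no single entry contains all of the query terms.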
My current implementation fails to do this on two fronts. First, it returns Person C: Solr seems to merge the contents of the multiValued field into one, so the fact that "cat" and "dog" appear in separate Publication entries for Person C doesn't seem to matter.
Second, and closely related to the first, the documents are scored with TF-IDF over the concatenation of all values in the Publication field. Persons A and B therefore end up with the same score, since "cat" and "dog" each appear the same number of times across their whole document corpus.
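As a rough illustration of why the scores tie, the sketch below counts query terms over the concatenated entries. It is plain Python that only mimics raw term frequency, not Solr's actual scoring formula, and the names `people` and `flattened_term_counts` are hypothetical:

```python
from collections import Counter

# Hypothetical in-memory copy of the example data; not Solr code.
people = {
    "A": ["cat dog mouse", "cat dog house", "banana"],
    "B": ["cat dog mouse", "cat", "dog"],
    "C": ["cat", "dog", "banana"],
}


def flattened_term_counts(publications, query):
    """Count each query term across all entries joined into one text."""
    counts = Counter(" ".join(publications).split())
    return {term: counts[term] for term in query.split()}


for name, pubs in people.items():
    print(name, flattened_term_counts(pubs, "cat dog"))
# A {'cat': 2, 'dog': 2}
# B {'cat': 2, 'dog': 2}
# C {'cat': 1, 'dog': 1}
```

Seen this way, A and B are indistinguishable, and C still matches because both terms occur somewhere in the field.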
Is there any way to achieve what I am looking for? More generally, is there any way to score documents based on matching individual entries of a multiValued field instead of taking all the entries as a whole?