I have an index of people based on text documents that they have authored. This is the field type:

    <fieldtype name="TField" class="solr.TextField" positionIncrementGap="100" omitNorms="true">
        <analyzer>
           <tokenizer class="solr.StandardTokenizerFactory"/>
           <filter class="solr.LowerCaseFilterFactory" />
           <filter class="solr.StopFilterFactory" />
           <filter class="solr.PorterStemFilterFactory" />
           <filter class="solr.PositionFilterFactory" />
        </analyzer>
    </fieldtype>

And the field declaration itself:

    <field name="Publication" type="TField" indexed="true" stored="true" multiValued="true" />

And the request handler config:

    <requestHandler name="/select/" class="solr.StandardRequestHandler" default="true" >
        <lst name="defaults" >
          <str name="defType">edismax</str>
          <str name="qf">Publication</str>
          <str name="fl">ID,score</str>
          <str name="q.alt">*:*</str>
          <str name="rows">10</str>
        </lst>
    </requestHandler>

The ideal scenario is to process a text query and return people scored on how many Publications that text query matches. For example:

Person A has three documents with content "cat dog mouse", "cat dog house", "banana"

Person B has three documents with content "cat dog mouse", "cat", "dog"

Person C has three documents with content "cat", "dog", "banana"

If the text query is "cat dog", I would like Person A to be top with score 2 (matching "cat dog mouse" and "cat dog house"), Person B to be second with score 1 (matching "cat dog mouse"), and Person C not to be returned at all.

My current implementation fails to do this on two fronts. First, it returns Person C: Solr seems to munge the contents of the multiValued field into one, so the fact that "cat" and "dog" appear in separate Publication entries for Person C doesn't seem to matter.

Second, and strongly related to the first, the documents are scored with TF-IDF over the concatenation of all values in the Publication field. Persons A and B therefore end up with the same score, since "cat" and "dog" each appear the same number of times across all of their publications.

Is there any way to achieve what I am looking for? More generally, is there any way to score documents based on matching individual entries of a multiValued field instead of taking all the entries as a whole?

1 Answer

After a whole lot of googling, it would seem that for scoring and retrieval purposes, having multiple entries in a multiValued field is equivalent to having a single entry that is the concatenation of the values.

We have partially solved the problem for our particular case by creating an index of the authored documents themselves, then searching over that index and faceting on the author. This yields a list of authors ordered by the number of relevant documents they have authored. The solution is by no means perfect and has a number of problems, such as not knowing the total number of results available (since you cannot count the number of entries for a facet) and not being able to perform more sophisticated filtering on the authors.
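For anyone wanting to try the same workaround, the following is a minimal sketch of what a per-publication index could look like, not a tested configuration. The field names `PublicationID`, `AuthorID`, and `Text`, the handler name `/authors`, and the `string` field type definition are hypothetical names chosen for illustration; `TField` is the field type from the question. It also assumes a Solr version that provides `solr.SearchHandler` and the `edismax` parser. The parameters `mm`, `facet`, `facet.field`, `facet.sort`, `facet.mincount`, and `facet.limit` are standard Solr parameters.

    <!-- schema.xml (sketch): one Solr document per publication, not per person -->
    <fieldType name="string" class="solr.StrField" />
    <!-- Hypothetical field names; adapt them to your own schema -->
    <field name="PublicationID" type="string" indexed="true" stored="true" required="true" />
    <field name="AuthorID" type="string" indexed="true" stored="true" />
    <field name="Text" type="TField" indexed="true" stored="true" />

    <!-- solrconfig.xml (sketch): search publications, return author facet counts -->
    <requestHandler name="/authors" class="solr.SearchHandler">
        <lst name="defaults">
          <str name="defType">edismax</str>
          <str name="qf">Text</str>
          <!-- Require every query term to match within a single publication -->
          <str name="mm">100%</str>
          <!-- The publication hits themselves are not needed, only the facet counts -->
          <str name="rows">0</str>
          <str name="facet">true</str>
          <str name="facet.field">AuthorID</str>
          <!-- Order authors by number of matching publications, most first -->
          <str name="facet.sort">count</str>
          <!-- Drop authors with no matching publication -->
          <str name="facet.mincount">1</str>
          <str name="facet.limit">10</str>
        </lst>
    </requestHandler>

Under these assumptions, a query for `cat dog` against the example data in the question should give Person A a facet count of 2 and Person B a count of 1, with Person C left out because none of their publications contains both terms.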

Thought I'd share my dead end.