To index my website, I have a Ruby script that in turn generates a shell script that uploads every file in my document root to Solr. The shell script has many lines that look like this:

  curl -s \
 "http://localhost:8983/solr/update/extract?literal.id=/about/core-team/&commit=false" \
 -F "myfile=@/extra/www/docroot/about/core-team/index.html"

...and ends with:

curl -s http://localhost:8983/solr/update --data-binary \
'<commit/>' -H 'Content-type:text/xml; charset=utf-8'

This uploads all documents in my document root to Solr. I use Tika and the ExtractingRequestHandler to upload documents in various formats (primarily PDF and HTML).
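For illustration, the generator step could look roughly like the Ruby sketch below. The helper names (`solr_id_for`, `extract_command`) are hypothetical, and the rule for deriving the id (strip the docroot and a trailing `index.html`) is an assumption inferred from the example paths above. The id is not URL-encoded here, so paths with spaces or special characters would need extra handling.

```ruby
require "shellwords"

# Endpoint taken from the curl examples above.
SOLR_EXTRACT_URL = "http://localhost:8983/solr/update/extract"

# Assumed id rule: drop the docroot prefix and a trailing "index.html".
def solr_id_for(path, docroot)
  path.delete_prefix(docroot).sub(/index\.html\z/, "")
end

# Build one shell line that uploads a single file to the extract handler.
# Shellwords.escape keeps "&" and "?" in the URL from being interpreted by the shell.
def extract_command(path, docroot)
  url = "#{SOLR_EXTRACT_URL}?literal.id=#{solr_id_for(path, docroot)}&commit=false"
  form = "myfile=@#{path}"
  "curl -s #{Shellwords.escape(url)} -F #{Shellwords.escape(form)}"
end

puts extract_command("/extra/www/docroot/about/core-team/index.html", "/extra/www/docroot")
```

The generated line is escaped with backslashes rather than wrapped in double quotes, but the shell parses it into the same arguments as the handwritten version.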

In the script that generates this shell script, I would like to boost certain documents based on whether their id field (which is also the URL) matches certain regular expressions.

Let's say that these are the boosting rules (pseudocode):

boost = 2 if url =~ /cool/
boost = 3 if url =~ /verycool/
# otherwise we do not specify a boost
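
In Ruby, those rules might be sketched as a small lookup. The names `BOOST_RULES` and `boost_for` are made up for this example. The first matching pattern wins, so `/verycool/` has to be listed before `/cool/` (every "verycool" URL also matches `/cool/`).

```ruby
# Ordered list of [pattern, boost] pairs; the first match wins,
# so the more specific pattern must come first.
BOOST_RULES = [
  [/verycool/, 3],
  [/cool/, 2],
].freeze

# Returns the boost for a URL, or nil when no rule matches
# (meaning no boost should be specified at all).
def boost_for(url)
  rule = BOOST_RULES.find { |pattern, _boost| url =~ pattern }
  rule && rule[1]
end

p boost_for("/verycool/core-team/")
p boost_for("/cool/stuff/")
p boost_for("/about/core-team/")
```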

What's the simplest way to add that index-time boost to my http request?

I tried:

curl -s \
 "http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
 -F "myfile=@/extra/www/docroot/verycool/core-team/index.html" \
 -F boost=3

and:

curl -s \
 "http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
 -F "myfile=@/extra/www/docroot/verycool/core-team/index.html" \
 -F boost.id=3

Neither made a difference in the ordering of search results. What I want is for the boosted results to come first in search results, regardless of what the user searched for (provided of course that the document contains their query).

I understand that if I POST in XML format I can specify the boost value for either the entire document or a specific field. But if I do that, it isn't clear how to specify a file as the document contents. Actually, the Tika page provides a partial example:

curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" \
--data-binary @tutorial.html -H 'Content-type:text/html'

But again it isn't clear where/how to specify my boost. I tried:

curl \
 "http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost=3" \
 --data-binary @mydoc.html -H 'Content-type:text/html'

and

curl \
 "http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost.id=3" \
 --data-binary @mydoc.html -H 'Content-type:text/html'

Neither of which altered search results.

Is there a way to update just the boost attribute of a document (not a specific field) without altering the document contents? If so, I could accomplish my goal in two steps: 1) upload/index each document as I have been doing, then 2) specify the boost for certain documents.

1 Answer

To index a document in Solr, you have to POST it to the /update handler. The documents to index are put in the body of the POST request. In general, you have to use Solr's XML format. Using that XML, you can add a boost value to a specific field or to a whole document.
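
As a rough illustration, a Ruby generator could build such an XML body like this. It assumes a Solr version that still accepts index-time boosts (newer releases may ignore or reject the attribute), a schema with `id` and `text` fields, and that you already have the document text in hand rather than letting Tika extract it. The helper name `solr_add_xml` is made up for this sketch.

```ruby
# Build an <add> message for the /update handler with an optional
# document-level boost. String#encode(xml: ...) escapes XML special characters;
# the :attr form also wraps the value in double quotes.
def solr_add_xml(id, text, boost = nil)
  boost_attr = boost ? " boost=#{boost.to_s.encode(xml: :attr)}" : ""
  fields = { "id" => id, "text" => text }.map do |name, value|
    "<field name=#{name.encode(xml: :attr)}>#{value.encode(xml: :text)}</field>"
  end
  "<add><doc#{boost_attr}>#{fields.join}</doc></add>"
end

puts solr_add_xml("/verycool/core-team/", "Core team & friends", 3)
```

The resulting string can be sent the same way as the `<commit/>` message in the question, with `--data-binary` and a `text/xml` content type. To boost a single field instead, the same `boost` attribute goes on the `<field>` element.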