To index my website, I have a Ruby script that in turn generates a shell script that uploads every file in my document root to Solr. The shell script has many lines that look like this:
curl -s \
"http://localhost:8983/solr/update/extract?literal.id=/about/core-team/&commit=false" \
-F "myfile=@/extra/www/docroot/about/core-team/index.html"
...and ends with:
curl -s http://localhost:8983/solr/update --data-binary \
'<commit/>' -H 'Content-type:text/xml; charset=utf-8'
This uploads all documents in my document root to Solr. I use tika and ExtractingRequestHandler to upload documents in various formats (primarily PDF and HTML) to Solr.
In the script that generates this shell script, I would like to boost certain documents based on whether their id field (a/k/a url) matches certain regular expressions.
Let's say that these are the boosting rules (pseudocode):
boost = 2 if url =~ /cool/
boost = 3 if url =~ /verycool/
# otherwise we do not specify a boost
What's the simplest way to add that index-time boost to my http request?
I tried:
curl -s \
"http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
-F "myfile=@/extra/www/docroot/verycool/core-team/index.html" \
-F boost=3
and:
curl -s \
"http://localhost:8983/solr/update/extract?literal.id=/verycool/core-team/&commit=false" \
-F "myfile=@/extra/www/docroot/verycool/core-team/index.html" \
-F boost.id=3
Neither made a difference in the ordering of search results. What I want is for the boosted results to come first in search results, regardless of what the user searched for (provided of course that the document contains their query).
I understand that if I POST in XML format I can specify the boost value for either the entire document or a specific field. But If I do that, it isn't clear how to specify a file as the document contents. Actually, the tika page provides a partial example:
curl "http://localhost:8983/solr/update/extract?literal.id=doc5&defaultField=text" \
--data-binary @tutorial.html -H 'Content-type:text/html'
But again it isn't clear where/how to specify my boost. I tried:
curl \
"http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost=3"\
--data-binary @mydoc.html -H 'Content-type:text/html'
and
curl \
"http://localhost:8983/solr/update/extract?literal.id=mydocid&defaultField=text&boost.id=3"\
--data-binary @mydoc.html -H 'Content-type:text/html'
Neither of which altered search results.
Is there a way to update just the boost attribute of a document (not a specific field) without altering the document contents? If so, I could accomplish my goal in two steps: 1) Upload/index document as I have been doing 2) Specify boost for certain documents