2
votes

I installed Solr 6.5.1 on my Debian server today, but I am having trouble getting the PDF text content. Searching works: the document is found when I query for, say, my name, "juan". However, the highlighted content does not appear with each str result as it is supposed to.

This is the example query:

http://localhost:8983/solr/ex/select?q=juan&fl=title&wt=xml&hl=true&hl.snippets=20&hl.fl=content&hl.usePhraseHighlighter=true

And this is the result:

<response>
    <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
        <lst name="params">
            <str name="hl.snippets">20</str>
            <str name="q">juan</str>
            <str name="hl">true</str>
            <str name="fl">title</str>
            <str name="hl.usePhraseHighlighter">true</str>
            <str name="hl.fl">content</str>
            <str name="wt">xml</str>
        </lst>
    </lst>
    <result name="response" numFound="1" start="0">
        <doc>
            <arr name="title">
                <str>CV_Juan_Jara_ultimo</str>
            </arr>
        </doc>
    </result>
    <lst name="highlighting">
        <lst name="/solr-6.5.1/mydocs/CV_Juan_Jara_ultimo.pdf"/>
    </lst>
</response>

Additionally, the log shows all the PDF text, so I assume it was indexed correctly (I indexed the PDF with the command: bin/post -c ex mydocs/CV_Juan_Jara_ultimo.pdf).

I added the "content" field to the schema, using curl:

curl -X POST -H 'Content-type:application/json' --data-binary '{
  "add-field" : {
     "name":"text",
     "type":"text_general",
     "indexed":"true",
     "stored":"false",
     "multiValued":"true"
     }
}' localhost:8983/solr/ex/schema

Do you know what could be wrong?

All I want to do is search for a topic in my PDF and then get all the results highlighted, like this:

http://www.codewrecks.com/blog/index.php/2013/05/27/hilight-matched-text-inside-documents-indexed-with-solr-plus-tika/

2 Answers

1
votes

It is a very common and simple mistake:

"stored":"false" should be "stored":"true" for the 'content' field.

Currently, all the highlighters require a field to be stored before it can be highlighted [1].

[1] https://cwiki.apache.org/confluence/display/solr/Highlighting
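To confirm whether a field is actually stored before changing anything, its definition can be read back from the Schema API. The following is only a sketch: the core name "ex" and the host come from the question, the helper names (schema_field_url, is_stored, check_field) are made up for illustration, and the grep-based check assumes the response reports the property as "stored":true or "stored":"true".

```shell
#!/bin/sh
# Assumed values taken from the question: core "ex" on localhost:8983.
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/ex}"

# Build the Schema API URL for one field. showDefaults=true asks Solr to
# include properties inherited from the field type as well.
schema_field_url() {
  printf '%s/schema/fields/%s?showDefaults=true\n' "$SOLR_CORE_URL" "$1"
}

# Read a Schema API JSON response on stdin and succeed only if the
# field is reported as stored.
is_stored() {
  grep -Eq '"stored"[[:space:]]*:[[:space:]]*"?true'
}

# Hypothetical helper: fetch the field definition and report the result.
check_field() {
  if curl -s "$(schema_field_url "$1")" | is_stored; then
    echo "$1: stored, can be highlighted"
  else
    echo "$1: not stored (or not found), highlighting will come back empty"
  fi
}
```

After sourcing the script, running check_field content (or check_field _text_) against the running core shows which of the two fields, if any, can produce highlight snippets.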

1
votes

SOLVED: the solution that finally worked for me was to replace the _text_ field in the schema with this curl command:

curl -X POST -H 'Content-type:application/json' --data-binary '{
 "replace-field" : {
 "name":"_text_",
 "type":"text_general",
 "indexed":"true",
 "stored":"true",
 "multiValued":"true"
 }
}' http://localhost:8983/solr/ex/schema

This is because the _text_ field comes with "stored":"false" by default.

NOTE: Remember to index all your files into the core again if you indexed them before replacing this schema field.
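As a rough sketch of that last step, the files can be posted again with the same bin/post command used in the question and the highlighting query repeated afterwards. The helper names (reindex_files, highlight_url) are hypothetical, and which field to pass as hl.fl depends on where your configuration puts the extracted PDF text, so treat the field argument as an assumption to check against your own schema.

```shell
#!/bin/sh
# Assumptions: core "ex" on localhost:8983 and the bin/post tool, as in the question.
SOLR_CORE_URL="${SOLR_CORE_URL:-http://localhost:8983/solr/ex}"

# Re-post the given files so they are indexed under the updated schema.
# Run this from the Solr install directory, where bin/post lives.
reindex_files() {
  bin/post -c ex "$@"
}

# Build a highlighting query URL for a search term and a stored field.
# The term is not URL-encoded here, so keep it to a single plain word.
highlight_url() {
  printf '%s/select?q=%s&fl=title&wt=xml&hl=true&hl.snippets=20&hl.fl=%s\n' \
    "$SOLR_CORE_URL" "$1" "$2"
}
```

For example, reindex_files mydocs/CV_Juan_Jara_ultimo.pdf followed by curl -s "$(highlight_url juan _text_)" should now return snippets inside the highlighting element, provided _text_ is the stored field that holds the PDF text.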