I have to build an application that searches within PDF, DOC, DOCX, and similar files. I would like to use Solr to index the entire directory that contains my files and then search for words inside the documents.

From what I've read online, the fastest way is to use DIH (the DataImportHandler). I've set it up this way:

solrConfig.xml

 <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>

      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

solr-data-config.xml

<dataConfig>  
    <dataSource type="BinFileDataSource" name="bin"/>
        <document>
            <entity name="sd" 
                    processor="FileListEntityProcessor"
                    baseDir="C:\Solr\solr-5.0.0\docs\myFolder\" 
                    fileName=".*\.(doc)|(pdf)|(docx)"
                    recursive="true"
                    rootEntity="false"
                    transformer="DateFormatTransformer">

                    <entity name="tika-test" processor="TikaEntityProcessor" url="${sd.fileAbsolutePath}"
                            format="text">
                            <field column="text" name="text"/>
                    </entity>

                    <field column="fileSize" name="size" />
                    <field column="file" name="filename" />

            </entity>
        </document> 
</dataConfig>  

When I click "Execute" on the DataImport page of the web admin UI, I get:

Indexing completed. Added/Updated: 1 documents. Deleted 0 documents. (Duration: 03s)
Requests: 0 (0/s), Fetched: 329 (110/s), Skipped: 0, Processed: 1 

I have many DOC, PDF, and DOCX files in this folder (329, apparently), but only the first one has been processed, and when I run a query I get only the filename, with no content.

"response": {
    "numFound": 1,
    "start": 0,
    "docs": [
      {
        "fileName": "first_doc.doc",
        "id": "4a06f6de-870d-4db9-875d-cd8dbd17309d"
      }
    ]
  }

Where am I going wrong?

1 Answer

I assume you are using Apache Solr 5.0. I ran into the same problem.

It seems to be related to an issue that was fixed recently:

https://issues.apache.org/jira/browse/SOLR-7174

If you check out the trunk version of Solr and use it, you will see that this issue is fixed. Here are a few links in case you want to test this yourself.

svn.apache.org/repos/asf/lucene/dev/trunk/ -> Link for Solr checkout
https://wiki.apache.org/solr/HowToCompileSolr -> How to compile Solr and use it with your current install

Alternatively, you can wait for the next Solr release, which should include the fix.
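
Whichever route you take, one way to check whether the import now covers every file is to compare the number of matching files on disk with the numFound value Solr reports. The following is only a rough sketch under a few assumptions: the core name `mycore` and the host/port are hypothetical placeholders, and the grouped file pattern is my reading of what the fileName filter in the question was meant to match, not something taken from your config.

```python
import json
import os
import re
import urllib.request

# Grouped form of the extension filter. This is an assumption about the
# intent of the fileName pattern in the question, not a copy of it.
FILE_PATTERN = re.compile(r".*\.(doc|pdf|docx)$")


def count_matching_files(base_dir):
    """Count files under base_dir (recursively) whose names match FILE_PATTERN."""
    total = 0
    for _root, _dirs, files in os.walk(base_dir):
        total += sum(1 for name in files if FILE_PATTERN.match(name))
    return total


def num_found(response_text):
    """Extract response.numFound from the text of a Solr JSON select response."""
    return json.loads(response_text)["response"]["numFound"]


def indexed_count(select_url):
    """Ask Solr how many documents are indexed (rows=0 returns no docs)."""
    with urllib.request.urlopen(select_url + "?q=*:*&rows=0&wt=json") as resp:
        return num_found(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Hypothetical core name and host; adjust both to your install.
    url = "http://localhost:8983/solr/mycore/select"
    on_disk = count_matching_files(r"C:\Solr\solr-5.0.0\docs\myFolder")
    print("files on disk:", on_disk, "indexed:", indexed_count(url))
```

If the two numbers still differ after a full import, the import is probably still stopping early, or some files are failing during extraction.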