I have to build an application where I need to search among PDF, DOC, DOCX, etc. files. I would like to use Solr to index the entire directory that contains all my files and then search for words inside the documents.
Looking around on the net, I've seen that the fastest way is to use the DataImportHandler (DIH). I've set it up this way:
solrconfig.xml
<requestHandler name="/update/extract"
startup="lazy"
class="solr.extraction.ExtractingRequestHandler" >
<lst name="defaults">
<str name="lowernames">true</str>
<str name="uprefix">ignored_</str>
<!-- capture link hrefs but ignore div attributes -->
<str name="captureAttr">true</str>
<str name="fmap.a">links</str>
<str name="fmap.div">ignored_</str>
</lst>
</requestHandler>
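The DataImportHandler itself is registered in solrconfig.xml in the usual way, pointing at the data config file below. This is just a minimal sketch: I'm assuming solr-data-config.xml sits in the core's conf directory and the solr-dataimporthandler jars are already loaded via <lib> directives.

<requestHandler name="/dataimport"
                class="org.apache.solr.handler.dataimport.DataImportHandler">
  <lst name="defaults">
    <str name="config">solr-data-config.xml</str>
  </lst>
</requestHandler>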
solr-data-config.xml
<dataConfig>
  <dataSource type="BinFileDataSource" name="bin"/>
  <document>
    <entity name="sd"
            processor="FileListEntityProcessor"
            baseDir="C:\Solr\solr-5.0.0\docs\myFolder\"
            fileName=".*\.(doc)|(pdf)|(docx)"
            recursive="true"
            rootEntity="false"
            transformer="DateFormatTransformer">
      <entity name="tika-test"
              processor="TikaEntityProcessor"
              url="${sd.fileAbsolutePath}"
              format="text">
        <field column="text" name="text"/>
      </entity>
      <field column="fileSize" name="size"/>
      <field column="file" name="filename"/>
    </entity>
  </document>
</dataConfig>
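For reference, the data config above expects fields roughly like these on the schema.xml side (the field names come from the mappings; the types shown here are only indicative, not necessarily what my actual schema declares):

<field name="filename" type="string"       indexed="true" stored="true"/>
<field name="size"     type="long"         indexed="true" stored="true"/>
<field name="text"     type="text_general" indexed="true" stored="true"/>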
When i launch "Execute" from DataImport (Web Admin Page) i get:
Indexing completed. Added/Updated: 1 documents. Deleted 0 documents. (Duration: 03s)
Requests: 0 (0/s), Fetched: 329 (110/s), Skipped: 0, Processed: 1
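As far as I understand, the "Execute" button simply issues a full-import request to the DIH handler, i.e. something like the URL below (host and core name are placeholders):

http://localhost:8983/solr/mycore/dataimport?command=full-import&clean=true&commit=true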
I have many doc, pdf and docx files inside this folder (apparently 329, going by the "Fetched" count), but only the first one has been processed, and when I run a query I only get the filename back, not the content:
"response": {
"numFound": 1,
"start": 0,
"docs": [
{
"fileName": "first_doc.doc",
"id": "4a06f6de-870d-4db9-875d-cd8dbd17309d"
}
]
}
Where am I going wrong?