I am running Solr 4.8.1 and indexing with the SimplePostTool (post.jar in the example\exampledocs directory).

I can successfully index xml, json, csv, pdf, doc, docx, ppt, pptx, xls, and xlsx files, but when I attempt to index other file types (.txt, .7z, .rar, .EAP, .sql, .zip, .avi) I am given the error:

"SimplePostTool: WARNING Solr returned an error #400 Bad Request SimplePostTool: WARNING: IOException while reading response: java.io.IOException: Server returned HTTP responsecode : 400 for URL: /"

Solr also tells me that it successfully indexed any text files I've included, but those "indexed" files don't show up in the browser I've set up for Solr, or in Solritas, the default Solr browser.

Is there a way to index files like the ones above in Solr? Even if the content can't be indexed for some (such as the .avi), can the metadata be indexed? If so, can it be done by editing the SimplePostTool, or do I need something else?

EDIT: Since writing this, I have found the very similar question "SOLR index and extract .sh and .sql files", which recommends editing the MIME map in SimplePostTool.java. However, I cannot find that portion of code anywhere in SimplePostTool.java. Where could I find this code? Is there an easier way to do this?

1 Answer

I would use the Solr ExtractingRequestHandler, otherwise known as Solr Cell: https://cwiki.apache.org/confluence/display/solr/Uploading+Data+with+Solr+Cell+using+Apache+Tika

From the documentation:

Solr uses code from the Apache Tika project to provide a framework for incorporating many different file-format parsers such as Apache PDFBox and Apache POI into Solr itself. Working with this framework, Solr's ExtractingRequestHandler can use Tika to support uploading binary files, including files in popular formats such as Word and PDF, for data extraction and indexing.

Solr Cell is part of the Apache Solr project and, through Tika, supports a wide variety of file formats, including video, audio, compressed files, and text files. An overview of the file types that Tika can load and parse can be found here: http://tika.apache.org/1.5/formats.html

And some more info on getting started using it: https://wiki.apache.org/solr/ExtractingRequestHandler
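
If you would rather call the handler directly than patch post.jar, here is a rough sketch in Python of posting a file to it. Treat it as a sketch under assumptions, not a tested recipe for your setup: it assumes the handler is mapped at /update/extract (as in the stock example solrconfig.xml), that your Solr base URL is something like http://localhost:8983/solr, and that your schema has a dynamic field matching the attr_ prefix for otherwise unknown metadata fields. The helper names build_extract_url and post_file are made up for illustration. The request parameters used (literal.id, uprefix, extractOnly, commit) are the ones described on the ExtractingRequestHandler wiki page.

```python
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_extract_url(base_url, doc_id, uprefix=None, extract_only=False, commit=True):
    """Build the URL for a Solr Cell request (hypothetical helper).

    base_url is assumed to look like http://localhost:8983/solr.
    """
    params = {"literal.id": doc_id}       # sets the unique key of the document
    if uprefix:
        params["uprefix"] = uprefix       # prefix for fields not defined in the schema
    if extract_only:
        params["extractOnly"] = "true"    # return what Tika extracts without indexing it
    if commit:
        params["commit"] = "true"         # make the document searchable right away
    return base_url.rstrip("/") + "/update/extract?" + urlencode(params)


def post_file(base_url, path, doc_id, content_type="application/octet-stream"):
    """Send the raw bytes of a file to the extract handler (hypothetical helper)."""
    with open(path, "rb") as fh:
        body = fh.read()
    req = Request(
        build_extract_url(base_url, doc_id, uprefix="attr_"),
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urlopen(req) as resp:
        return resp.status, resp.read().decode("utf-8", "replace")


if __name__ == "__main__":
    # Only prints the URL; call post_file(...) against a running Solr to index.
    print(build_extract_url("http://localhost:8983/solr", "doc1", uprefix="attr_"))
```

Calling build_extract_url with extract_only=True is a quick way to see what Tika can pull out of a given file (for an .avi that will mostly be metadata) before you decide whether it is worth indexing.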