Can you use ExtractingRequestHandler and Tika with any of the compressed file formats (zip, tar, gz, etc) to extract the content out for indexing?

I am sending Solr the archived.tar file using curl:

curl "http://localhost:8983/solr/update/extract?literal.id=doc1&fmap.content=body_texts&commit=true" -H 'Content-type:application/octet-stream' --data-binary "@/home/archived.tar"

When I query the document, the file names inside the archive are indexed as "body_texts", but the content of those files is not extracted or included. This is not the behavior I expected (ref: http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Content-Extraction-Tika#article.tika.example). When I send one of the actual documents inside the archive using the same curl command, the extracted content is stored in the "body_texts" field. Am I missing a step for the compressed files?

I have added all the extraction dependencies as indicated by mat in http://outoftime.lighthouseapp.com/projects/20339/tickets/98-solr-cell and am able to successfully extract data from MS Word, PDF, HTML documents.

I'm using the following library versions: Solr 1.4.0, Solr Cell 1.4.1, and Tika Core 0.4.

Given everything I have read, this version of Tika should support extracting data from all files within a compressed file. Any help or suggestions would be appreciated.

1 Answer

The short answer: Solr Cell 1.4.1 and Tika Core 0.6.

The long answer: After a lot of headaches I was able to get this working. I'll answer it both for people using Solr directly and for people using Solr with the Ruby library Sunspot (which was my problem).

Here is what I did: I used this https://github.com/tomasc/sunspot_cell plugin to extend Sunspot and give it the attachment feature. (Ignore this step if you're not using Ruby/Sunspot.)

Out of the box, v1.4.1 works for individual files but not for compressed files, so I had to explore a bit. I downloaded the v1.4.1 codebase from http://lucene.apache.org/solr/ and grabbed dist/apache-solr-cell-1.4.1.jar. Then I pulled down the Tika libraries from the 1.5 branch: http://svn.apache.org/viewvc/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/

You can download each library individually, or you can use svn to check out the whole branch:

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev

Or just check out the library folder:

svn co http://svn.apache.org/repos/asf/lucene/solr/branches/branch-1.5-dev/contrib/extraction/lib/
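As a stopgap when swapping the libraries is not an option, the single-file case that already works in the question can be scripted: unpack the archive and post each member as its own document. This is only a rough sketch, not something from my setup. The function name index_archive_members, the "<prefix>-<n>" id scheme, the SOLR_URL default and the DRY_RUN switch are all assumptions made for illustration; the curl options are the ones from the question.

```shell
#!/bin/sh
# Sketch: index each file inside a tar archive as its own Solr document.
# Assumptions (not from the original posts): the SOLR_URL default, the
# "<prefix>-<n>" id scheme, and the DRY_RUN switch are illustrative only.

SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/update/extract}"

index_archive_members() {
    archive="$1"   # path to the .tar file
    prefix="$2"    # id prefix; must be URL-safe, it is not encoded here
    workdir=$(mktemp -d) || return 1

    # Unpack so that each member can be posted on its own.
    tar -xf "$archive" -C "$workdir" || { rm -rf "$workdir"; return 1; }

    # Sorted list of regular files, kept outside workdir so it is not listed.
    find "$workdir" -type f | sort > "$workdir.list"

    n=0
    while IFS= read -r member; do
        n=$((n + 1))
        url="$SOLR_URL?literal.id=$prefix-$n&fmap.content=body_texts&commit=true"
        if [ -n "${DRY_RUN:-}" ]; then
            # Print the request instead of sending it.
            echo "POST $url ${member#"$workdir"/}"
        else
            curl "$url" -H 'Content-type:application/octet-stream' \
                 --data-binary "@$member"
        fi
    done < "$workdir.list"

    rm -rf "$workdir" "$workdir.list"
}
```

Running it as DRY_RUN=1 index_archive_members /home/archived.tar doc prints one request line per file without contacting Solr, which is a cheap way to check the ids before indexing. Each member ends up as a separate document, so this does not give you one document per archive the way the library upgrade does.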