Issue parsing an iWork document with Apache Tika

Question

I am trying to parse an iWork document with Apache Tika, but instead of the document's actual content, the content handler returns something else. The code I used and the output I got are below.

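    import java.io.File;
    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import org.apache.tika.exception.TikaException;
    import org.apache.tika.metadata.Metadata;
    import org.apache.tika.parser.AutoDetectParser;
    import org.apache.tika.parser.ParseContext;
    import org.apache.tika.parser.Parser;
    import org.apache.tika.sax.BodyContentHandler;
    import org.xml.sax.SAXException;

    // Called with new File("/home/user/tika/samples/budget.numbers")
    private void parseFile(File file) {
        // try-with-resources closes the stream even if parsing fails
        try (InputStream inputStream = new FileInputStream(file)) {
            ParseContext context = new ParseContext();
            // -1 disables the default write limit on extracted text
            BodyContentHandler bodyHandler = new BodyContentHandler(-1);
            Parser parser = new AutoDetectParser();
            parser.parse(inputStream, bodyHandler, new Metadata(), context);
            System.out.println("Contents of the file :" + bodyHandler.toString());
        } catch (IOException | SAXException | TikaException e) {
            e.printStackTrace();
        }
    }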

Output:

Contents of the file :
Index/Document.iwa
Index/ViewState.iwa
Index/CalculationEngine.iwa
Index/Tables/HeaderStorageBucket-2.iwa
Index/Tables/Tile.iwa
Index/Metadata.iwa
Metadata/Properties.plist

I can detect the file type correctly using the Detector API, but I am not getting any useful content out of the document. Please help!

Tim Allison · Accepted Answer · 2016-05-02T13:10:33

Tika should be able to parse Numbers docs. If you're able to share the document, please post it to our Jira. As I look at the parser, we could handle namespaces a bit more robustly, and that could be the problem, but I can't tell without the doc.
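
If the document itself can't be shared, a description of what is inside the package may still help with a Jira report. The output in the question looks like a list of entry names, which suggests the .numbers file is a zip container. The sketch below uses only `java.util.zip` (it is not part of Tika) to list those entries and flag two things: whether the package contains `.iwa` entries like the ones shown above, and whether it contains an XML index, which is where namespace handling would come into play. The class name `IWorkPackageProbe`, its method names, and the assumption that an XML index would be called `index.xml` or `index.apxl` are illustrative, not taken from the Tika source.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Tika: inspects a zip-based iWork package.
class IWorkPackageProbe {

    // Returns the names of all entries in the zip container.
    static List<String> listEntries(Path file) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(file.toFile())) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    // True if any entry ends in ".iwa", as in the output from the question.
    static boolean hasIwaEntries(List<String> names) {
        for (String name : names) {
            if (name.endsWith(".iwa")) {
                return true;
            }
        }
        return false;
    }

    // Assumption: an XML-based package keeps its content in a top-level
    // index.xml or index.apxl entry, possibly gzipped.
    static boolean hasXmlIndex(List<String> names) {
        for (String name : names) {
            if (name.equals("index.xml") || name.equals("index.xml.gz")
                    || name.equals("index.apxl") || name.equals("index.apxl.gz")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java IWorkPackageProbe <file>");
            return;
        }
        List<String> names = listEntries(Paths.get(args[0]));
        for (String name : names) {
            System.out.println(name);
        }
        System.out.println("has .iwa entries: " + hasIwaEntries(names));
        System.out.println("has XML index:    " + hasXmlIndex(names));
    }
}
```

Running it against the file from the question (for example `java IWorkPackageProbe /home/user/tika/samples/budget.numbers`) prints the entry list plus the two flags, which can be pasted into the issue in place of the document.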
