I'm trying to extract text from a .docx file. tika-app handles it well, but when I try to do the same thing in my own code I get no text back, and the Tika parser reports the content type of my .docx file as "application/zip".
What can I do? Should I use a recursive approach (like this), or is there another way?
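For context, a .docx file is an OOXML package stored inside an ordinary ZIP container, which is probably why a detector that only sees the leading bytes reports "application/zip". The sketch below uses only the JDK to show that layering. It is not how Tika performs detection; the class name DocxContainerCheck and the helpers looksLikeDocx and zipOf are hypothetical, and checking for word/document.xml assumes the conventional location of the main document part.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper class, for illustration only.
class DocxContainerCheck {

    // Returns true if the bytes form a ZIP archive that contains the two
    // entries a typical .docx package has (assumed conventional names).
    public static boolean looksLikeDocx(byte[] data) {
        boolean hasContentTypes = false;
        boolean hasWordDocument = false;
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(data))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                String name = entry.getName();
                if (name.equals("[Content_Types].xml")) {
                    hasContentTypes = true;
                } else if (name.equals("word/document.xml")) {
                    hasWordDocument = true;
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return hasContentTypes && hasWordDocument;
    }

    // Builds a small in-memory ZIP with empty entries of the given names,
    // so the example runs without a real .docx file.
    public static byte[] zipOf(String... entryNames) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ZipOutputStream zip = new ZipOutputStream(bytes)) {
            for (String name : entryNames) {
                zip.putNextEntry(new ZipEntry(name));
                zip.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] docxLike = zipOf("[Content_Types].xml", "word/document.xml");
        byte[] plainZip = zipOf("readme.txt");
        System.out.println("docx-like archive: " + looksLikeDocx(docxLike));
        System.out.println("plain archive: " + looksLikeDocx(plainZip));
    }
}
```

Both archives start with the same ZIP signature, so only the entries inside (or a hint such as the file name) can tell them apart.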
UPDATE: The file content-type is now correctly detected if I add the filename to the metadata:
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// The file name gives the detector a hint beyond the ZIP magic bytes
metadata.set(Metadata.RESOURCE_NAME_KEY, myFileFilename);
ParseContext context = new ParseContext();
context.set(Parser.class, parser);
// try-with-resources closes the stream even if parse() throws
try (InputStream is = new FileInputStream(myFile)) {
    parser.parse(is, handler, metadata, context);
}
However, parse() now fails with this error:
java.lang.NoClassDefFoundError: org/apache/poi/openxml4j/exceptions/InvalidFormatException
    at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)
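A NoClassDefFoundError like this usually means a class that was present at compile time is missing from the runtime classpath. Here the missing class belongs to Apache POI's OOXML module (the poi-ooxml jar), which tika-app bundles but a hand-assembled classpath may lack. A minimal sketch for checking this from inside the application follows; the class ClasspathCheck and the method isOnClasspath are hypothetical names, not part of Tika or POI.

```java
// Hypothetical diagnostic helper, for illustration only.
class ClasspathCheck {

    // Returns true if the named class can be located by this class's
    // class loader. The class is looked up without being initialized.
    public static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // The class named in the error message above.
        String poiClass = "org.apache.poi.openxml4j.exceptions.InvalidFormatException";
        System.out.println(poiClass + " available: " + isOnClasspath(poiClass));
    }
}
```

If this reports the class as unavailable, adding the POI OOXML jar and its own dependencies to the runtime classpath (or depending on the full Tika parsers artifact through a build tool that resolves transitive dependencies) is the likely fix.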