How to extract text from docx with Tika

Question

I'm trying to extract text from a docx file. tika-app does it well, but when I try to do the same thing in my code I get no output, and the Tika parser reports the content type of my docx file as "application/zip".

What should I do? Should I use a recursive approach (like this), or is there another way?

UPDATE: The file content-type is now correctly detected if I add the filename to the metadata:

InputStream is = new FileInputStream(myFile);
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
// The file name hint lets the detector tell a docx from a plain zip
metadata.set(Metadata.RESOURCE_NAME_KEY, myFileFilename);
ParseContext context = new ParseContext();
// Register the parser so embedded documents are parsed as well
context.set(Parser.class, parser);
parser.parse(is, handler, metadata, context);

However, parse() now throws this error:

java.lang.NoClassDefFoundError: org/apache/poi/openxml4j/exceptions/InvalidFormatException
	at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:82)

Sounds like you're probably missing some Tika jars and/or their dependencies. How did you install/package Tika in your app? — Gagravarr
Hi @Gagravarr. Following the instructions on the site (tika.apache.org/1.9/gettingstarted.html), I built the artifacts (tika-core, tika-parsers) with Maven and then imported them into my project. Then I use the AutoDetectParser to check the content type of my file and parse it. It works for other types like ods, odt, txt... but not with docx — Kue
Why are you building the Tika jars? You should probably just follow the "Using Tika as a Maven dependency" section. Secondly, how are you copying the dependencies into your project? — Gagravarr
Ok, thank you. Can I just download the jars from the Maven repository (mvnrepository.com/artifact/org.apache.tika/tika-parsers/1.9) and move them into my lib folder? It doesn't seem to work. — Kue
No, you need to have all the dependencies too, as defined in the pom files. Probably best to use Maven and have it do it all for you — Gagravarr

Zaven Zareyan · Accepted Answer · 2016-04-16T10:16:21

For me, the most confusing thing about Apache Tika is that code using it compiles without tika-parsers.jar, but it obviously can't work without it at runtime. So make sure that tika-parsers.jar is installed together with all of its dependencies (there are many).
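
One way to confirm that the problem is a missing jar rather than the parsing code is to check at runtime whether the classes named in the stack trace can be loaded. The following is only a minimal sketch: the class name ClasspathCheck and the helper isOnClasspath are hypothetical names made up for illustration, and the two class names checked in main are taken from the error above. It uses only the standard Class.forName call, so it runs with or without Tika on the classpath.

```java
public class ClasspathCheck {

    // Hypothetical helper: returns true if the named class can be loaded
    // by the class loader that loaded this class, false otherwise.
    static boolean isOnClasspath(String className) {
        try {
            // 'false' means: locate and load the class, but do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError, raised when a class
            // is found but one of its own dependencies is missing
            return false;
        }
    }

    public static void main(String[] args) {
        // Class names taken from the stack trace in the question
        String[] names = {
            "org.apache.tika.parser.microsoft.ooxml.OOXMLParser",
            "org.apache.poi.openxml4j.exceptions.InvalidFormatException"
        };
        for (String name : names) {
            System.out.println(name + " available: " + isOnClasspath(name));
        }
    }
}
```

If the first class is reported as available and the second is not, the Tika parsers jar is present but its Apache POI OOXML dependency is not, which matches the NoClassDefFoundError in the question. Letting Maven resolve tika-parsers as a dependency, as suggested in the comments, should pull that jar in transitively.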
