Extract text from .tex files using Tika

Question

How do I extract text from a .tex file using Apache Tika? An example file is at http://www.tug.org/texshowcase/EulerGibbsDuhem.tex

Tika correctly detects the content type as application/x-tex, but it does not extract any text from the file.

I tried the command

java -jar tika-app-0.9.jar -t EulerGibbsDuhem.tex

and also the following code snippet:

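import java.io.File;
import org.apache.tika.Tika;

File file = new File(fileName);
Tika tika = new Tika();
String mimeType = tika.detect(file);            // detected as application/x-tex
String pageContent = tika.parseToString(file);  // no text is extracted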

Gagravarr · Accepted Answer · 2011-03-31T22:01:09

Tika supports detecting the .tex file extension, but there isn't a parser for it yet, sorry.

If you can find a good Java library (ideally Apache Licensed) for parsing .tex files, then I'd suggest you open a new enhancement request in the Tika JIRA (https://issues.apache.org/jira/browse/TIKA) and request a TeX parser based on that library.
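Until such a parser exists, one possible stopgap is to treat the .tex file as plain text and strip the markup yourself. The following is only a rough sketch using the Java standard library, not anything Tika provides: the class name TexTextSketch and the method stripTex are made up for illustration, and the regular expressions are a crude approximation that will not cope with macros, verbatim environments, or math in any meaningful way.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: crude regex-based removal of TeX markup.
class TexTextSketch {
    // A '%' that is not escaped as '\%' starts a comment running to the end of the line.
    private static final Pattern COMMENT = Pattern.compile("(?m)(?<!\\\\)%.*$");
    // Control words such as \section or \textbf, with an optional star and [options].
    private static final Pattern COMMAND =
            Pattern.compile("\\\\[a-zA-Z]+\\*?(\\[[^\\]]*\\])?");
    // Grouping braces and math-mode dollar signs.
    private static final Pattern BRACES_AND_MATH = Pattern.compile("[{}$]");

    static String stripTex(String tex) {
        String s = COMMENT.matcher(tex).replaceAll("");
        s = COMMAND.matcher(s).replaceAll(" ");
        s = BRACES_AND_MATH.matcher(s).replaceAll("");
        // Collapse runs of spaces and tabs left behind by the removals.
        return s.replaceAll("[ \\t]+", " ").trim();
    }

    public static void main(String[] args) {
        String sample = "\\section{Intro} % a comment\nHello \\textbf{world}.";
        System.out.println(stripTex(sample));
    }
}
```

The arguments of commands are kept on purpose, so the text inside \section{...} or \textbf{...} survives while the command names are dropped. For anything beyond quick indexing, a real TeX parsing library is still the better route.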
