0
votes

I am using the Apache Tika command-line tool to extract text from .doc and .docx files. I can get the whole text, but I am unable to get it split into pages so that I can store each page separately. Is there any way to achieve that?

1
Are you aware that the Microsoft Word file format is run-based and not page-based? - Gagravarr

1 Answer

1
votes

Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).

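A Word file stores a flow of paragraphs and runs, not pages; where each page ends is only decided when the document is laid out for display or printing. Since POI therefore (fundamentally) cannot read page boundaries out of the file, and Tika is not meant to be a document renderer either, the answer is simply: no, this is not possible.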

For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.
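
To illustrate the run-based structure, here is a minimal Python sketch using only the standard library. It is not Tika or POI code, and the function names `read_document_xml` and `split_on_explicit_breaks` are hypothetical. It assumes a .docx whose main part is at the conventional path `word/document.xml`, and it splits the text only at explicit page breaks (`<w:br w:type="page"/>`) that the author inserted. Page ends caused by text simply running past the bottom of a page are not stored as breaks, so this is not real pagination, and the old binary .doc format is not covered at all.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by .docx body markup.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_document_xml(docx_path):
    """Return the raw XML of the main document part.

    Assumption: the main part lives at the conventional path
    word/document.xml inside the .docx zip container.
    """
    with zipfile.ZipFile(docx_path) as zf:
        return zf.read("word/document.xml")


def split_on_explicit_breaks(document_xml):
    """Split body text at explicit page breaks only.

    Text is collected from <w:t> elements in document order. A new chunk
    starts at each <w:br w:type="page"/>. Layout-dependent page ends are
    not in the file, so they cannot be detected here.
    """
    chunks, current = [], []
    for elem in ET.fromstring(document_xml).iter():
        if elem.tag == W + "p" and current:
            # Separate paragraphs within a chunk with a newline.
            current.append("\n")
        elif elem.tag == W + "t" and elem.text:
            current.append(elem.text)
        elif elem.tag == W + "br" and elem.get(W + "type") == "page":
            chunks.append("".join(current))
            current = []
    chunks.append("".join(current))
    return [chunk.strip() for chunk in chunks]


# Tiny hand-written sample body with one explicit page break.
SAMPLE = (
    '<w:document xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    "<w:body>"
    "<w:p><w:r><w:t>First part</w:t></w:r></w:p>"
    '<w:p><w:r><w:br w:type="page"/><w:t>Second part</w:t></w:r></w:p>'
    "</w:body></w:document>"
)

if __name__ == "__main__":
    print(split_on_explicit_breaks(SAMPLE))
```

For a real file, `split_on_explicit_breaks(read_document_xml("some.docx"))` returns a single chunk unless the author used explicit page breaks, which is exactly the limitation described above.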