Get text from doc/docx file in pages using Apache Tika

Question

I am using the Apache Tika command-line tool to extract text from .doc and .docx files. I can get the whole text, but I am unable to get it page by page so that I can store each page separately. Is there any way to achieve that?

Are you aware that the Microsoft Word file format is run-based and not page-based? — Gagravarr

morido · Accepted Answer · 2016-01-15T14:10:09

Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).

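Page boundaries are not stored in a Word file; the word processor computes them when it lays out the text. Since POI therefore fundamentally cannot read out page numbers, and Tika is not meant to be a document renderer either, the answer is simply: no, this is not possible.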

For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.
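
To make the point concrete: a .docx file is a ZIP archive whose main part, word/document.xml, holds paragraphs and runs rather than pages. The only page information in there is explicit page breaks (`w:br` with `w:type="page"`) and, if Word happened to save them, `w:lastRenderedPageBreak` hints from its last layout pass. The sketch below is not Tika or POI code. It is a rough, JDK-only illustration (the class and method names are made up for this example) that splits the text at those markers. It assumes a well-formed .docx, ignores headers, footnotes and similar parts, and does nothing for binary .doc files.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;
import javax.xml.XMLConstants;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika or POI.
class DocxPageHints {
    // Namespace of WordprocessingML elements (the usual "w:" prefix).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Opens a .docx (a ZIP archive) and splits its main part at page markers.
    static List<String> splitDocx(String path) throws Exception {
        try (ZipFile zip = new ZipFile(path)) {
            ZipEntry entry = zip.getEntry("word/document.xml");
            if (entry == null) {
                throw new IOException("word/document.xml missing, not a .docx?");
            }
            try (InputStream in = zip.getInputStream(entry)) {
                return splitAtPageMarkers(in);
            }
        }
    }

    // Collects run text and starts a new chunk at every page marker.
    static List<String> splitAtPageMarkers(InputStream documentXml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        factory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        org.w3c.dom.Document dom = factory.newDocumentBuilder().parse(documentXml);

        List<String> pages = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        walk(dom.getDocumentElement(), pages, current);
        pages.add(current.toString()); // text after the last marker
        return pages;
    }

    private static void walk(Node node, List<String> pages, StringBuilder current) {
        boolean isParagraph = false;
        if (node instanceof Element && W_NS.equals(node.getNamespaceURI())) {
            Element element = (Element) node;
            String name = element.getLocalName();
            boolean explicitBreak = name.equals("br")
                    && "page".equals(element.getAttributeNS(W_NS, "type"));
            // lastRenderedPageBreak is only a hint left over from Word's last layout.
            if (explicitBreak || name.equals("lastRenderedPageBreak")) {
                pages.add(current.toString());
                current.setLength(0);
                return;
            }
            if (name.equals("t")) { // w:t carries the actual text of a run
                current.append(element.getTextContent());
                return;
            }
            isParagraph = name.equals("p");
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            walk(children.item(i), pages, current);
        }
        if (isParagraph) {
            current.append('\n'); // end of a w:p paragraph
        }
    }
}
```

Calling `DocxPageHints.splitDocx("some.docx")` returns one string per chunk between markers. A document with no explicit breaks and no saved hints comes back as a single chunk, and the hints can be stale or absent depending on which application last saved the file, so treat the result as a guess rather than real pagination.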
