I am using the Apache Tika command-line tool to extract text from .doc and .docx files. I can get the whole text, but I am unable to get it split into pages so that I can store each page separately. Is there any way to achieve that?
1 Answer
Tika uses Apache POI to process Word files (both the old binary- and the newer XML-based flavors).
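POI fundamentally cannot read page numbers out of these files, because Word documents store flowing text rather than fixed pages; page boundaries are only computed when an application lays the document out. Tika is not meant to be a document renderer either, so the answer is quite simply: no, this is not possible.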
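To make the point concrete, here is a minimal sketch in Python (standard library only, not Tika or POI) that looks inside a .docx for anything resembling a page marker. It assumes the usual OOXML layout, where the main text lives in `word/document.xml`. The helper names `count_page_markers` and `make_sample_docx` are made up for this illustration, and the sample file is a stripped-down archive built in memory, not a complete Word document.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by elements in word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_page_markers(docx_bytes):
    """Return (explicit_breaks, rendered_hints) found in word/document.xml.

    explicit_breaks: manual page breaks, i.e. <w:br w:type="page"/>
    rendered_hints:  <w:lastRenderedPageBreak/> elements, which are only
                     cached hints from the last application that laid
                     the document out (they may be absent or stale)
    """
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    explicit = sum(
        1
        for br in root.iter(f"{{{W_NS}}}br")
        if br.get(f"{{{W_NS}}}type") == "page"
    )
    hints = sum(1 for _ in root.iter(f"{{{W_NS}}}lastRenderedPageBreak"))
    return explicit, hints


def make_sample_docx():
    """Build a tiny, hypothetical .docx-like archive in memory for the demo."""
    body = (
        f'<w:document xmlns:w="{W_NS}"><w:body>'
        "<w:p><w:r><w:t>First paragraph</w:t></w:r></w:p>"
        '<w:p><w:r><w:br w:type="page"/><w:t>After a manual break</w:t></w:r></w:p>'
        "<w:p><w:r><w:t>Long text that merely flows onto later pages</w:t></w:r></w:p>"
        "</w:body></w:document>"
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", body)
    return buffer.getvalue()


if __name__ == "__main__":
    print(count_page_markers(make_sample_docx()))  # prints (1, 0)
```

Only the manual break shows up. Text that simply runs past the end of a page leaves no marker in the file, which is why splitting the extracted text on such markers would not reproduce the pages you see in Word.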
For a little more insight on why your requirement (from a technical standpoint) does not make much sense, see my answer here.