3
votes

I'm trying to count the pages of a Word document with Java.

This is my actual code; I'm using the Apache POI libraries:

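import java.io.File;
import java.io.FileInputStream;
import java.io.InputStream;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.poifs.filesystem.POIFSFileSystem;

String path1 = "E:/iugkh";
File[] files = new File(path1).listFiles();
int pagesCount = 0;
for (File file : files) {
    // try-with-resources closes the stream even if parsing fails
    try (InputStream in = new FileInputStream(file)) {
        HWPFDocument wdDoc = new HWPFDocument(new POIFSFileSystem(in));
        // Reads the page count stored in the file's metadata
        int pagesNo = wdDoc.getSummaryInformation().getPageCount();
        pagesCount += pagesNo;
        System.out.println(file.getName() + ":\t" + pagesNo);
    }
}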

The output is:

ten.doc:    1
twelve.doc: 1
nine.doc:   1
one.doc:    1
eight.doc:  1
4teen.doc:  1
5teen.doc:  1
six.doc:    1
seven.doc:  1

This is not what I expected: the first three documents are 4 pages long, and the others are between 1 and 5 pages long.

What am I missing?

Do I have to use another library to count the pages correctly?

Thanks in advance

2
Sounds like Word hasn't bothered to update the statistics in the files (depressingly common). If you open the file in Word, view the stats, then save, does that fix it? – Gagravarr
Is this working now? I tested it with POI 3.9 and it worked for me. Thanks. – teckysols
Have you resolved the issue? Can you tell me how you get the page count? – Muneem Habib
@MuneemHabib No, I did not solve the issue. It relies on the document metadata; if Word doesn't update it, you won't be able to get the page count. – BackSlash
@BackSlash I have an offset value and want to get the number of the page where this offset lies. Any idea? – Muneem Habib

2 Answers

2
votes

This may help you. It counts the number of form feeds (sometimes used to separate pages), but I'm not sure it's going to work for all documents (I guess it does not).

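// import org.apache.poi.hwpf.extractor.WordExtractor;
// "document" is an already opened HWPFDocument
WordExtractor extractor = new WordExtractor(document);
String[] paragraphs = extractor.getParagraphText();

// Start at 1: a document with no page break still has one page
int pageCount = 1;
for (int i = 0; i < paragraphs.length; ++i) {
    // A form feed ("\f") marks an explicit page break
    if (paragraphs[i].indexOf("\f") >= 0) {
        ++pageCount;
    }
}

System.out.println(pageCount);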
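
The loop above adds one page per paragraph that contains a form feed, so a paragraph holding two breaks would still count once. A minimal sketch of a variant that counts every form feed is below. It works on plain strings, so it runs without POI; `FormFeedPageCounter` and `estimatePages` are hypothetical names, not part of the POI API. Like the original, it only sees explicit page breaks, so pages that come from text simply flowing onto the next page are not counted.

```java
// Hypothetical helper for illustration; not part of Apache POI.
final class FormFeedPageCounter {

    // Estimates the page count as 1 + the number of form-feed characters.
    // Assumption: each '\f' in the extracted text is an explicit page break.
    static int estimatePages(String[] paragraphs) {
        int pages = 1;
        for (String paragraph : paragraphs) {
            for (int i = 0; i < paragraph.length(); i++) {
                if (paragraph.charAt(i) == '\f') {
                    pages++;
                }
            }
        }
        return pages;
    }

    public static void main(String[] args) {
        // Two form feeds in total, so the estimate is 3 pages.
        String[] sample = {"first page\f", "second page\fthird page", "still third"};
        System.out.println(estimatePages(sample));
    }
}
```

With POI, the array returned by `extractor.getParagraphText()` could be passed straight to `estimatePages`.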
0
votes

This, alas, is a bug in some versions of Word (pre-2010 versions apparently, possibly just Word 9.0, aka Word 2000), or at least in some versions of the COM previewer that's used to count the pages. The Apache devs declined to implement a workaround for it: https://issues.apache.org/jira/browse/TIKA-1523

In fact, when you open the file in Word, it of course shows the real pages and also recalculates the count, but initially it also shows "1". The metadata as saved in the file is simply "1", or maybe nothing (see below). POI does not "reflow" the layout to calculate that information.

The metadata is only updated by the word processing program when it opens and edits the file. If you instruct Word 2010 to open the file "read only" (which it does because it's downloaded from the internet), it shows "" in the page column. See 2nd screenshot. So this is clearly a bug in this file, not TIKA's or POI's issue.

I also found in there that the bug (for Word 9.0/2000) was confirmed by MS: http://support.microsoft.com/kb/212653/en-us

If opening and re-saving with a newer version of Word is not possible or available, another workaround would be to convert the documents to PDF (or even XPS) and count the pages of that.
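
One way the conversion step might be scripted is sketched below. It assumes LibreOffice is installed and its `soffice` launcher is on the PATH; that tool is my assumption and is not mentioned above. `PdfConversionSketch`, `buildCommand` and `convert` are hypothetical names. The resulting PDF would still need a PDF library to read its page count.

```java
import java.io.File;
import java.io.IOException;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper for illustration; assumes LibreOffice's "soffice" is on the PATH.
final class PdfConversionSketch {

    // Builds the headless LibreOffice command that converts one document to PDF.
    static List<String> buildCommand(File doc, File outDir) {
        return Arrays.asList(
                "soffice", "--headless",
                "--convert-to", "pdf",
                "--outdir", outDir.getPath(),
                doc.getPath());
    }

    // Runs the conversion and returns the process exit code.
    static int convert(File doc, File outDir) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(buildCommand(doc, outDir))
                .inheritIO()
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command, so this runs even without LibreOffice installed.
        System.out.println(buildCommand(new File("ten.doc"), new File("pdf-out")));
    }
}
```

Because the converter lays the document out again, the PDF page count does not depend on the stale metadata, although it may differ slightly from what Word itself would show.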