
I want to extract the h1 and h2 headings from a Word .docx file together with the page number each heading appears on. For example, page 1 contains the headings "heading h1" and "heading h2", and other pages contain further h1 and h2 headings. I want each heading along with its page number, in a structure something like this:

array(
    0 => array(
        'h1'   => array('h1 headings go here'),
        'h2'   => array('h2 headings go here...'),
        'page' => 'page number here',
    ),
)

I am able to get the headings by opening the .docx as a ZIP archive and reading its XML with DOMDocument. But I am not able to get the page number that a particular heading came from.
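A minimal sketch of the ZIP plus DOMDocument approach described above, not the asker's actual code. It assumes the headings use Word's default English style IDs `Heading1` and `Heading2` (localized or custom templates may use other IDs); the function names `extractHeadings` and `readDocumentXml` are made up for illustration.

```php
<?php
/**
 * Returns [['level' => 1, 'text' => '...'], ...] for every paragraph whose
 * style ID is Heading1 or Heading2, in document order.
 * Assumption: default English style IDs.
 */
function extractHeadings(string $documentXml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);

    $xpath = new DOMXPath($dom);
    // WordprocessingML main namespace used by word/document.xml.
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $headings = [];
    foreach ($xpath->query('//w:p[w:pPr/w:pStyle]') as $p) {
        $styleId = $xpath->evaluate('string(w:pPr/w:pStyle/@w:val)', $p);
        if (!preg_match('/^Heading([12])$/', $styleId, $m)) {
            continue; // not an h1/h2 paragraph
        }
        // A heading's text can be split across several runs, so join all w:t nodes.
        $text = '';
        foreach ($xpath->query('.//w:t', $p) as $t) {
            $text .= $t->textContent;
        }
        $headings[] = ['level' => (int) $m[1], 'text' => $text];
    }
    return $headings;
}

/** Reads word/document.xml out of a .docx file, which is a ZIP archive. */
function readDocumentXml(string $docxPath): string
{
    $zip = new ZipArchive();
    if ($zip->open($docxPath) !== true) {
        throw new RuntimeException("Cannot open $docxPath");
    }
    $xml = $zip->getFromName('word/document.xml');
    $zip->close();
    if ($xml === false) {
        throw new RuntimeException('word/document.xml not found in archive');
    }
    return $xml;
}
```

Usage would look like `extractHeadings(readDocumentXml('report.docx'))`, where `report.docx` is a placeholder path. Note that nothing in this output carries a page number, which is exactly the gap the question is about.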

Please share the best way to achieve this functionality.

Can you please share what you have tried so far? This is not the place to get your work done. - Utkarsh Dixit
Simplest, in the end, might be to generate a TableOfContents, parse that, then remove it again from the document (or close the document without saving). - Cindy Meister
Hi, I have tried reading the docx by first treating it as a zip and then reading its document.xml using DOMDocument. I can read the content, but I am not able to tell which page a particular piece of content comes from. - Arsh
Hi Cindy, can you please explain that in more detail? Do you suggest the XML read method? - Arsh
Referring to your other reply: If you have to work with the underlying Word Open XML and not with automating the Word application then it is NOT possible to get the page numbers. Word does not store the paging information in a document because page layout is generated "on-the-fly" every time the document is opened/edited in the UI. - Cindy Meister

1 Answer


I doubt that the page number is even stored in the docx, as it does not have to be generated before printing. Word can show it while you edit because it computes the layout for display, but it does not store the result.

As Cindy Meister mentions in a comment on your question, you may be able to get the page number from a table of contents (or index) if there is one in the document. In that case, just find the line in the TOC that corresponds to your h1 or h2.

But even then the page numbers in the TOC might be out of date, since it might not be updated until the document is printed.
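A rough sketch of the TOC lookup, under several assumptions: the document already contains a generated table of contents, its entries use the default English style IDs `TOC1` and `TOC2`, each entry ends with a tab followed by an Arabic page number, and the numbers were current when the file was last saved. The function name `extractTocEntries` is hypothetical.

```php
<?php
/**
 * Returns [['level' => 1, 'text' => '...', 'page' => 3], ...] from the
 * table-of-contents paragraphs in word/document.xml.
 * Assumptions: style IDs TOC1/TOC2, and "title <tab> page number" entries.
 */
function extractTocEntries(string $documentXml): array
{
    $dom = new DOMDocument();
    $dom->loadXML($documentXml);

    $xpath = new DOMXPath($dom);
    $xpath->registerNamespace('w', 'http://schemas.openxmlformats.org/wordprocessingml/2006/main');

    $entries = [];
    foreach ($xpath->query('//w:p[w:pPr/w:pStyle]') as $p) {
        $styleId = $xpath->evaluate('string(w:pPr/w:pStyle/@w:val)', $p);
        if (!preg_match('/^TOC([12])$/', $styleId, $m)) {
            continue; // not a level 1 or level 2 TOC entry
        }
        // Rebuild the visible line: text nodes plus tab characters found inside runs.
        // Restricting to children of w:r skips the tab-stop definitions in w:pPr.
        $line = '';
        foreach ($xpath->query('.//w:r/*[self::w:t or self::w:tab]', $p) as $node) {
            $line .= $node->localName === 'tab' ? "\t" : $node->textContent;
        }
        // The page number is whatever follows the last tab on the line.
        $pos = strrpos($line, "\t");
        if ($pos === false) {
            continue;
        }
        $page = trim(substr($line, $pos + 1));
        if (!ctype_digit($page)) {
            continue; // e.g. roman numerals or a missing number; ignored in this sketch
        }
        $entries[] = [
            'level' => (int) $m[1],
            'text'  => trim(substr($line, 0, $pos)),
            'page'  => (int) $page,
        ];
    }
    return $entries;
}
```

The returned entries can then be matched against the heading texts to fill in the page for each h1 and h2. The page numbers are only the values Word cached the last time the TOC was updated, so they are as reliable as the TOC itself.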