Extracting heading and paragraphs from doc and docx files using apache-poi

Question

I am trying to read Microsoft Word documents with apache-poi and found a couple of convenient methods for scanning a document, such as getText() and getParagraphList(). My use case is slightly different, though: I want to scan a document and receive events/information such as heading, paragraph and table in the same sequence as they appear in the document. That would help me build a document structure like this:

    <content>
      <section>
        <heading>ABC</heading>
        <paragraph>xyz</paragraph>
        <paragraph>scanning through APIs</paragraph>
      </section>
      .
      .
      .
    </content>

The main intent is to preserve the relationship between headings and paragraphs as it is in the original document. I'm not sure, but could something like this work for me?

    Iterator<IBodyElement> itr = doc.getBodyElementsIterator();
    while (itr.hasNext()) {
        IBodyElement ele = itr.next();
        System.out.println(ele.getElementType());
    }

With this code I was able to get the paragraphs but not the heading information. To be clear, I am interested in all headings, whether they are explicitly marked as headings via a style or just set in a large font size.

There is no straight way provided for this in apache-poi but org.apache.tika.parser.microsoft.WordExtractor shows a trick to accomplish the same. — Prateek Jain

Gagravarr · Accepted Answer · 2015-04-09T19:19:54

Headers aren't stored inline in the main document; they live elsewhere, which is why you're not getting them as body elements. Body elements are things like paragraphs, tables and content controls (SDTs), not headers, so you have to fetch the headers yourself.

If you look at this code in Apache Tika, you'll see an example of how to do so. Assuming you're iterating over the body elements and want the headers and footers that go with each paragraph, you'll want code something like this (based on the Tika code):

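    import org.apache.poi.xwpf.model.XWPFHeaderFooterPolicy;
    import org.apache.poi.xwpf.usermodel.*;
    import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTSectPr;

    // Fragment: `bodyElement` is assumed to be an IBody (for example the
    // XWPFDocument itself) and `document` the XWPFDocument being read.
    for (IBodyElement element : bodyElement.getBodyElements()) {
        if (element instanceof XWPFParagraph) {
            XWPFParagraph paragraph = (XWPFParagraph) element;
            XWPFHeaderFooterPolicy headerFooterPolicy = null;

            // A paragraph that carries section properties marks a section,
            // and the section properties say which header/footer applies.
            if (paragraph.getCTP().getPPr() != null) {
                CTSectPr ctSectPr = paragraph.getCTP().getPPr().getSectPr();
                if (ctSectPr != null) {
                    headerFooterPolicy = new XWPFHeaderFooterPolicy(document, ctSectPr);
                    // Handle header
                }
            }
            // Handle paragraph
            if (headerFooterPolicy != null) {
                // Handle footer
            }
        } else if (element instanceof XWPFTable) {
            XWPFTable table = (XWPFTable) element;
            // Handle table
        } else if (element instanceof XWPFSDT) {
            XWPFSDT sdt = (XWPFSDT) element;
            // Handle SDT (structured document tag / content control)
        }
    }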
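
The loop above covers page headers and footers. If what you want is headings in the sense of heading-styled paragraphs, as in the question, one option is to classify each paragraph while iterating and open a new section whenever a heading shows up. Below is a minimal sketch of just that grouping step in plain Java, without POI, so the logic is easy to see. The names `SectionBuilder`, `Element`, `isHeading` and `toXml` are hypothetical, and the rule "a style ID starting with `Heading` marks a heading" is an assumption that depends on the document's styles. With POI, the style ID would typically come from `XWPFParagraph.getStyle()`. Headings that are only visually large text have no such style and would need a separate font-size check, which this sketch does not attempt.

```java
import java.util.Arrays;
import java.util.List;

class SectionBuilder {

    /** One body element in document order; styleId may be null (hypothetical type). */
    static final class Element {
        final String styleId;
        final String text;

        Element(String styleId, String text) {
            this.styleId = styleId;
            this.text = text;
        }
    }

    /** Assumption: heading style IDs look like "Heading1", "Heading2", ... */
    static boolean isHeading(Element e) {
        return e.styleId != null && e.styleId.startsWith("Heading");
    }

    /**
     * Groups elements into sections: every heading closes the current
     * section (if any) and opens a new one. Text is not XML-escaped here.
     */
    static String toXml(List<Element> elements) {
        StringBuilder out = new StringBuilder("<content>\n");
        boolean sectionOpen = false;
        for (Element e : elements) {
            if (isHeading(e)) {
                if (sectionOpen) {
                    out.append("  </section>\n");
                }
                out.append("  <section>\n");
                out.append("    <heading>").append(e.text).append("</heading>\n");
                sectionOpen = true;
            } else {
                // Paragraphs before the first heading get a section of their own.
                if (!sectionOpen) {
                    out.append("  <section>\n");
                    sectionOpen = true;
                }
                out.append("    <paragraph>").append(e.text).append("</paragraph>\n");
            }
        }
        if (sectionOpen) {
            out.append("  </section>\n");
        }
        return out.append("</content>\n").toString();
    }

    public static void main(String[] args) {
        List<Element> elements = Arrays.asList(
                new Element("Heading1", "ABC"),
                new Element(null, "xyz"),
                new Element(null, "scanning through APIs"));
        System.out.print(toXml(elements));
    }
}
```

In a real reader you would build the `Element` list inside the body-element loop, in the paragraph branch, so that the order of headings and paragraphs matches the document.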
