
I have a simple problem that is driving me crazy. I want to extract the paragraph structure and document outline of a .docx file using the POI or docx4j libraries. I already did this for a legacy .doc file with POI's paragraph.getLvl() method. Is there a way to get the same result for a .docx? How can I reconstruct the entire TOC structure of the document?


Solution:

I solved it this way:

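    import java.util.HashMap;
    import java.util.Iterator;
    import java.util.Map;
    import org.apache.poi.xwpf.usermodel.XWPFParagraph;
    import org.apache.poi.xwpf.usermodel.XWPFRun;
    import org.docx4j.wml.Style;
    import org.docx4j.wml.Styles;

    // Maps a style *name* (w:name, e.g. "heading 2") to its 1-based heading depth.
    Map<String, Integer> headingMap = new HashMap<>();
    for (int i = 1; i <= 9; i++) {
        headingMap.put("heading " + i, i);
    }

    Iterator<XWPFParagraph> iterator = docx.getParagraphsIterator();
    // getStyle(...) is a helper not shown here that loads the document's styles.
    Styles styles = getStyle(completePath);

    while (iterator.hasNext()) {
        XWPFParagraph p = iterator.next();
        if (p == null || p.getStyleID() == null) {
            continue;
        }
        for (Style s : styles.getStyle()) {
            // Match on the style ID first, then check whether the style's name is a heading.
            if (p.getStyleID().equals(s.getStyleId())
                    && s.getName() != null
                    && headingMap.containsKey(s.getName().getVal())) {
                int level = headingMap.get(s.getName().getVal());
                StringBuilder text = new StringBuilder();
                for (XWPFRun run : p.getRuns()) {
                    text.append(run.toString());
                }
                // level and text now describe one outline entry.
            }
        }
    }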
Does POI support .docx? (duffymo)
Yes, POI supports the Office Open XML format (OOXML), and thus .docx. (YoBre)

1 Answer


The outline level can be set directly on the paragraph, or in the style hierarchy, so your real challenge is navigating the style hierarchy to get it.

A paragraph which has outline level set directly on it will look like:

        <w:p>
            <w:pPr>
                <w:outlineLvl w:val="2"/>
            </w:pPr>

Assuming a paragraph object p, in docx4j that is p.getPPr().getOutlineLvl(). Check getPPr() for null first, since a paragraph need not have a pPr.
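To make the XML side concrete without depending on either library, here is a minimal sketch that reads a directly set w:outlineLvl from a paragraph element using only the JDK's DOM parser. The class and method names (DirectOutlineLevel, directLevel, parse) are made up for illustration, and returning -1 for "not set" is my own convention.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

// Hypothetical helper, not part of POI or docx4j.
class DirectOutlineLevel {
    // WordprocessingML main namespace (the "w:" prefix in document.xml).
    static final String W_NS =
            "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the outline level set directly on a w:p element, or -1 if none. */
    static int directLevel(Element p) {
        for (Node c = p.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (!isW(c, "pPr")) continue;
            for (Node d = c.getFirstChild(); d != null; d = d.getNextSibling()) {
                if (isW(d, "outlineLvl")) {
                    return Integer.parseInt(((Element) d).getAttributeNS(W_NS, "val"));
                }
            }
        }
        return -1;
    }

    /** True if n is an element in the w: namespace with the given local name. */
    static boolean isW(Node n, String local) {
        return n.getNodeType() == Node.ELEMENT_NODE
                && W_NS.equals(n.getNamespaceURI())
                && local.equals(n.getLocalName());
    }

    /** Parses an XML string (namespace-aware) and returns its root element. */
    static Element parse(String xml) {
        try {
            DocumentBuilderFactory f = DocumentBuilderFactory.newInstance();
            f.setNamespaceAware(true);
            return f.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse XML", e);
        }
    }
}
```

A result of -1 here means the level, if any, has to come from the paragraph's style instead.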

If the level is defined on some style s, for example:

        <w:style w:type="paragraph" w:styleId="Heading2">
            <w:name w:val="heading 2"/>
            <w:basedOn w:val="Normal"/>
            <w:pPr>
                <w:outlineLvl w:val="1"/>
            </w:pPr>

you can read it with something like the following, which does not look at any style this one may be basedOn:

    private int getOutlineLvl(Style s) {
        // Outline levels are 0-based: Heading 1 is level 0, Heading 9 is level 8.
        // Return 9 for body text, i.e. a style with no outline level.
        if (s == null || s.getPPr() == null) return 9;

        OutlineLvl outlineLvl = s.getPPr().getOutlineLvl();
        if (outlineLvl == null) return 9;
        return outlineLvl.getVal().intValue();
    }
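The method above deliberately ignores basedOn. If you do need to follow the chain, the idea can be sketched like this. Note that this is plain Java over a stand-in model, not docx4j API: StyleInfo, OutlineResolver and effectiveLevel are hypothetical names, and in practice you would fill the map from the Styles part (style ID, basedOn value, and outline level if present).

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for one style definition from the Styles part.
class StyleInfo {
    final String basedOn;      // parent style ID, or null if there is none
    final Integer outlineLvl;  // w:outlineLvl value, or null if not set on this style

    StyleInfo(String basedOn, Integer outlineLvl) {
        this.basedOn = basedOn;
        this.outlineLvl = outlineLvl;
    }
}

class OutlineResolver {
    static final int BODY_TEXT = 9; // same convention as above: 9 means "no outline level"

    /** Walks the basedOn chain from styleId until some style defines an outline level. */
    static int effectiveLevel(String styleId, Map<String, StyleInfo> styles) {
        Set<String> seen = new HashSet<>(); // guards against a cyclic basedOn chain
        String id = styleId;
        while (id != null && seen.add(id)) {
            StyleInfo s = styles.get(id);
            if (s == null) break;                          // unknown style ID
            if (s.outlineLvl != null) return s.outlineLvl; // nearest definition wins
            id = s.basedOn;
        }
        return BODY_TEXT;
    }
}
```

A custom style based on Heading2 that sets no level of its own would resolve to level 1 this way, whereas the single-style check would report it as body text.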

In that case, the paragraph's pPr will contain something like:

                    <w:pStyle w:val="Heading2"/>

Read the style ID from there (w:pStyle holds the style's ID, not its display name), then look it up in the Styles part. Have a look at the docx4j source code to see how to do this.

The other thing you need to know is how to iterate over the paragraphs. Assuming you're not interested in paragraphs inside tables, you can just use a for loop over mdp.getContent(), where mdp is the main document part. See the docx4j cheat sheet for more.
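Once you have an outline level and the text for each paragraph in document order, rebuilding the TOC tree is a small stack exercise. The following is only a sketch in plain Java: TocNode and TocBuilder are names I made up, the input arrays stand in for whatever you collect while iterating, and levels use the 0-based convention above with 9 meaning body text.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical tree node for one outline entry.
class TocNode {
    final int level;    // 0-based outline level; -1 for the synthetic root
    final String title;
    final List<TocNode> children = new ArrayList<>();

    TocNode(int level, String title) {
        this.level = level;
        this.title = title;
    }
}

class TocBuilder {
    /** Builds a tree from per-paragraph levels and titles given in document order. */
    static TocNode build(int[] levels, String[] titles) {
        TocNode root = new TocNode(-1, "");
        Deque<TocNode> stack = new ArrayDeque<>();
        stack.push(root);
        for (int i = 0; i < levels.length; i++) {
            int lvl = levels[i];
            if (lvl < 0 || lvl > 8) continue; // skip body text and invalid levels
            // Pop until the top of the stack is a strictly shallower heading.
            while (stack.peek().level >= lvl) stack.pop();
            TocNode node = new TocNode(lvl, titles[i]);
            stack.peek().children.add(node);
            stack.push(node);
        }
        return root;
    }

    /** Renders the tree as text, indenting two spaces per nesting depth. */
    static String render(TocNode root) {
        StringBuilder sb = new StringBuilder();
        append(root, 0, sb);
        return sb.toString();
    }

    private static void append(TocNode node, int depth, StringBuilder sb) {
        for (TocNode child : node.children) {
            for (int i = 0; i < depth; i++) sb.append("  ");
            sb.append(child.title).append('\n');
            append(child, depth + 1, sb);
        }
    }
}
```

If a document skips a level (a Heading 3 directly under a Heading 1, say), this sketch nests the entry under the nearest shallower heading rather than inventing a placeholder.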