
I have a problem with 'XWPFDocument'. My part of the program takes 'docx' files and copies all of their content into one output 'docx' file, including text, tables, pictures and formulas. This has worked well, but recently I hit a bug: one picture was not copied into the result. This is the source and this is the result. In the result you can see that the images in part "3.1.6.2" were copied successfully, but not the ones in "3.1.6.1".

Here is how I do it:

for (XWPFRun run : oldParagraph.getRuns()) {

        XWPFRun newRun = newParagraph.createRun()

        // Copy the text of the run, if it has any
        if (run.getText(0) != null && !run.getText(0).isEmpty()) {
            // ... copy text ...
        }
        // Copy every picture embedded in the run
        if (run.getEmbeddedPictures() != null && run.getEmbeddedPictures().size() > 0) {
            for (XWPFPicture pic : run.getEmbeddedPictures()) {

                // Raw image bytes, size in EMU (cx, cy) and picture type
                byte[] img = pic.getPictureData().getData()
                long cx = pic.getCTPicture().getSpPr().getXfrm().getExt().getCx()
                long cy = pic.getCTPicture().getSpPr().getXfrm().getExt().getCy()
                int pictureType = pic.getPictureData().getPictureType()

                XWPFDocument document = newParagraph.getDocument()

                // Register the image in the target document, then create the drawing with the original size
                String blipId = document.addPictureData(new ByteArrayInputStream(img), pictureType)
                createPictureCxCy(document, blipId, document.getNextPicNameNumber(pictureType), cx, cy)
            }
        }
    }

The key point here is:

 for (XWPFPicture pic : run.getEmbeddedPictures())

I am getting the embedded pictures from the 'run'. In the bad file I have 5 'paragraphs' with 1 'run' inside each; 4 of them have text and 1 is empty. Usually exactly this kind of empty 'run' holds the embedded picture, and judging by the order, the picture should be here. But now the run is completely empty. The picture does exist in the XWPFDocument, though, in the lists 'pictures' and 'packagePictures'.

The problem: these lists hold 'XWPFPictureData' objects, which do not contain information about the picture's location in the document or its scale. 'run.getEmbeddedPictures()', on the other hand, returns 'XWPFPicture' objects, which is what we need. Is there any way out of this situation?

Update for the first comment.

I checked:

        for(XWPFParagraph paragraph: document.getParagraphs()) {
            for (XWPFRun run : paragraph.getRuns()) {
                println "run text: " + run.getText(0)
                println "embedded picture count: " + run.getEmbeddedPictures().size()
            }
        }
        println "*** for document picture count: " + document.allPictures.size()

Result was:

run text:      3.1.6.1 В ряде районов сейсмические нагрузки  на СПБУ ...
embedded picture count: 0
run text:      Интегральное сейсмическое воздействие на  СПБУ ...
embedded picture count: 0
run text: null
embedded picture count: 0
run text:  Рис. 3.1.6.1 Обообщенный коэффициент динамичности: ...
embedded picture count: 0
run text:      Р01 — низшая частота горизонтальных колебаний
embedded picture count: 0
*** for document picture count: 4

I have no idea why the picture count is 4. Second, about the anchor: I did not find it. Moreover, I did not find one in the other files either, the ones that work correctly. In one article I read: "Objects can be placed in your document in two ways: either inline or floating." Only floating objects have an anchor.

Maybe you are searching in the wrong paragraph? Open the source Word file in Microsoft Word and select the missing picture. Check whether a little anchor symbol appears somewhere. It points to the paragraph with the run that the picture is anchored to. - Axel Richter
"only floating object have anchor": Sure, this is true. My comment was only a suspicion. But if the picture is embedded inline, then it should also be found. So we need the affected source document to be able to reproduce this behavior. - Axel Richter
Sorry, I don't have a Google account, so I can't help. - Axel Richter
Oh, I didn't think about that. Here you are: dropmefiles.com/T2Yrd - Ivan.K.

1 Answer


The document does not contain a simple picture but a Drawing Canvas, as described in "Add a drawing to a document".

This drawing canvas contains two pictures aligned to each other.

You can see this after opening the file in Word, where you can select three objects: the canvas and the two pictures in it.

This drawing canvas is represented in document.xml by an AlternateContent element:

<mc:AlternateContent>
 <mc:Choice Requires="wpg">
  <w:drawing>
   <wp:inline distT="0" distB="0" distL="0" distR="0">
    ...
   </wp:inline>
  </w:drawing>
 </mc:Choice>
 <mc:Fallback>
  <w:pict>
   ...
  </w:pict>
 </mc:Fallback>
</mc:AlternateContent>

Apache POI cannot interpret this XML, at least not yet.

You could write your own method to interpret this XML, but that would be a bigger task.

If there are not too many of those files, the best you can do is open the *.docx in Word, select the whole canvas, cut it to the clipboard, and paste it back as a JPEG picture using Paste Special. Then save the *.docx again.

Or get the two pictures from the document directly. But there they appear twice, because the AlternateContent element provides a Fallback element that holds the canvas content again as a Base64-encoded ZIP archive and again holds the picture references. That is why the output shows "*** for document picture count: 4".
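As a starting point for such a method, here is a minimal sketch in plain Java (JDK only, no Apache POI) that lists the picture relationship ids referenced in each branch of the AlternateContent elements. The class and method names (CanvasPictureRefs, parse, pictureRefs) are made up for illustration. It assumes that pictures in the Choice branch are referenced by a:blip elements with an r:embed attribute and pictures in the VML Fallback branch by v:imagedata elements with an r:id attribute. Check this against your own document.xml.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Apache POI.
class CanvasPictureRefs {
    static final String MC = "http://schemas.openxmlformats.org/markup-compatibility/2006";
    static final String A = "http://schemas.openxmlformats.org/drawingml/2006/main";
    static final String V = "urn:schemas-microsoft-com:vml";
    static final String R = "http://schemas.openxmlformats.org/officeDocument/2006/relationships";

    /** Parses the text of word/document.xml into a namespace-aware DOM. */
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            return factory.newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse document XML", e);
        }
    }

    /**
     * Collects the relationship ids of pictures referenced in one branch type:
     * a:blip/@r:embed inside mc:Choice, or v:imagedata/@r:id inside mc:Fallback.
     * Assumption: AlternateContent elements are not nested; nested ones would be counted twice.
     */
    static List<String> pictureRefs(Document doc, boolean fallback) {
        List<String> ids = new ArrayList<>();
        NodeList branches = doc.getElementsByTagNameNS(MC, fallback ? "Fallback" : "Choice");
        for (int i = 0; i < branches.getLength(); i++) {
            Element branch = (Element) branches.item(i);
            NodeList pics = fallback
                    ? branch.getElementsByTagNameNS(V, "imagedata")
                    : branch.getElementsByTagNameNS(A, "blip");
            for (int j = 0; j < pics.getLength(); j++) {
                String id = ((Element) pics.item(j)).getAttributeNS(R, fallback ? "id" : "embed");
                if (!id.isEmpty()) {
                    ids.add(id);
                }
            }
        }
        return ids;
    }
}
```

Calling pictureRefs(doc, false) and pictureRefs(doc, true) on the parsed document.xml shows which ids each branch references. Each id can then be resolved to an image part through the relationships of the document part. Position and size still have to be read from the surrounding drawing XML, which this sketch does not attempt.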

Simply unzip the *.docx archive and look in /word/document.xml to see this.
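The same check can be done from code instead of by hand. The following is a minimal sketch using only java.util.zip. The class and method names (DocxXmlReader, readDocumentXml, hasAlternateContent) are hypothetical, and it assumes the entry is UTF-8 encoded and that the markup compatibility namespace uses the conventional mc prefix, as in the XML shown above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper, not part of Apache POI.
class DocxXmlReader {

    /** Returns the text of word/document.xml from the bytes of a .docx file, or null if the entry is missing. */
    static String readDocumentXml(byte[] docxBytes) {
        try (ZipInputStream zip = new ZipInputStream(new ByteArrayInputStream(docxBytes))) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if (entry.getName().equals("word/document.xml")) {
                    ByteArrayOutputStream out = new ByteArrayOutputStream();
                    byte[] buffer = new byte[8192];
                    int n;
                    while ((n = zip.read(buffer)) != -1) {
                        out.write(buffer, 0, n);
                    }
                    // Assumption: the part is UTF-8 encoded
                    return new String(out.toByteArray(), StandardCharsets.UTF_8);
                }
            }
            return null;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Plain text check; assumes the conventional "mc" namespace prefix. */
    static boolean hasAlternateContent(String documentXml) {
        return documentXml != null && documentXml.contains("<mc:AlternateContent");
    }
}
```

For a file on disk, the bytes could come from java.nio.file.Files.readAllBytes. Running hasAlternateContent over a batch of input files would flag the ones that need the manual Word workaround before they are merged.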