6
votes

Is it possible to create a tagged PDF (PDF/UA) with PDFBox? It looks like PDFBox has an API for that (package org.apache.pdfbox.pdmodel.documentinterchange.taggedpdf), but I can't find any tutorials or code examples.

Using the code below, I generated a PDF file containing an image, and the screen reader NVDA (in my case) recognizes it and reads '... graphic Alternate Description'. However, the accessibility checker PAC 2 shows an error: 'Image object not tagged'.

        PDDocument doc = new PDDocument();
        PDPage page = new PDPage();
        doc.addPage(page);
        PDDocumentCatalog documentCatalog = doc.getDocumentCatalog();

        PDImageXObject pdImage = PDImageXObject.createFromFile(imagePath, doc);
        PDPageContentStream contents = new PDPageContentStream(doc, page);
        contents.drawImage(pdImage, 100, 600, pdImage.getWidth() / 2, pdImage.getHeight() / 2);
        contents.close();

        PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
        PDStructureElement structureElement = new PDStructureElement(StandardStructureTypes.Figure, treeRoot);
        structureElement.setPage(page);

        PDMarkedContent markedImg = new PDMarkedContent(COSName.IMAGE, new COSDictionary());
        markedImg.addXObject(pdImage);

        structureElement.appendKid(markedImg);
        structureElement.setAlternateDescription("Alternate Description");
        treeRoot.appendKid(structureElement);
        documentCatalog.setStructureTreeRoot(treeRoot);
        // ....
        doc.save(fileName);

Can you provide an explanation and/or code examples on this subject?

1
There are no examples, sadly, mostly because none of us is involved with creating such files, AFAIK. (I am a PDFBox committer) The only thing I can do for you is to fix any bugs you may find. What you could do is to create a file with a different tool, then use PDFBox PDFDebugger to see the structure and reproduce it. - Tilman Hausherr
@TilmanHausherr , thanks for PDFDebugger. The main question now is how to write PDStructureElement directly in PDPageContentStream. - Leonid Muzyka
I assume you mean BMC, BDC, EMC, MP, DP. At this time you'd need to use the (deprecated) "raw" methods. Or you create a request in JIRA for some new methods :-) - Tilman Hausherr
PDFBox 1.8 can create PDF/A, but only PDF/A-1b, not PDF/A-1a, which also covers PDF/UA. I haven't been able to find out if PDFBox 2.0 supports PDF/A-1a. If a PDF/A document generated with PDFBox 2 does not have accessibility tags, I would assume it is not supported yet? - Tsundoku
@leomuz, do you have acrobat? you can run the accessibility checker within acrobat to see if it has the same error as pac2. you can also look at the tag tree (view > show/hide > nav panes > tags). if you don't have acrobat, you can contact me offline and i can take a look at your file. look at my stackoverflow profile to see how to contact me. i can't help with pdfbox but perhaps seeing where the error is might help. - slugolicious

1 Answer

3
votes

I put up a working example which demonstrates creating an accessible PDF using PDFBox 2: https://github.com/martinlovell/accessible-pdfbox-example

There are a few things missing from the code in the question. The marked content needs alt text, and I believe you also need MCIDs (marked-content IDs) for that marked content.

The example project demonstrates in more detail what you need.

It would be something like this:

PDPageContentStream contents = new PDPageContentStream(doc, page);

// the content in the stream needs an id
int mcid = 5;
COSDictionary dictionary = new COSDictionary();
dictionary.setInt(COSName.MCID, mcid);

// wrap image drawing in marked content
contents.beginMarkedContent(COSName.IMAGE, PDPropertyList.create(dictionary));
contents.drawImage(pdImage, 100, 600, pdImage.getWidth() / 2, pdImage.getHeight() / 2);
contents.endMarkedContent();

contents.close();

PDStructureTreeRoot treeRoot = new PDStructureTreeRoot();
documentCatalog.setStructureTreeRoot(treeRoot);
PDStructureElement structureElement = new PDStructureElement(StandardStructureTypes.Figure, treeRoot);
structureElement.setPage(page);
structureElement.setAlternateDescription("Alternate Description");

// Set alt text on marked content for structure.  
// This is the dictionary with the mcid used in beginMarkedContent.
dictionary.setString(COSName.ALT, "Alternate Description");
PDMarkedContent markedImg = new PDMarkedContent(COSName.IMAGE, dictionary);
markedImg.addXObject(pdImage);
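structureElement.appendKid(markedImg);
// attach the Figure element to the structure tree, as in the question's code
treeRoot.appendKid(structureElement);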
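
To make the role of the MCID more concrete, here is a minimal sketch of what a marked-content sequence looks like inside a page content stream, using the BDC and EMC operators mentioned in the comments. It is plain Java with no PDFBox dependency and only builds the operator text. The class name MarkedContentSketch, the helper names wrap and drawImage, the resource name Im1, and the 200 x 150 size are all made up for illustration. It writes the property dictionary inline; PDFBox may instead reference it by name through the page's Properties resources, which the PDF specification also allows.

```java
import java.util.Locale;

public class MarkedContentSketch {

    // Wraps raw content-stream operators in a marked-content sequence:
    //   /Tag <</MCID n>> BDC ... EMC
    // The MCID is what a structure element uses to point at this content.
    public static String wrap(String tag, int mcid, String operators) {
        return "/" + tag + " <</MCID " + mcid + ">> BDC\n"
                + operators + "\n"
                + "EMC\n";
    }

    // Operators that paint an image XObject scaled to w x h at (x, y).
    // "resourceName" is a hypothetical XObject name such as Im1.
    public static String drawImage(String resourceName, float x, float y, float w, float h) {
        return String.format(Locale.ROOT,
                "q\n%.1f 0 0 %.1f %.1f %.1f cm\n/%s Do\nQ",
                w, h, x, y, resourceName);
    }

    public static void main(String[] args) {
        // Same tag (Image) and MCID (5) as in the snippet above.
        System.out.print(wrap("Image", 5, drawImage("Im1", 100, 600, 200, 150)));
    }
}
```

If you open a tagged file in PDFDebugger, as suggested in the comments, the page content stream should contain sequences of this general shape, and the Figure element in the structure tree should refer to the same MCID.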