Getting PDF TextObjects with PDFBox

Question

I have a PDF from which I extracted a page using PDFBox:

(...)
File input = new File("C:\\temp\\sample.pdf");
document = PDDocument.load(input);
List allPages = document.getDocumentCatalog().getAllPages();
PDPage page = (PDPage) allPages.get(2);
PDStream contents = page.getContents();
if (contents != null) {
System.out.println(contents.getInputStreamAsString());
(...)

This gives the following result, which looks like something you'd expect, based on the PDF spec.

q
/GS0 gs
/Fm0 Do
Q
/Span <</Lang (en-US)/MCID 88 >>BDC 
BT
/CS0 cs 0 0 0  scn
/GS1 gs
/T1_0 1 Tf
8.5 0 0 8.5 70.8661 576 Tm
(This page has been intentionally left blank.)Tj
ET
EMC 
1 1 1  scn
/GS0 gs
22.677 761.102 28.346 32.599 re
f
/Span <</Lang (en-US)/MCID 89 >>BDC 
BT
0.531 0.53 0.528  scn
/T1_1 1 Tf
9 0 0 9 45.7136 761.1024 Tm
(2)Tj
ET
EMC 
q
0 g
/Fm1 Do
Q

What I'm looking for is a way to extract the PDF TextObjects (as described in section 5.3 of the PDF spec) on the page as Java objects, so basically the pieces between BT and ET (two of them on this page). Each should contain at least everything between the parentheses preceding 'Tj' as a String, and an x and y coördinate based on the 'Tm' (or a 'Td' operator, etc.). Other attributes would be a bonus, but are not required.

The PDFTextStripper seems to give me either each character with attributes as a TextPosition (too much noise for my purpose), or all the Text as one long String.

Does PDFBox have a feature that parses a Page and provides TextObjects like this that I missed? Or else, if I am to extend PDFBox to get what I need, where should I start? Any help is welcome.

EDIT: Found another question here, that gives inspiration on how I might build what I need. If I succeed, I'll check back. Still looking forward to any help you may have, though.

Thanks,

Phil

The best you'll get with PDFBox are the tokens returned by PDFStreamParser. Not exactly the text object but a collection of operations from which you can isolate the text object. — mkl
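
A minimal sketch of that idea, using only the Java standard library so it runs without PDFBox: group each operator with the operands that precede it, then keep the groups found between BT and ET. The class `TextObjectSlicer`, its `Operation` type and the whitespace tokenizer are hypothetical stand-ins for illustration, not PDFBox API. The tokenizer ignores escapes, nested parentheses, arrays and dictionaries, which PDFStreamParser handles properly.

```java
import java.util.ArrayList;
import java.util.List;

// Stdlib-only illustration; none of these names come from PDFBox.
class TextObjectSlicer {

    /** An operator together with the operands that preceded it. */
    static final class Operation {
        final String operator;
        final List<String> operands;

        Operation(String operator, List<String> operands) {
            this.operator = operator;
            this.operands = operands;
        }
    }

    /** Splits on whitespace but keeps "(...)" literals whole (no escapes, no nesting). */
    static List<String> tokenize(String stream) {
        List<String> tokens = new ArrayList<String>();
        int i = 0;
        while (i < stream.length()) {
            char c = stream.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (c == '(') {
                int end = stream.indexOf(')', i);
                if (end < 0) {
                    throw new IllegalArgumentException("unterminated string literal");
                }
                tokens.add(stream.substring(i, end + 1));
                i = end + 1;
            } else {
                int start = i;
                while (i < stream.length()
                        && !Character.isWhitespace(stream.charAt(i))
                        && stream.charAt(i) != '(') {
                    i++;
                }
                tokens.add(stream.substring(start, i));
            }
        }
        return tokens;
    }

    /** Assumption: operands are strings, names and numbers; anything else is an operator. */
    static boolean isOperand(String token) {
        char c = token.charAt(0);
        return c == '(' || c == '/' || c == '-' || c == '+' || c == '.'
                || Character.isDigit(c);
    }

    /** Returns one list of operations per BT ... ET block, excluding BT and ET themselves. */
    static List<List<Operation>> textObjects(String stream) {
        List<List<Operation>> result = new ArrayList<List<Operation>>();
        List<String> operands = new ArrayList<String>();
        List<Operation> current = null; // non-null only between BT and ET
        for (String token : tokenize(stream)) {
            if (isOperand(token)) {
                operands.add(token);
                continue;
            }
            if (token.equals("BT")) {
                current = new ArrayList<Operation>();
            } else if (token.equals("ET")) {
                if (current != null) {
                    result.add(current);
                }
                current = null;
            } else if (current != null) {
                current.add(new Operation(token, operands));
            }
            // Operands never carry over to the next operator.
            operands = new ArrayList<String>();
        }
        return result;
    }
}
```

Calling `TextObjectSlicer.textObjects(...)` on a simplified content stream yields one list per text object, and the `Tm` and `Tj` operations can then be picked out by operator name. With PDFBox itself, the tokens from PDFStreamParser would take the place of `tokenize`.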

Phil · Accepted Answer · 2014-08-21T13:20:55

Based on the linked question and the hint by mkl yesterday (thanks!), I've decided to build something to parse the tokens. Something to consider is that within a PDF text object, the operands (called attributes in the code below) precede their operator, so I gather them in a collection until I encounter the operator. Then, once I know which operator they belong to, I move them to their proper locations. This is what I've come up with:

import java.io.File;
import java.util.List;

import org.apache.pdfbox.pdfparser.PDFStreamParser;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFOperator;

public class TextExtractor {
    public static void main(String[] args) { 
        try {
            File input = new File("C:\\some\\file.pdf");
            PDDocument document = PDDocument.load(input);
            List allPages = document.getDocumentCatalog().getAllPages();
            // just parsing the page at index 2 (the third page, as the list is zero-based); it's only a sample
            PDPage page = (PDPage) allPages.get(2);
            PDStream contents = page.getContents();
            PDFStreamParser parser = new PDFStreamParser(contents.getStream());
            parser.parse();  
            List tokens = parser.getTokens();  
            boolean parsingTextObject = false; //boolean to check whether the token being parsed is part of a TextObject
            PDFTextObject textobj = new PDFTextObject();
            for (int i = 0; i < tokens.size(); i++)  
            {  
                Object next = tokens.get(i); 
                if (next instanceof PDFOperator)  {
                    PDFOperator op = (PDFOperator) next;  
                    switch(op.getOperation()){
                        case "BT":
                            //BT: Begin Text. 
                            parsingTextObject = true;
                            textobj = new PDFTextObject();
                            break;
                        case "ET":
                            parsingTextObject = false;
                            System.out.println("Text: " + textobj.getText() + "@" + textobj.getX() + "," + textobj.getY());
                            break;
                        case "Tj":
                            textobj.setText();
                            break;
                        case "Tm":
                            textobj.setMatrix();
                            break;
                        default:
                            //System.out.println("unsupported operation " + op.getOperation());
                    }
                    textobj.clearAllAttributes();
                }
                else if (parsingTextObject)  {
                    textobj.addAttribute(next);
                }
            }
            document.close(); // release the file once all tokens have been processed
        } catch (Exception e) {
            e.printStackTrace();
        } 
    }
}

In combination with:

import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.cos.COSFloat;
import org.apache.pdfbox.cos.COSInteger;
import org.apache.pdfbox.cos.COSString;

class PDFTextObject{
    private List<Object> attributes = new ArrayList<Object>();
    private String text = "";
    private float x = -1;
    private float y = -1;

    public void clearAllAttributes(){
        attributes = new ArrayList<Object>();
    }

    public void addAttribute(Object anAttribute){
        attributes.add(anAttribute);
    }

    public void setText(){
        //Move the contents of the attributes to the text attribute.
        for (int i = 0; i < attributes.size(); i++){
            if (attributes.get(i) instanceof COSString){
                COSString aString = (COSString) attributes.get(i);
                text = text + aString.getString();
            }
            else {
                System.out.println("Whoops! Wrong type of property...");
            }
        }
    }

    public String getText(){
        return text;
    }

    public void setMatrix(){
        //Move the contents of the attributes to the x and y attributes.
        //A Matrix has 6 attributes, the last two of which are x and y
        for (int i = 4; i < attributes.size(); i++){
            float curval = -1;
            if (attributes.get(i) instanceof COSInteger){
                COSInteger aCOSInteger = (COSInteger) attributes.get(i); 
                curval = aCOSInteger.floatValue();

            }
            if (attributes.get(i) instanceof COSFloat){
                COSFloat aCOSFloat = (COSFloat) attributes.get(i);
                curval = aCOSFloat.floatValue();
            }
            switch(i) {
                case 4:
                    x = curval;
                    break;
                case 5:
                    y = curval;
                    break;
            }
        }
    }

    public float getX(){
        return x;
    }

    public float getY(){
        return y;
    }
}

It gives the output:

Text: This page has been intentionally left blank.@70.8661,576.0
Text: 2@45.7136,761.1024

While it does the trick, I'm sure I've broken some conventions and haven't always written the most elegant code. Improvements and alternate solutions are welcome.
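
One possible improvement, since the question also mentions Td: setMatrix above only covers Tm, so a text object positioned with Td would keep x and y at -1. The sketch below tracks the text line matrix for both operators, following the PDF spec's definitions (BT resets the matrix to the identity; Td offsets the start of the line by (tx, ty) in text space). `TextLineMatrix` is a hypothetical helper, not part of PDFBox. It ignores TD, T*, ' and " as well as the current transformation matrix, so the result is only a page position when the CTM is the identity.

```java
// Hypothetical helper for illustration; not part of PDFBox.
class TextLineMatrix {
    // Text line matrix [a b c d e f]; e and f hold the translation.
    private float a = 1, b = 0, c = 0, d = 1, e = 0, f = 0;

    /** BT: the matrix starts out as the identity in every text object. */
    void beginText() {
        a = 1; b = 0; c = 0; d = 1; e = 0; f = 0;
    }

    /** Tm: replaces the matrix with the six operands. */
    void setMatrix(float a, float b, float c, float d, float e, float f) {
        this.a = a; this.b = b; this.c = c;
        this.d = d; this.e = e; this.f = f;
    }

    /** Td: moves the line start by (tx, ty), scaled by the current matrix. */
    void moveBy(float tx, float ty) {
        e = tx * a + ty * c + e;
        f = tx * b + ty * d + f;
    }

    float getX() {
        return e;
    }

    float getY() {
        return f;
    }
}
```

In the token loop, the "BT" case would call `beginText()`, "Tm" would pass its six operands to `setMatrix(...)`, and a new "Td" case would pass its two operands to `moveBy(...)`. The position in effect when a "Tj" is seen is then `getX()` and `getY()`.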
