I have a PDF from which I extracted a page using PDFBox:
(...)
File input = new File("C:\\temp\\sample.pdf");
document = PDDocument.load(input);
List allPages = document.getDocumentCatalog().getAllPages();
PDPage page = (PDPage) allPages.get(2);
PDStream contents = page.getContents();
if (contents != null) {
System.out.println(contents.getInputStreamAsString());
(...)
This gives the following result, which looks like something you'd expect, based on the PDF spec.
q
/GS0 gs
/Fm0 Do
Q
/Span <</Lang (en-US)/MCID 88 >>BDC
BT
/CS0 cs 0 0 0 scn
/GS1 gs
/T1_0 1 Tf
8.5 0 0 8.5 70.8661 576 Tm
(This page has been intentionally left blank.)Tj
ET
EMC
1 1 1 scn
/GS0 gs
22.677 761.102 28.346 32.599 re
f
/Span <</Lang (en-US)/MCID 89 >>BDC
BT
0.531 0.53 0.528 scn
/T1_1 1 Tf
9 0 0 9 45.7136 761.1024 Tm
(2)Tj
ET
EMC
q
0 g
/Fm1 Do
Q
What I'm looking for is to extract the PDF TextObjects (as described in par 5.3 of the PDF spec) on the page as java Objects, so basically the pieces between BT an ET (two of 'en on this page). They should at least contain everything between the brackets preceding 'Tj' as a String, and an x and y coördinate based on the 'Tm' (or a 'Td' operator, etc.). Other attributes would be a bonus, but are not required.
The PDFTextStripper seems to give me either each character with attributes as a TextPosition (too much noise for my purpose), or all the Text as one long String.
Does PDFBox have a feature that parses a Page and provides TextObjects like this that I missed? Or else, if I am to extend PDFBox to get what I need, where should I start? Any help is welcome.
EDIT: Found another question here, that gives inspiration on how I might build what I need. If I succeed, I'll check back. Still looking forward to any help you may have, though.
Thanks,
Phil