I've located a region of interest in the page by tracking TextPosition objects using PDFTextStripper as shown in the example: https://github.com/apache/pdfbox/blob/trunk/examples/src/main/java/org/apache/pdfbox/examples/util/PrintTextLocations.java

As shown there, each TextPosition is read through methods like text.getXDirAdj(), text.getWidthDirAdj(), text.getYDirAdj() and text.getHeightDir().

Starting from the following example, I kept everything the same except that I set the cropBox of the target page:

https://github.com/apache/pdfbox/blob/2.0.3/tools/src/main/java/org/apache/pdfbox/tools/PDFToImage.java

OLD CROPBOX: [0.0,0.0,595.276,841.89] -> NEW CROPBOX [50.0,42.0,592.0,642.0].

So how can I use getYDirAdj and getXDirAdj to set the cropBox correctly?

The original pdf file I'm processing can be downloaded from here: http://downloadcenter.samsung.com/content/UM/201504/20150407095631744/ENG-US_NMATSCJ-1.103-0330.pdf

Please show the code you used and post a link to the source and result PDF. - Tilman Hausherr

Please be aware that the PDFBox text extraction favors a different coordinate system from the default user space coordinate system used for the page boxes. To retrieve user space coordinates, use getTextMatrix().getTranslateX() and getTextMatrix().getTranslateY(). - mkl

I've seen how, from a BufferedImage, one can create a Graphics2D object, rotate it according to the affine transformation, and then draw objects on it. I managed to manually fiddle with the cropBox coordinates and got some parts of the image visible, but I still cannot figure out what transformation I need to apply to the rectangle containing the text I want extracted so that it matches the cropBox. PDRectangle is a beast. Secondly, it's not very clear to me when I need to transform the coordinates and what procedure takes me from one realm to the other. I would appreciate it if you could clarify this. - Dr. Vick

"All the pages end up white even thought the new crop box" - they are not white. I suspect your link is the original file. - Tilman Hausherr

Regarding all the transformations, have a look at the DrawPrintTextLocations.java example in the source code download. - Tilman Hausherr

1 Answer

Cropping the page

In a comment the OP reduced his problem to

Ok. Given a PDRectangle rect = new PDRectangle(40f, 680f, 510f, 100f) obtained from TextLocation, what would a Java code snippet that sets the cropBox of a single page look like? Or how would you do it? TextLocation based rect --> some transformation --> setCropBox(theRightBox).

To set the crop box of page twelve of the given document to the given PDRectangle, you can use code like this:

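// PDFBox 2.x; resource is an InputStream of the source PDF
PDDocument pdDocument = PDDocument.load(resource);
PDPage page = pdDocument.getPage(12 - 1); // page indices are zero-based
page.setCropBox(new PDRectangle(40f, 680f, 510f, 100f)); // x, y, width, height
pdDocument.save(new File(RESULT_FOLDER, "ENG-US_NMATSCJ-1.103-0330-page12cropped.pdf"));
pdDocument.close();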

(SetCropBox.java test method testSetCropBoxENG_US_NMATSCJ_1_103_0330)

Adobe Reader now shows merely this part of page twelve:

Screenshot

Beware, though: the page in question does not only specify a media box (mandatory) and a crop box, it also defines a bleed box and an art box. Thus, applications that consider those boxes more interesting than the crop box might display the page differently. In particular, some applications might consider the art box (defined as "the extent of the page’s meaningful content") important.
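
For the "some transformation" step in the OP's question, here is a minimal sketch of the usual y-axis flip. It assumes an unrotated page whose crop box starts at (0,0), and that the rectangle was measured from the top of the page, as the values from getYDirAdj() appear to be. The class and method names are hypothetical, and the code deliberately avoids PDFBox types so that it runs on its own.

```java
import java.util.Arrays;

// Hypothetical helper, not part of PDFBox.
final class CropBoxCoords {

    /**
     * Converts a rectangle given in top-left-origin coordinates (y grows downwards)
     * to lower-left-origin user space values {x, y, width, height}.
     * Assumption: unrotated page, crop box origin at (0,0).
     */
    static float[] toUserSpace(float x, float yTop, float width, float height, float pageHeight) {
        // The lower edge of the rectangle lies (yTop + height) below the top of the page.
        float lowerLeftY = pageHeight - (yTop + height);
        return new float[] { x, lowerLeftY, width, height };
    }

    public static void main(String[] args) {
        // Page height 841.89 as in the OP's original crop box.
        float[] r = toUserSpace(40f, 61.89f, 510f, 100f, 841.89f);
        System.out.println(Arrays.toString(r));
    }
}
```

The four values can then be passed to new PDRectangle(r[0], r[1], r[2], r[3]) and on to page.setCropBox(...). If the rectangle is already in user space coordinates, as the one used above seems to be, no flip is needed.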

Rendering the cropped page

In a comment to this answer the OP remarked

This is good and works. It correctly saves the page in the PDF file. I've tried to do the same in JPG and failed.

I reduced the OP's code to the essentials:

PDDocument pdDocument = PDDocument.load(resource);
PDPage page = pdDocument.getPage(12-1);
page.setCropBox(new PDRectangle(40f, 680f, 510f, 100f));

PDFRenderer renderer = new PDFRenderer(pdDocument);
BufferedImage img = renderer.renderImage(12 - 1, 4f);
ImageIOUtil.writeImage(img, new File(RESULT_FOLDER, "ENG-US_NMATSCJ-1.103-0330-page12cropped.jpg").getAbsolutePath(), 300);
pdDocument.close();

(SetCropBox.java test method testSetCropBoxImgENG_US_NMATSCJ_1_103_0330)

The result:

Result image

Thus, I cannot reproduce an issue here.


Possible details to check for:

  • ImageIOUtil is not part of the main PDFBox artifact; it is located in pdfbox-tools. Does the version of that artifact match the version of the core pdfbox artifact?
  • I ran the code in an Oracle Java 8 environment; other Java environments might give different results.
  • There are minor differences between our implementations. E.g. I load the PDF via an InputStream while you load it directly from the file system, I have hardcoded the page number while you keep it in a variable, ... None of these differences should cause your problem, but who knows...