Disclaimer:

I am using iText 5. I know this is generally frowned upon (vs. using iText 7), but I am working with considerable legacy code that uses iText 5, and upgrading does not fall under my control.

Requirements:

  • A "simple" PDF/A is received as input (text only, these are generated from RTF), as well as a float value corresponding to a desired first page length in inches.
  • A PDF/A must be output that is identical to the input PDF, except it is paginated as follows: first page length = input value; each subsequent (not first or last) page will fill a standard page length; the last page will be truncated a constant number of points below the content nearest the bottom of the page. Note that input and output width will be identical and constant.

Progress / Approach:

I have extended the SimpleTextExtractionStrategy to generate XML containing font information (size and family, bold or italics, etc.) as well as location information (relative to an absolute coordinate system where the origin is at the top left corner of the first page of the input PDF) for each "span" of text extracted from the input PDF.

I then generate a new PDF page by page (where each page is the desired length according to the requirements outlined above), filtering the extracted XML info with LINQ based on the bounds of each new page, and adding appropriately formatted text at the appropriate location using ColumnText.ShowTextAligned(...).

Problem:

The approach outlined above mostly works: it generates PDFs with the desired page structure, but some information is lost in translation, namely colored text and underlined text. While colored text is not expected in these PDFs, underlined text absolutely must be detected.

This set of requirements should also include PDFs with tables. I originally planned on implementing a different module that adheres to the same interface for table PDFs, as these are generated and used separately from the PDFs generated from RTF, and iText has relatively strong table functionality built in.

The two concerns outlined above, coupled with the fact that my described approach was born out of an attempt to reuse existing code, lead me to believe that an entirely different approach may be necessary or at least much better. It seems to me that there should be a way to capture content byte info and clip it as necessary to "re-paginate" the input PDF, only worrying about moving content that falls along a page boundary.

Essentially, I am looking for (iText based) recommendations for a better approach. Pseudo-code type answers or simply recommendations for classes / interfaces that may help are acceptable. While it would be nice to handle text and tables together, any advice pertinent to one or the other would also be appreciated. I have perused much of the available documentation on the iText website and other SO questions, but have not found quite what I'm looking for.

Note that no code is included in this question as I am looking for a high-level approach that is entirely different from what I have tried.

Edit:

I didn't notice it before, but the way in which I was reusing fonts (similar to this) resulted in some unexpected (but documented as such) behavior. It seems that I will need to avoid extracting information for re-pagination at the text level, as it will be difficult to ensure continuity of fonts between input and output.

Is resizing the page enough or do you want to reflow the content, e.g. if the first page becomes big enough to fit the first paragraph of the second page? The answer (and difficulty) varies based on this requirement. - Michaël Demey
Also, if your company has been distributing this application for a long time, either you are also distributing the full source code (AGPL use) or you have a commercial license with iText Software (in case the source code is closed source). If the former, please show us the full source code; if the latter, please contact support at iText Software. - Bruno Lowagie
Where do you get the impression that there is some kind of "distribution"? His company might be using the software only internally and thus not distributing it, in which case neither applies. That would be a third case in addition to your two. - Lonzak
@MichaëlDemey The first page will only ever become shorter; content from the second page will never fit on the first. Content will need to flow from the first page to the second. - user8061994
@BrunoLowagie Lonzak is correct, this application is used strictly internally and will only ever be used internally. - user8061994

1 Answer

I solved this problem a while ago, but figured I would post my solution. I'm sure it's not the most efficient solution, but it works well for my purposes. Note that this will re-paginate a PDF as described in the question only if it contains text only. Table PDFs are handled separately.

The basic process is this:

  1. Use a custom TextExtractionStrategy to extract XML containing information regarding ascent and descent lines for all text in the input PDF, as well as what page it originally appears on.
  2. Given the page length requirements as described in the question (first page = input value, subsequent = standard length, last page = fit content) and the XML info regarding text positions, determine what content will fit on each page of the output PDF. Create a map of where each input page will need to be cropped (top and bottom, note that each input page may be cropped more than once), as well as a map of which cropped pages will need to be "concatenated" together in the final output.
  3. Copy the input PDF page by page to an intermediate temporary PDF (using PdfCopy). If an input page must be cropped more than once (e.g. first 2 inches of input page 1 = page 1 output, next 6 inches of input page 1 = page 2 output, final 0.5 inch of input page 1 = top of page 3 output), ensure that it is copied the appropriate number of times (once per crop).
  4. Crop each page of the intermediate copied PDF appropriately. This is done by modifying the MediaBox and / or CropBox.
  5. Concatenate the appropriate cropped pages together into the final output PDF's pages. I used a PdfWriter to first create a new page of the appropriate height, then added each cropped page at the appropriate position in the output PDF page's byte content using contentByte.AddTemplate(inputCroppedPage, 0, bottomOfLastAddedCroppedPage).
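
As a rough illustration of the planning in step 2 (this is not the original code, which is not shared here), the crop map could be computed along these lines. The sketch is plain Python with no iText calls; the names `plan_slices`, `safe_break`, and `Slice` are hypothetical, and it assumes text positions are given as (top, bottom) distances in points from the top of the first input page, with all input pages stacked into one continuous strip.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slice:
    """One crop of an input page and the output page it belongs to (hypothetical type)."""
    input_page: int    # 0-based index of the input page
    top: float         # crop start, in points from the top of the input page
    bottom: float      # crop end, in points from the top of the input page
    output_page: int   # 0-based index of the output page


def safe_break(target, lines):
    """Move a page break upward until it no longer cuts through any text line."""
    moved = True
    while moved:
        moved = False
        for top, bottom in lines:
            if top < target < bottom:
                target = top
                moved = True
    return target


def plan_slices(page_heights, lines, first_len, standard_len, bottom_margin):
    """Return the crops needed to re-paginate the stacked input pages.

    page_heights: heights of the input pages in points
    lines: (top, bottom) of each text line, in points from the top of page 1
    first_len / standard_len: heights of the first and later output pages
    bottom_margin: points kept below the lowest content on the last page
    """
    total = float(sum(page_heights))
    lowest = max((bottom for _, bottom in lines), default=0.0)
    content_end = min(total, lowest + bottom_margin) if lines else 0.0

    # 1. Decide where each output page starts and ends in strip coordinates.
    ranges = []
    start, capacity = 0.0, first_len
    while start < content_end:
        end = min(start + capacity, content_end)
        if end < content_end:
            end = safe_break(end, lines)
            if end <= start:
                raise ValueError("a text line is taller than the output page")
        ranges.append((start, end))
        start, capacity = end, standard_len

    # 2. Split each output range at input page boundaries (one crop per piece).
    slices = []
    for out_idx, (r_top, r_bottom) in enumerate(ranges):
        page_top = 0.0
        for in_idx, height in enumerate(page_heights):
            page_bottom = page_top + height
            top, bottom = max(r_top, page_top), min(r_bottom, page_bottom)
            if top < bottom:
                slices.append(Slice(in_idx, top - page_top, bottom - page_top, out_idx))
            page_top = page_bottom
    return slices
```

Each resulting `Slice` corresponds to one copy of an input page (step 3) and one crop (step 4); slices sharing the same `output_page` are the ones concatenated in step 5. Note that the sketch measures distances from the top of the page, whereas PDF boxes are normally expressed from the bottom, so a conversion such as `page_height - bottom` would be needed before setting a MediaBox or CropBox.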

To anyone who managed to read and understand all of that, congratulations. To anyone else, please let me know if you are confused. The solution described above is a little twisted and tough to put into words. While there is too much code to post here (and I am not at liberty to share the code on GitHub or similar), I would be happy to answer any questions that will help someone else implement something similar.

The TextExtractionStrategy mentioned in step 1 was inspired by this answer. Essentially, I used System.Xml.Linq to create an XML document rather than concatenating strings to form HTML, and I ignored any font information, storing only information regarding where text is located in the page (you'll see that this information is available in the linked answer; it just isn't written into the final HTML).
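
To give a feel for how such position-only XML might be consumed, here is a minimal Python sketch. The XML layout is purely an assumption for illustration (a `span` element with `page`, `ascent`, and `descent` attributes, the latter two being y-coordinates measured from the bottom of the page); the real document structure is not shown in this answer, and the function name `text_bands` is hypothetical. It converts each span into a (top, bottom) band measured from the top of the first page and merges overlapping bands so that spans on the same line are treated as one unit.

```python
import xml.etree.ElementTree as ET


def text_bands(xml_text, page_heights):
    """Convert per-page ascent/descent values into merged vertical bands.

    Assumed (hypothetical) input: <spans><span page="1" ascent=".." descent=".."/></spans>
    where page is 1-based and ascent/descent are measured from the page bottom.
    Returns sorted (top, bottom) tuples in points from the top of the first page.
    """
    # Offset of the top of each page within the stacked strip of pages.
    offsets, running = [], 0.0
    for height in page_heights:
        offsets.append(running)
        running += height

    bands = []
    for span in ET.fromstring(xml_text).iter("span"):
        page = int(span.get("page")) - 1
        height = page_heights[page]
        # Flip from bottom-up page coordinates to top-down strip coordinates.
        top = offsets[page] + (height - float(span.get("ascent")))
        bottom = offsets[page] + (height - float(span.get("descent")))
        bands.append((top, bottom))

    # Merge bands that overlap or touch, e.g. several spans on one text line.
    merged = []
    for top, bottom in sorted(bands):
        if merged and top <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], bottom))
        else:
            merged.append((top, bottom))
    return merged
```

The merged bands tell a pagination step where a page break can be placed without cutting through a line of text.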