31
votes

Is possible to use iTextSharp to remove from a PDF document objects that are not visible (or at least not being displayed)?

More details:

1) My source is a PDF page containing images and text (maybe some vectorial drawings) and embedded fonts.

2) There's an interface to design multiple 'crop boxes'.

3) I must generate a new PDF that contains only what is inside the crop boxes. Anything else must be removed from resulting document (indeed I may accept content which is half inside and half outside, but this is not the ideal and it should not appear anyway).

My solution so far:

I have successfully developed a solution that creates new temporary documents, each one containing the content of each crop box (using writer.GetImportedPage and contentByte.AddTemplate to a page that is exactly the size of the crop box). Then I create the final document and repeat the process, using the AddTemplate method do position each "cropped page" in the final page.

This solution has 2 big disadvantages:

  • the size of the document is the [original size] * [number of crop boxes], since the entire page is there, stamped many times! (invisible, but it's there)
  • the invisible text may still be accessed by selecting all (CTRL+A) within Reader and pasted.

So, I think I need to iterate through PDF objects, detect if it is visible or not, and delete it. At the time of writing, I am trying to use pdfReader.GetPdfObject.

Thanks for the help.

5
As iText provides a low level API which allows you to manipulate nearly everything in a document, it is possible. That is not to say that it is easy, though, as you will have to write the code yourself to identify for each element in the page content whether or not it is visible, and you will have to glue together the remaining parts of the content yourself, too. You can reduce the resulting document size in your current solution, though, if you reuse an imported page template if multiple sections of it are to be made visible. Interesting work for many weeks...mkl
Try using the PdfStamper class for cropping: itextpdf.com/examples/iia.php?id=231Markus Palme
I'm not a 100 percent on this as far as iTextSharp is concerned but iPdfSharp has the ability to render from forms. the idea is that you open your page, that you are cropping, inside a form and then render out only the parts you need into a new document. You will not be making multiple copies and the rendered (cropped) parts will be images. Try to see if this is an option under IText api.Alex
Due to time restrictions, I decided to use another PDF framework to accomplish what I need. For that I used the AmyUni PDF Creator .NET, a simple yet nice library. It has it`s own bugs though, but I'm interacting with them to solve.Hetote
Have you looked at ABCPdf? If I'm correct it can do exactly what you want to do, and pricing is about the same as the AmyUni lics.Peter R

5 Answers

1
votes

If the PDF which you are trying is a template/predefined/fixed then you can remove that object by calling RemoveField.

PdfReader pdfReader = new PdfReader(../Template_Path.pdf"));
PdfStamper pdfStamperToPopulate = new PdfStamper(pdfReader, new FileStream(outputPath, FileMode.Create));
AcroFields pdfFormFields = pdfStamperToPopulate.AcroFields;
pdfFormFields.RemoveField("fieldNameToBeRemoved");
1
votes
PdfReader pdfReader = new PdfReader(../Template_Path.pdf"));
PdfStamper pdfStamperToPopulate = new PdfStamper(pdfReader, new FileStream(outputPath, FileMode.Create));
AcroFields pdfFormFields = pdfStamperToPopulate.AcroFields;
pdfFormFields.RemoveField("fieldNameToBeRemoved");
1
votes

Yes, it's possible. You need to parse pdf page content bytes to PdfObjects, store them to the memory, delete unvanted PdfObject's, build Pdf content from PdfObject's back to pdf content bytes, replace page content in PdfReader just before you import the page via PdfWriter.

I would recommend you to check out this: http://habjan.blogspot.com/2013/09/proof-of-concept-converting-pdf-files.html

Sample from the link implements Pdf content bytes parsing, building back from PdfObjec's, replacing PdfReader page content bytes...

1
votes

Here is three solutions I found, if it can help someone (using iTextSharp, Amyuni or Tracker-Software, as @Hetote said in the comments he was looking for another library):

Using iTextSharp

As answered by @martinbuberl in another question:

public static void CropDocument(string file, string oldchar, string repChar)
{
    int pageNumber = 1;
    PdfReader reader = new PdfReader(file);
    iTextSharp.text.Rectangle size = new iTextSharp.text.Rectangle(
    Globals.fX,
    Globals.fY,
    Globals.fWidth,
    Globals.fHeight);
    Document document = new Document(size);
    PdfWriter writer = PdfWriter.GetInstance(document,
    new FileStream(file.Replace(oldchar, repChar),
    FileMode.Create, FileAccess.Write));
    document.Open();
    PdfContentByte cb = writer.DirectContent;
    document.NewPage();
    PdfImportedPage page = writer.GetImportedPage(reader,
    pageNumber);
    cb.AddTemplate(page, 0, 0);
    document.Close();
}

Another answer by @rafixwpt in his question, but it doesn't remove the invisible elements, it cleans an area of the page, which can affect other parts of the page:

static void textsharpie()
{
    string file = "C:\\testpdf.pdf";
    string oldchar = "testpdf.pdf";
    string repChar = "test.pdf";
    PdfReader reader = new PdfReader(file);
    PdfStamper stamper = new PdfStamper(reader, new FileStream(file.Replace(oldchar, repChar), FileMode.Create, FileAccess.Write));
    List<PdfCleanUpLocation> cleanUpLocations = new List<PdfCleanUpLocation>();
    cleanUpLocations.Add(new PdfCleanUpLocation(1, new iTextSharp.text.Rectangle(0f, 0f, 600f, 115f), iTextSharp.text.BaseColor.WHITE));
    PdfCleanUpProcessor cleaner = new PdfCleanUpProcessor(cleanUpLocations, stamper);
    cleaner.CleanUp();
    stamper.Close();
    reader.Close();
}

Using Amyuni

As answered by @yms in another question:

IacDocument.GetObjectsInRectangle Method

The GetObjectsInRectangle method gets all the objects that are in the specified rectangle.

Then you can iterate all the objects in the page and delete those that you are not interested in:

//open a pdf document
document.Open(testfile, "");
IacPage page1 = document.GetPage(1);
Amyuni.PDFCreator.IacAttribute attribute = page1.AttributeByName("Objects");

// listObj is an array list of graphic objects
System.Collections.ArrayList listobj = (System.Collections.ArrayList) attribute.Value.Cast<IacObject>();;

// listObjToKeep is an array list of graphic objects inside a rectangle
var listObjToKeep = document.GetObjectsInRectangle(0f, 0f, 600f, 115f,  IacGetRectObjectsConstants.acGetRectObjectsIntersecting).Cast<IacObject>();
foreach (IacObject pdfObj in listObj.Except(listObjToKeep))
{
   // if pdfObj is not in visible inside the rectangle then call pdfObj.Delete();
   pdfObj.Delete(false);
}

As said by @yms in the comments, another solution using the new method IacDocument.Redact in version 5.0 can also be used to delete all the objects in the specified rectangle and draw a solid color rectangle at their place.

Using Tracker-Software Editor SDK

I didn't try it but it seems possible, see this post.

0
votes

Have you tried using an IRenderListener? You can selectively add only those elements to the new pdf which fall within the crop regions by examining the StartPoint and EndPoint or Area of the TextRenderInfo or ImageRenderInfo objects.