0
votes

I'm extracting text from a PDF document. The PDF was generated by a WS reading data from an AS400. When I print the extracted text, the output looks like this:

Lorem ipsum dolor sit amet, **«VS123»**  In eros risus, «VS124» sed felis quis, commodo interdum tellus. Donec vitae massa

«VS123» and «VS124» are variables from the AS400. The Java API is not able to read the values of these variables and prints the variable names instead.

I'm using PDFBox (https://pdfbox.apache.org/) to extract the text. The source code is:

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDDocumentCatalog;
import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.interactive.form.PDAcroForm;
import org.apache.pdfbox.pdmodel.interactive.form.PDField;
import org.apache.pdfbox.pdmodel.interactive.form.PDNonTerminalField;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.PDFTextStripperByArea;

public class App 
{
     public static void main( String[] args ) throws IOException
        {
        try (PDDocument document = PDDocument.load(new File("C:/my.pdf"))) {

            document.getClass();

            if (!document.isEncrypted()) {

                PDFTextStripperByArea stripper = new PDFTextStripperByArea();
                stripper.setSortByPosition(true);

                PDFTextStripper tStripper = new PDFTextStripper();

                String pdfFileInText = tStripper.getText(document);

                // split the extracted text into lines
                String lines[] = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }
                document.close();
            }
        }
    }
}

The output starts with this series of warnings:

AVERTISSEMENT: Invalid ToUnicode CMap in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+77 (77) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+111 (111) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+110 (110) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+116 (116) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+97 (97) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+32 (32) in font ArialMT
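
One quick way to see what the unmapped CIDs might stand for is to treat each number as a character code. The sketch below pulls the numbers out of the warning text and appends them as code points. It assumes that the CIDs in this font happen to coincide with ASCII/Unicode values, which is not guaranteed for a Type 0 font in general; `CidPeek` and `cidsAsText` are hypothetical names used only for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CidPeek {

    // Collects every "CID+<number>" occurrence in the log text and
    // interprets the number as a Unicode code point (an assumption).
    public static String cidsAsText(String log) {
        Matcher m = Pattern.compile("CID\\+(\\d+)").matcher(log);
        StringBuilder sb = new StringBuilder();
        while (m.find()) {
            sb.appendCodePoint(Integer.parseInt(m.group(1)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String log = "No Unicode mapping for CID+77 (77) in font ArialMT\n"
                + "No Unicode mapping for CID+111 (111) in font ArialMT\n"
                + "No Unicode mapping for CID+110 (110) in font ArialMT\n"
                + "No Unicode mapping for CID+116 (116) in font ArialMT\n"
                + "No Unicode mapping for CID+97 (97) in font ArialMT\n"
                + "No Unicode mapping for CID+32 (32) in font ArialMT\n";
        // The brackets make a trailing space visible.
        System.out.println("[" + cidsAsText(log) + "]");
    }
}
```

For the six CIDs in the warnings above this gives "Monta " (with a trailing space), which suggests that the codes are plain character values and only the mapping table is unusable.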

I also tried to extract the text using iText:

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;

import java.io.IOException;

public class App {
    private static final String FILE_NAME = "C:/my.pdf";

    public static void main(String[] args) {

        PdfReader reader;

        try {

            reader = new PdfReader(FILE_NAME);

            String textFromPage = PdfTextExtractor.getTextFromPage(reader, 1);

            System.out.println(textFromPage);

            reader.close();

        } catch (IOException e) {
            e.printStackTrace();
        }

    }
}

Here is the relevant part of the PDF document:

[image]

When I try to extract the text, or copy and paste it, the output is:

CLIENT N° «VS35» « VS36 » CONTRAT N° «VS28»

The link to the PDF File: https://drive.google.com/file/d/1RNea028nCReIVS8nRWNlBwUwBsDOhDYg/view?usp=sharing

Question is not about iText, tag removed. – Amedee Van Gasse
People add the iText tag, because they consider iText to be the best PDF library and iText developers to be the best PDF experts. In spite of that knowledge, some of those people don't use iText because they don't like the idea of paying developers for the software they write or paying experts for their expertise. They probably dislike developers and experts. I guess that the AS400 codes aren't part of the content stream, but were added as annotations instead. That would explain why they are visible when viewing or printing the document, but not present when extracting the content of a page. – Bruno Lowagie
@AmedeeVanGasse I added the iText tag because I also used iText for extracting the content. – wikimix
Your code in your question does not reflect your usage of iText. – Amedee Van Gasse
Either share the file or have a look at it with Adobe Reader to see if there are annotations / comments in the file. – Tilman Hausherr

2 Answers

2
votes

The variables are rendered white in the PDF, as can be seen with PDFDebugger (excerpt from the second content stream of page 1):

BT
  /F3 9 Tf
  1 0 0 1 70.944 30.6 Tm
  1 g
  1 G
  [ (\253) ] TJ
ET
BT
  1 0 0 1 75.984 30.6 Tm
  [ (VS1) -2 (1) -3 (3) ] TJ
ET

"1 g" sets the fill color to the maximum value in /DeviceGray, which is white. So that part draws "«VS113" in white, which makes it invisible on a white page.
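
To illustrate the idea, here is a minimal sketch that scans a content stream dump like the excerpt above and collects the text shown while the fill color is white. It rests on several assumptions: one operator per line (as PDFDebugger prints it), only the "g" operator changes the fill color, "q"/"Q" state saving is ignored, and single-byte codes are read as Latin-1. `WhiteTextScan` and its methods are hypothetical names, not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class WhiteTextScan {

    // A parenthesized PDF literal string, allowing backslash escapes inside.
    private static final Pattern STRING =
            Pattern.compile("\\(((?:[^()\\\\]|\\\\.)*)\\)");
    // An octal escape such as \253.
    private static final Pattern OCTAL = Pattern.compile("\\\\([0-7]{1,3})");

    // Returns the text of all TJ operators executed while the gray fill is 1 (white).
    public static String whiteText(String contentStream) {
        StringBuilder white = new StringBuilder();
        double fillGray = 0.0; // the initial fill color in PDF is black
        for (String raw : contentStream.split("\\r?\\n")) {
            String line = raw.trim();
            if (line.matches("\\d+(\\.\\d+)? g")) {
                // The fill color persists across BT/ET until it is changed again.
                fillGray = Double.parseDouble(line.substring(0, line.length() - 2));
            } else if (line.endsWith("TJ") && fillGray >= 1.0) {
                Matcher m = STRING.matcher(line);
                while (m.find()) {
                    white.append(decodeOctal(m.group(1)));
                }
            }
        }
        return white.toString();
    }

    // Replaces octal escapes with the corresponding Latin-1 character.
    public static String decodeOctal(String s) {
        Matcher m = OCTAL.matcher(s);
        StringBuffer sb = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 8);
            m.appendReplacement(sb, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static void main(String[] args) {
        String stream = "BT\n/F3 9 Tf\n1 0 0 1 70.944 30.6 Tm\n1 g\n1 G\n"
                + "[ (\\253) ] TJ\nET\nBT\n1 0 0 1 75.984 30.6 Tm\n"
                + "[ (VS1) -2 (1) -3 (3) ] TJ\nET\n";
        System.out.println(whiteText(stream));
    }
}
```

A real implementation would work on parsed operators rather than text lines, but the sketch shows why both text blocks count as white: the second one never resets the color.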

The values come much later in the PDF... One of them appears at the end of the content stream of the XObject form (a sequence of PDF operations) "X2":

BT
  1.0 0.0 0.0 1.0 153.3 457.35144 Tm
  0.0 3.57696 Td
  0 Tr
  /DeviceRGB cs
  0.0 0.0 0.0 sc
  /TCCZPJ+ArialMT 11.04 Tf
  [ (\0003\0001\0008\000 \0009\0007\0008\000 \0000\0001\0002) ] TJ
  0.0 -3.57696 Td
ET

"0.0 0.0 0.0 sc" sets the fill color to black, and the TJ line two lines further down contains 318 978 012. This text can't be extracted due to an error reading the /ToUnicode stream. That stream should map each character code to a Unicode value, but the mapping is missing. (You may think that the values are visually obvious here, but that is not always the case.)
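
As an illustration of why the value looks "visually obvious" here, the following sketch decodes the TJ string by hand. It resolves the octal escapes of the PDF literal string, pairs the resulting bytes into two-byte codes, and then assumes that each code equals its Unicode code point. That assumption happens to fit this file but is exactly what a valid /ToUnicode stream would normally tell you; `TwoByteCodeDecoder` and its methods are hypothetical names.

```java
import java.io.ByteArrayOutputStream;

public class TwoByteCodeDecoder {

    // Resolves backslash escapes in the body of a PDF literal string
    // (given without the outer parentheses). Handles 1-3 digit octal
    // escapes and the common single-character escapes.
    public static byte[] unescape(String literal) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int i = 0;
        while (i < literal.length()) {
            char c = literal.charAt(i++);
            if (c != '\\' || i >= literal.length()) {
                out.write(c); // assumes the literal holds single-byte characters
                continue;
            }
            char e = literal.charAt(i);
            if (e >= '0' && e <= '7') {
                int value = 0;
                int digits = 0;
                while (digits < 3 && i < literal.length()
                        && literal.charAt(i) >= '0' && literal.charAt(i) <= '7') {
                    value = value * 8 + (literal.charAt(i) - '0');
                    i++;
                    digits++;
                }
                out.write(value & 0xFF);
            } else {
                i++;
                switch (e) {
                    case 'n': out.write('\n'); break;
                    case 'r': out.write('\r'); break;
                    case 't': out.write('\t'); break;
                    default:  out.write(e);    break; // covers \( \) and \\
                }
            }
        }
        return out.toByteArray();
    }

    // Pairs bytes into big-endian two-byte codes and treats each code
    // as a Unicode code point (an assumption, not a general rule).
    public static String decodeIdentity(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            int code = ((bytes[i] & 0xFF) << 8) | (bytes[i + 1] & 0xFF);
            sb.appendCodePoint(code);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String literal = "\\0003\\0001\\0008\\000 \\0009\\0007\\0008\\000 \\0000\\0001\\0002";
        System.out.println(decodeIdentity(unescape(literal))); // prints "318 978 012"
    }
}
```

Each "\000" is the high byte of a code and the character after it is the low byte, so the string carries eleven two-byte codes. With fonts whose codes are not character values, this shortcut would produce garbage.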

The only thing that is weird is that Adobe Reader gets the values.

From looking at the components of the PDF, it seems that in the first step, a PDF is generated with these "variables" printed white on white. In a second step, a second software finds these variables and prints the actual text at their place.

0
votes

AFAIK, the PDF doesn't contain variable data as displayed in the text. If there are any variables there, they might have been converted to be used by its own interactivity interface (e.g. SVG interactivity).

So when the PDF was generated, the variable names were converted to string and the actual variable data might have been renamed.