I'm extracting text from a PDF document. The PDF was generated by a WS that reads data from an AS400. When I print the extracted text, the output looks like this:
Lorem ipsum dolor sit amet, **«VS123»** In eros risus, «VS124» sed felis quis, commodo interdum tellus. Donec vitae massa
«VS123» and «VS124» are variables from the AS400. The Java API is not able to read the values of these variables, so it prints the variable names instead of their values.
I'm using PDFBox (https://pdfbox.apache.org/) to extract the text. The source code looks like this:
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class App
{
    public static void main( String[] args ) throws IOException
    {
        // try-with-resources closes the document automatically
        try (PDDocument document = PDDocument.load(new File("C:/my.pdf"))) {
            if (!document.isEncrypted()) {
                PDFTextStripper tStripper = new PDFTextStripper();
                String pdfFileInText = tStripper.getText(document);
                // split the extracted text into lines
                String[] lines = pdfFileInText.split("\\r?\\n");
                for (String line : lines) {
                    System.out.println(line);
                }
            }
        }
    }
}
The output starts with this series of warnings (AVERTISSEMENT is the French-locale label for the WARNING log level):
AVERTISSEMENT: Invalid ToUnicode CMap in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+77 (77) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+111 (111) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+110 (110) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+116 (116) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+97 (97) in font ArialMT
nov. 16, 2017 8:08:24 PM org.apache.pdfbox.pdmodel.font.PDType0Font toUnicode
AVERTISSEMENT: No Unicode mapping for CID+32 (32) in font ArialMT
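One detail in these warnings may be worth checking: the CIDs reported as unmapped (77, 111, 110, 116, 97, 32) all fall in the printable ASCII range. The sketch below pulls the CID numbers out of the log text and, purely as a diagnostic guess, reads them as ASCII codes. The class `CidLogDecoder` and its methods are hypothetical helpers, not part of PDFBox, and the idea that a CID equals a character code is an assumption that may not hold for this font.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of PDFBox.
class CidLogDecoder {
    // Matches the "CID+77 (77)" fragment of the "No Unicode mapping" warnings.
    private static final Pattern CID = Pattern.compile("CID\\+(\\d+) \\((\\d+)\\)");

    // Collects the CID numbers in the order they appear in the log text.
    static List<Integer> extractCids(String log) {
        List<Integer> cids = new ArrayList<>();
        Matcher m = CID.matcher(log);
        while (m.find()) {
            cids.add(Integer.parseInt(m.group(1)));
        }
        return cids;
    }

    // ASSUMPTION: each CID happens to be identical to an ASCII code.
    // This is only a guess used for diagnosis, not a rule of the PDF format.
    static String decodeAsAscii(List<Integer> cids) {
        StringBuilder sb = new StringBuilder();
        for (int cid : cids) {
            // Keep printable ASCII, mark everything else with '?'.
            sb.append(cid >= 32 && cid < 127 ? (char) cid : '?');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The six "No Unicode mapping" messages quoted in the question.
        String log = "No Unicode mapping for CID+77 (77) in font ArialMT\n"
                + "No Unicode mapping for CID+111 (111) in font ArialMT\n"
                + "No Unicode mapping for CID+110 (110) in font ArialMT\n"
                + "No Unicode mapping for CID+116 (116) in font ArialMT\n"
                + "No Unicode mapping for CID+97 (97) in font ArialMT\n"
                + "No Unicode mapping for CID+32 (32) in font ArialMT\n";
        List<Integer> cids = extractCids(log);
        System.out.println(cids);
        // Brackets make a trailing space visible.
        System.out.println("[" + decodeAsAscii(cids) + "]");
    }
}
```

For the six warnings quoted above, this reading gives `Monta` followed by a space, which suggests the glyph codes may still carry readable text even though the ToUnicode CMap is invalid. It says nothing yet about where the «VS…» placeholders come from.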
I also tried to extract the text using iText:
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfTextExtractor;
import java.io.IOException;

public class App {
    private static final String FILE_NAME = "C:/my.pdf";

    public static void main(String[] args) {
        PdfReader reader = null;
        try {
            reader = new PdfReader(FILE_NAME);
            // extract the text of the first page only
            String textFromPage = PdfTextExtractor.getTextFromPage(reader, 1);
            System.out.println(textFromPage);
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // release the file even if extraction fails
            if (reader != null) {
                reader.close();
            }
        }
    }
}
Here is the relevant part of the PDF document:
When I try to extract the text, or simply copy and paste it from the PDF, the output is this:
CLIENT N° «VS35» « VS36 » CONTRAT N° «VS28»
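To check how many variables are left unresolved in the extracted text, something like the following could be run on each line. It is a minimal sketch: `PlaceholderFinder` is a hypothetical helper, and the pattern assumes that every placeholder has the form «VS» plus digits, optionally padded with ordinary or non-breaking spaces, as in the samples above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for spotting unresolved placeholders in extracted text.
class PlaceholderFinder {
    // ASSUMPTION: placeholders are guillemets around "VS" plus digits,
    // optionally padded with spaces or non-breaking spaces.
    // Unicode escapes keep the source independent of the file encoding.
    private static final Pattern PLACEHOLDER =
            Pattern.compile("\u00AB[\\s\\u00A0]*(VS\\d+)[\\s\\u00A0]*\u00BB");

    // Returns the variable names found in the text, in order of appearance.
    static List<String> find(String text) {
        List<String> names = new ArrayList<>();
        Matcher m = PLACEHOLDER.matcher(text);
        while (m.find()) {
            names.add(m.group(1));
        }
        return names;
    }

    public static void main(String[] args) {
        // The extracted line from the question, written with Unicode escapes.
        String line = "CLIENT N\u00B0 \u00ABVS35\u00BB \u00AB VS36 \u00BB "
                + "CONTRAT N\u00B0 \u00ABVS28\u00BB";
        System.out.println(find(line)); // prints [VS35, VS36, VS28]
    }
}
```

An empty list for a page would mean that no placeholder of this form survived in its extracted text.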
The link to the PDF File: https://drive.google.com/file/d/1RNea028nCReIVS8nRWNlBwUwBsDOhDYg/view?usp=sharing