Extraction as described in the standard
The PDF specification ISO 32000-1 describes in section 9.10 Extraction of Text Content how text extraction can be done if the PDF provides the required information and does so correctly.
This algorithm, though, only works in a few page ranges of the document (namely the summaries, the content lists, the thank-yous, and the section Publicación 7); in the other ranges it results in gibberish, e.g. `8QLYHUVLWDWGH/OHLGD` instead of `Universitat de Lleida`. Looking at the PDF objects in question makes clear that the required information is missing (there is no ToUnicode map and, while the Encoding is based on WinAnsiEncoding, all positions in use are mapped via Differences to non-standard names).
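Such an inspection can also be scripted. Here is a minimal sketch using PDFBox 2.x that lists, for every font on every page, whether a ToUnicode map is present and what the Encoding entry contains; the file name is merely a placeholder:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class InspectFonts {
    public static void main(String[] args) throws IOException {
        // "document.pdf" is a placeholder for the file in question
        try (PDDocument document = PDDocument.load(new File("document.pdf"))) {
            int pageNo = 0;
            for (PDPage page : document.getPages()) {
                pageNo++;
                PDResources resources = page.getResources();
                for (COSName fontName : resources.getFontNames()) {
                    PDFont font = resources.getFont(fontName);
                    if (font == null)
                        continue;
                    // check for a ToUnicode map and dump the raw Encoding entry
                    boolean hasToUnicode = font.getCOSObject().containsKey(COSName.TO_UNICODE);
                    Object encoding = font.getCOSObject().getDictionaryObject(COSName.ENCODING);
                    System.out.printf("Page %d, font %s (%s): ToUnicode=%s, Encoding=%s%n",
                            pageNo, fontName.getName(), font.getName(), hasToUnicode, encoding);
                }
            }
        }
    }
}
```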
Also, trying to extract the text using copy & paste from Adobe Reader returns that gibberish. This generally is a sign that generic extraction is not possible.
A work-around
Inspecting the PDF objects and the output of the generic text extraction attempt, though, gives rise to the idea that the actual encoding of the text extracted as gibberish is the same for all fonts used, and that it is some ASCII-based encoding shifted by a constant: adding `'U' - '8'` to each character of the extracted `8QLYHUVLWDWGH/OHLGD` results in `Universitat de Lleida`. Adding the same constant to the characters of text extracted elsewhere in the document also results in correct text, as long as the text only uses ASCII characters.
Characters outside the ASCII range are not mapped correctly by that simple method, but each of them also always seems to be extracted as the same wrong character, e.g. the glyph 'ó' is always extracted as 'y'.
Thus, you can extract the text from this document (and similarly created ones) by first extracting the text using the standard algorithm and then, in the gibberish sections (which probably can be identified by font name), replacing each character: add `'U' - '8'` for the small values and substitute according to some mapping for the higher values, as sketched below.
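A minimal sketch of such a post-processing step (assuming the constant offset `'U' - '8'` observed above and preserving any whitespace the text extractor inserts; the replacement table only contains the 'ó'/'y' pair mentioned above and would have to be extended by comparing extracted text with the rendered pages):

```java
import java.util.Map;

public class GibberishFixer {
    // constant offset observed in the document: 'U' - '8' = 29
    private static final int SHIFT = 'U' - '8';

    // replacement table for glyphs outside the ASCII range; only the
    // 'ó' -> 'y' observation from above is included, further entries
    // have to be collected by comparing extraction results with the pages
    private static final Map<Character, Character> NON_ASCII = Map.of('y', 'ó');

    public static String fix(String gibberish) {
        StringBuilder fixed = new StringBuilder(gibberish.length());
        for (char c : gibberish.toCharArray()) {
            char shifted = (char) (c + SHIFT);
            if (Character.isWhitespace(c)) {
                fixed.append(c);                 // keep extractor-inserted whitespace
            } else if (shifted >= 0x20 && shifted <= 0x7E) {
                fixed.append(shifted);           // plain ASCII after shifting
            } else if (NON_ASCII.containsKey(c)) {
                fixed.append(NON_ASCII.get(c));  // known non-ASCII glyph
            } else {
                fixed.append(c);                 // unknown code, keep for manual review
            }
        }
        return fixed.toString();
    }
}
```

Unknown codes are deliberately kept unchanged so they stand out in the output and can be added to the replacement table later.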
As you mentioned Java in your question, I have run your document through iText and PDFBox text extraction with and without shifting by `'U' - '8'`, and the results look promising. I assume other general-purpose Java PDF libraries will also work.
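For illustration, the PDFBox variant can be wired together roughly like this (PDFBox 2.x assumed; the file name and the page range restricting the shift to the gibberish sections are placeholders, and GibberishFixer is the helper sketched above):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractAndFix {
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("document.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            // placeholder range: restrict to the pages that extract as gibberish
            stripper.setStartPage(1);
            stripper.setEndPage(document.getNumberOfPages());
            String raw = stripper.getText(document);
            System.out.println(GibberishFixer.fix(raw));
        }
    }
}
```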
Another work-around
Instead of creating custom extraction routines, you can try to fix the PDFs in question by adding ToUnicode map entries to their fonts. After that, normal text extraction programs should be able to extract the contents properly.
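A sketch of that repair using PDFBox 2.x could look as follows. The embedded CMap is an assumption: it covers single-byte codes for which the constant offset observed above holds over the printable ASCII range (codes 0x04 to 0x61 mapping to U+0021 onward) and ignores the non-ASCII glyphs, which would need additional bfchar entries; the file names and the choice to patch every font without a ToUnicode entry are placeholders as well:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class AddToUnicode {
    // assumed ToUnicode CMap: single-byte codes 0x04..0x61 map to U+0021..U+007E,
    // i.e. the constant offset 'U' - '8' = 29; non-ASCII glyphs are not covered
    private static final String TO_UNICODE_CMAP =
            "/CIDInit /ProcSet findresource begin\n"
            + "12 dict begin\n"
            + "begincmap\n"
            + "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
            + "/CMapName /Adobe-Identity-UCS def\n"
            + "/CMapType 2 def\n"
            + "1 begincodespacerange\n<00> <FF>\nendcodespacerange\n"
            + "1 beginbfrange\n<04> <61> <0021>\nendbfrange\n"
            + "endcmap\n"
            + "CMapName currentdict /CMap defineresource pop\n"
            + "end\nend";

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("document.pdf"))) {
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                for (COSName fontName : resources.getFontNames()) {
                    PDFont font = resources.getFont(fontName);
                    // assumption: patch every font that has no ToUnicode map yet;
                    // alternatively select the gibberish fonts by name
                    if (font != null && !font.getCOSObject().containsKey(COSName.TO_UNICODE)) {
                        PDStream toUnicode = new PDStream(document, new ByteArrayInputStream(
                                TO_UNICODE_CMAP.getBytes(StandardCharsets.US_ASCII)));
                        font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicode);
                    }
                }
            }
            document.save(new File("document-fixed.pdf"));
        }
    }
}
```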