Extraction as described in the standard
The PDF specification ISO 32000-1 describes in section 9.10 Extraction of Text Content how text extraction can be done if the PDF provides the required information and does so correctly.
This algorithm, though, only works in a few page ranges of the document (namely the summaries, the content lists, the thank-yous, and the section Publicación 7); in the other ranges it results in gibberish, e.g. `8QLYHUVLWDWGH/OHLGD` instead of `Universitat de Lleida`. Looking at the PDF objects in question makes clear that the required information is missing (there is no ToUnicode map and, while the Encoding is based on WinAnsiEncoding, all positions in use are mapped via Differences to non-standard names).
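Such an inspection can also be scripted. Here is a minimal sketch using PDFBox 2.x that lists, for every font on every page, whether a ToUnicode map is present and what the Encoding entry contains; the file name is merely a placeholder:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class InspectFonts {
    public static void main(String[] args) throws IOException {
        // "document.pdf" is a placeholder for the file in question
        try (PDDocument document = PDDocument.load(new File("document.pdf"))) {
            int pageNo = 0;
            for (PDPage page : document.getPages()) {
                pageNo++;
                PDResources resources = page.getResources();
                for (COSName fontName : resources.getFontNames()) {
                    PDFont font = resources.getFont(fontName);
                    if (font == null)
                        continue;
                    // check for a ToUnicode map and dump the raw Encoding entry
                    boolean hasToUnicode = font.getCOSObject().containsKey(COSName.TO_UNICODE);
                    Object encoding = font.getCOSObject().getDictionaryObject(COSName.ENCODING);
                    System.out.printf("Page %d, font %s (%s): ToUnicode=%s, Encoding=%s%n",
                            pageNo, fontName.getName(), font.getName(), hasToUnicode, encoding);
                }
            }
        }
    }
}
```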
Also, trying to extract the text using copy & paste from Adobe Reader returns that gibberish. This generally is a sign that generic extraction is not possible.
A work-around
Inspecting the PDF objects and the output of the generic text extraction attempt, though, gives rise to the idea that the actual encoding of the text extracted as gibberish is the same for all fonts used, and that it is some ASCII-based encoding shifted by a constant: adding `'U' - '8'` to each character of the extracted `8QLYHUVLWDWGH/OHLGD` results in `Universitat de Lleida`. Adding the same constant to the characters of text extracted elsewhere in the document also results in correct text, as long as the text only uses ASCII characters.
Characters outside the ASCII range are not mapped correctly by that simple method, but each of them also always seems to be extracted as the same wrong character, e.g. the glyph 'ó' is always extracted as 'y'.
Thus, you can extract the text from this document (and similarly created ones) by first extracting the text using the standard algorithm and then, in the gibberish sections (which probably can be identified by font name), replacing each character: add `'U' - '8'` for the small values and substitute according to some mapping for the higher values, as sketched below.
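A minimal sketch of such a post-processing step (assuming the constant offset `'U' - '8'` observed above and preserving any whitespace the text extractor inserts; the replacement table only contains the 'ó'/'y' pair mentioned above and would have to be extended by comparing extracted text with the rendered pages):

```java
import java.util.Map;

public class GibberishFixer {
    // constant offset observed in the document: 'U' - '8' = 29
    private static final int SHIFT = 'U' - '8';

    // replacement table for glyphs outside the ASCII range; only the
    // 'ó' -> 'y' observation from above is included, further entries
    // have to be collected by comparing extraction results with the pages
    private static final Map<Character, Character> NON_ASCII = Map.of('y', 'ó');

    public static String fix(String gibberish) {
        StringBuilder fixed = new StringBuilder(gibberish.length());
        for (char c : gibberish.toCharArray()) {
            char shifted = (char) (c + SHIFT);
            if (Character.isWhitespace(c)) {
                fixed.append(c);                 // keep extractor-inserted whitespace
            } else if (shifted >= 0x20 && shifted <= 0x7E) {
                fixed.append(shifted);           // plain ASCII after shifting
            } else if (NON_ASCII.containsKey(c)) {
                fixed.append(NON_ASCII.get(c));  // known non-ASCII glyph
            } else {
                fixed.append(c);                 // unknown code, keep for manual review
            }
        }
        return fixed.toString();
    }
}
```

Unknown codes are deliberately kept unchanged so they stand out in the output and can be added to the replacement table later.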
As you mentioned Java in your question, I have run your document through iText and PDFBox text extraction with and without shifting by `'U' - '8'`, and the results look promising. I assume other general-purpose Java PDF libraries will also work.
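For illustration, the PDFBox variant can be wired together roughly like this (PDFBox 2.x assumed; the file name and the page range restricting the shift to the gibberish sections are placeholders, and GibberishFixer is the helper sketched above):

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractAndFix {
    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("document.pdf"))) {
            PDFTextStripper stripper = new PDFTextStripper();
            // placeholder range: restrict to the pages that extract as gibberish
            stripper.setStartPage(1);
            stripper.setEndPage(document.getNumberOfPages());
            String raw = stripper.getText(document);
            System.out.println(GibberishFixer.fix(raw));
        }
    }
}
```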
Another work-around
Instead of creating custom extraction routines, you can try to fix the PDFs in question by adding ToUnicode map entries to their fonts. After that, normal text extraction programs should be able to extract the contents properly.
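A sketch of that repair using PDFBox 2.x could look as follows. The embedded CMap is an assumption: it covers single-byte codes for which the constant offset observed above holds over the printable ASCII range (codes 0x04 to 0x61 mapping to U+0021 onward) and ignores the non-ASCII glyphs, which would need additional bfchar entries; the file names and the choice to patch every font without a ToUnicode entry are placeholders as well:

```java
import java.io.ByteArrayInputStream;
import java.io.File;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.pdfbox.cos.COSName;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.pdmodel.font.PDFont;

public class AddToUnicode {
    // assumed ToUnicode CMap: single-byte codes 0x04..0x61 map to U+0021..U+007E,
    // i.e. the constant offset 'U' - '8' = 29; non-ASCII glyphs are not covered
    private static final String TO_UNICODE_CMAP =
            "/CIDInit /ProcSet findresource begin\n"
            + "12 dict begin\n"
            + "begincmap\n"
            + "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def\n"
            + "/CMapName /Adobe-Identity-UCS def\n"
            + "/CMapType 2 def\n"
            + "1 begincodespacerange\n<00> <FF>\nendcodespacerange\n"
            + "1 beginbfrange\n<04> <61> <0021>\nendbfrange\n"
            + "endcmap\n"
            + "CMapName currentdict /CMap defineresource pop\n"
            + "end\nend";

    public static void main(String[] args) throws IOException {
        try (PDDocument document = PDDocument.load(new File("document.pdf"))) {
            for (PDPage page : document.getPages()) {
                PDResources resources = page.getResources();
                for (COSName fontName : resources.getFontNames()) {
                    PDFont font = resources.getFont(fontName);
                    // assumption: patch every font that has no ToUnicode map yet;
                    // alternatively select the gibberish fonts by name
                    if (font != null && !font.getCOSObject().containsKey(COSName.TO_UNICODE)) {
                        PDStream toUnicode = new PDStream(document, new ByteArrayInputStream(
                                TO_UNICODE_CMAP.getBytes(StandardCharsets.US_ASCII)));
                        font.getCOSObject().setItem(COSName.TO_UNICODE, toUnicode);
                    }
                }
            }
            document.save(new File("document-fixed.pdf"));
        }
    }
}
```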