1
votes

I have a PDF file from which neither PDFBox nor iText 7 can extract text. The font uses the Identity-H encoding with an Adobe-Identity-UCS ToUnicode CMap. The content of the ToUnicode stream is given below.


    /CIDInit /ProcSet findresource begin

    12 dict begin

    begincmap

    /CIDSystemInfo > def
    /CMapName /Adobe-Identity-UCS def
    /CMapType 2 def

    1 begincodespacerange
    <0000><FFFF>
    endcodespacerange

    endcmap
    CMapName currentdict /CMap defineresource pop
    end
    end

The ToUnicode CMap is invalid. Is there any way to fix it?

I tried to find an intact Adobe-Identity-UCS CMap file to replace it with, but after a lot of searching on Google I could not find one.

Any help? Thanks.

Edit:

Chinese-cidmap-broken.pdf

@TilmanHausherr Thanks. I know how to rewrite the ToUnicode CMap, but I cannot find an Adobe-Identity-UCS CMap file. - KlSoft
@KlSoft Can you post a download link for the PDF file? - Mihai Iancu
@MihaiIancu Sure, thanks. - KlSoft
@KlSoft I thought that maybe the file used the characters' Unicode code points as glyph IDs and then used this generic CMap name without filling in the actual CMap, but that is not the case here. - Mihai Iancu

1 Answer

4
votes

The ToUnicode CMap you show corresponds to the example ToUnicode CMap in the PDF specification ISO 32000 (either part), merely without any bfrange or bfchar section.

Thus, what you have essentially is a template into which one can put arbitrary mappings.
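To illustrate what "template" means here, the following minimal sketch checks whether a ToUnicode CMap contains any mapping section at all. It assumes the CMap stream has already been decoded to text; the function name `has_mappings` is hypothetical and not part of PDFBox or iText.

```python
import re


def has_mappings(cmap_text: str) -> bool:
    """Return True if the CMap text contains a bfchar or bfrange section.

    Only the section openers are searched for; the entries inside the
    sections are not validated.
    """
    return re.search(r"\bbegin(bfchar|bfrange)\b", cmap_text) is not None
```

For the CMap shown in the question this check would return `False`, because the stream goes from `endcodespacerange` straight to `endcmap`.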

Concerning your question, therefore:

"Is there any way to fix it?"

Yes and no.

Yes, you can fix it by adding the appropriate bfrange or bfchar sections with the correct mappings.

But to do so, you need to know which character codes map to which Unicode strings for the font at hand; the name Adobe-Identity-UCS by itself usually does not imply the mapping. So also:

No, not without additional information.

@Tilman in his comment to your question referenced one of his answers in which he showed how to add a missing ToUnicode map using information on the actual mappings gathered from different sources.
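
Once the mappings are known, filling the template is mechanical. The following is a minimal sketch, not a definitive implementation: it assumes two-byte character codes (matching the `<0000><FFFF>` codespace range in the question) and a mapping that has already been recovered from some other source. The function name `build_tounicode_cmap` and the sample mapping are hypothetical, and the `/CIDSystemInfo` line follows the example in the PDF specification rather than the file in the question.

```python
def build_tounicode_cmap(mapping):
    """Build ToUnicode CMap text from {2-byte character code: Unicode string}."""
    lines = [
        "/CIDInit /ProcSet findresource begin",
        "12 dict begin",
        "begincmap",
        # Assumption: taken from the ISO 32000 example, not from the broken file.
        "/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def",
        "/CMapName /Adobe-Identity-UCS def",
        "/CMapType 2 def",
        "1 begincodespacerange",
        "<0000><FFFF>",
        "endcodespacerange",
    ]
    items = sorted(mapping.items())
    # A single bfchar section holds at most 100 entries, so emit chunks.
    for start in range(0, len(items), 100):
        chunk = items[start:start + 100]
        lines.append(f"{len(chunk)} beginbfchar")
        for code, text in chunk:
            # Destination strings are written as UTF-16BE hex.
            dst = text.encode("utf-16-be").hex().upper()
            lines.append(f"<{code:04X}> <{dst}>")
        lines.append("endbfchar")
    lines += [
        "endcmap",
        "CMapName currentdict /CMap defineresource pop",
        "end",
        "end",
    ]
    return "\n".join(lines) + "\n"


# Hypothetical mapping for illustration only; the real codes must come
# from the font at hand.
sample = build_tounicode_cmap({0x0001: "中", 0x0002: "文"})
```

The resulting text would then be written back as the stream of the font dictionary's ToUnicode entry with a PDF library such as PDFBox or iText.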