Getting text in PDF with toUnicode

Question

I am working on a PDF project in which I need to extract all the text from a PDF. I am having trouble decoding an Identity-H font using the toUnicode table that the PDF itself provides. The table maps character codes to Unicode values in hex, but it does not seem to include the uppercase characters I am reading. Is there a way to lowercase the input unichar before mapping it to Unicode through the table?

Can I use the offset between <000C> and <0042> to calculate the uppercase character?

The toUnicode table:

57 beginbfchar
<0001> <0020>
<0002> <0021>
<0003> <0026>
<0004> <2019>
<0005> <002C>
<0006> <002D>
<0007> <002E>
<0008> <003A>
<0009> <003F>
<000A> <0040>
<000B> <0041>
<000C> <0042>
<000D> <0043>
<000E> <0044>
<000F> <0045>
<0010> <0046>
<0011> <0047>
<0012> <0048>
<0013> <0049>
<0014> <004A>
<0015> <004B>
<0016> <004C>
<0017> <004D>
<0018> <004F>
<0019> <0050>
<001A> <0052>
<001B> <0053>
<001C> <0054>
<001D> <0055>
<001E> <0057>
<001F> <0059>
<0020> <2018>
<0021> <0061>
<0022> <0062>
<0023> <0063>
<0024> <0064>
<0025> <0065>
<0026> <0066>
<0027> <0067>
<0028> <0068>
<0029> <0069>
<002A> <006A>
<002B> <006B>
<002C> <006C>
<002D> <006D>
<002E> <006E>
<002F> <006F>
<0030> <0070>
<0031> <0072>
<0032> <0073>
<0033> <0074>
<0034> <0075>
<0035> <0077>
<0036> <0079>
<0037> <007A>
<0038> <FB01>
<0039> <00FC>
endbfchar

The table does not seem to provide a mapping for the uppercase characters. How can I show those characters?

The table clearly gives mappings to Unicode values for uppercase Latin characters; see code points 0B..1E. Probably, due to font subsetting, not all character codes are present in both the font and this mapping. — Ritsaert Hornstra
I have checked, and it only maps to lowercase. For example, the hex code point <02dd> should map to D, but it is not in the Unicode table. The odd thing is that I can search the text in Preview (the Mac PDF app), so either the table should provide it or Preview uses a different method to get the text. — Lunayo
And can you explain "See code point 0B..1E"? I don't see any uppercase mapping. — Lunayo
Code point <000B> maps to Unicode character <0041>, which is CAPITAL A. If that isn't uppercase, I don't know what is. — Ritsaert Hornstra
@RitsaertHornstra Yes, you are right. But why can't the unichar I get from the PDF before mapping, <02DD> (which should really map to capital T), be found in the table? Instead it is <001C> that maps to capital T, <0054>. — Lunayo
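
The lookup discussed in these comments can be sketched in plain C. Note that a fixed offset would not work for this table: <000B> maps to <0041> (a difference of 0x36), while <0018> maps to <004F> (a difference of 0x37), because the subset skips the letter N. Each code therefore has to be looked up individually. The type and function names below (BfCharEntry, parse_bfchar_line, lookup_unicode) are made up for illustration, and the parser assumes four-digit destinations like the ones in this table; bfchar destinations in other PDFs can be longer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* One bfchar entry: a character code and the Unicode value it maps to. */
typedef struct {
    unsigned code;
    unsigned unicode;
} BfCharEntry;

/* Parse one line of the form "<000B> <0041>".
   Returns 1 on success, 0 if the line is not a bfchar entry. */
int parse_bfchar_line(const char *line, BfCharEntry *out)
{
    unsigned code, unicode;
    assert(line != NULL && out != NULL);
    if (sscanf(line, " <%4x> <%4x>", &code, &unicode) != 2)
        return 0;
    out->code = code;
    out->unicode = unicode;
    return 1;
}

/* Look up a character code in the parsed table.
   Returns the Unicode value, or -1 if the code is not listed. */
long lookup_unicode(const BfCharEntry *table, size_t count, unsigned code)
{
    for (size_t i = 0; i < count; i++) {
        if (table[i].code == code)
            return (long)table[i].unicode;
    }
    return -1;
}
```

With the table from the question, looking up 0x001C gives 0x0054 (capital T), while 0x02DD is not listed and gives -1, which suggests the value being looked up is not a raw character code from the content stream.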

Lunayo · Accepted Answer · 2011-10-27T09:33:45

I solved the problem. The cause was CGPDFStringCopyTextString(): the string it produces from a CGPDFStringRef contained some unexpected bytes that I didn't want. So instead I read the bytes manually:

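// `data` is assumed to be an NSData holding the raw bytes of the PDF string
// (for example, obtained via CGPDFStringGetBytePtr() and CGPDFStringGetLength()).
NSMutableString *unicodeString = [NSMutableString string];
for (NSUInteger i = 0; i < [data length]; i++) {
    // Read one raw byte at a time, bypassing CGPDFStringCopyTextString().
    unsigned char byte;
    [data getBytes:&byte range:NSMakeRange(i, 1)];
    // Append the byte as a single character.
    [unicodeString appendFormat:@"%c", byte];
}
return unicodeString;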
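
As a hedged follow-up in plain C, here is one way the raw bytes could be turned into codes that match the toUnicode table. It assumes the font uses Identity-H, where each character code occupies two bytes with the high byte first, so it pairs bytes rather than appending them one at a time as the snippet above does. The names bytes_to_codes, code_to_unicode and kExcerpt are illustrative and not part of the original answer; the excerpt holds only four entries copied from the table in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Excerpt of the toUnicode table from the question: {code, Unicode value}. */
static const unsigned kExcerpt[][2] = {
    {0x001C, 0x0054}, /* T */
    {0x0025, 0x0065}, /* e */
    {0x0032, 0x0073}, /* s */
    {0x0033, 0x0074}, /* t */
};

/* Combine raw string bytes into 2-byte, big-endian character codes.
   Returns the number of codes written to `codes` (at most `max_codes`).
   A trailing odd byte, if any, is ignored. */
size_t bytes_to_codes(const unsigned char *bytes, size_t length,
                      unsigned *codes, size_t max_codes)
{
    size_t n = 0;
    assert(bytes != NULL && codes != NULL);
    for (size_t i = 0; i + 1 < length && n < max_codes; i += 2)
        codes[n++] = ((unsigned)bytes[i] << 8) | bytes[i + 1];
    return n;
}

/* Map one code through the excerpt.
   Returns U+FFFD (the replacement character) if the code is not listed. */
unsigned code_to_unicode(unsigned code)
{
    for (size_t i = 0; i < sizeof kExcerpt / sizeof kExcerpt[0]; i++) {
        if (kExcerpt[i][0] == code)
            return kExcerpt[i][1];
    }
    return 0xFFFD;
}
```

Under these assumptions, the bytes 00 1C 00 25 00 32 00 33 become the codes 0x001C, 0x0025, 0x0032, 0x0033, which the excerpt maps to the letters of "Test".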
