I am currently using wicked_pdf (wkhtmltopdf) to create PDF files from HTML, but I am not able to copy and paste the content from the PDF properly. After searching the web, I suspect the problem is that the PDF doesn't contain a 'to unicode' map for matching the glyphs back to Unicode.

Example pdf : https://github.com/wkhtmltopdf/wkhtmltopdf/files/611265/sample.pdf

First line in the pdf : वे ब चे कूल नह जाते थे। पूरा दन मैदान म घूमते थे।

Many of the variations are lost while copying. What might be the issue here?

Also, is there any way to check whether a 'to unicode' map exists in a PDF file?

And how can I generate a PDF file with a proper 'to unicode' map using wkhtmltopdf?

1 Answer

Unfortunately, I can't tell you how to fix your issue, but...

The sample PDF does have a ToUnicode entry, as seen in the font object in its source:

<< /Type /Font
/Subtype /TrueType
/BaseFont /WHROBO+NotoSansDevanagari
/FirstChar 32
/LastChar 51
/FontDescriptor 14 0 R
/Encoding /WinAnsiEncoding
/Widths [ 259 0 0 0 0 0 0 0 0 0 0 0 0 0 268 0 0 0 0 550 ]
/ToUnicode 12 0 R
>>

ToUnicode points to:

12 0 obj
<< /Length 13 0 R
   /Filter /FlateDecode
>>
stream
  ...
endstream
endobj
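If you want to check for a ToUnicode map without reading the PDF source by hand, something like the following rough Python sketch might work. It uses only the standard library and makes some assumptions: the objects are stored as plain `N G obj ... endobj` blocks (not packed into compressed object streams), the stream uses at most a single /FlateDecode filter, and the bytes `endstream` do not occur inside the compressed data. The function names `find_tounicode_refs` and `read_stream` are made up for this example.

```python
import re
import zlib


def find_tounicode_refs(pdf_bytes):
    """Return the sorted object numbers that /ToUnicode entries point to."""
    refs = re.finditer(rb"/ToUnicode\s+(\d+)\s+\d+\s+R", pdf_bytes)
    return sorted({int(m.group(1)) for m in refs})


def read_stream(pdf_bytes, obj_num):
    """Return the decoded stream of object `obj_num`, or None if it has none.

    Assumption: the object is stored uncompressed at the top level of the
    file and the stream is either raw or Flate-compressed.
    """
    pattern = (
        rb"(?<!\d)" + str(obj_num).encode()
        + rb"\s+\d+\s+obj\b((?:(?!endobj).)*?)stream\r?\n(.*?)endstream"
    )
    m = re.search(pattern, pdf_bytes, re.DOTALL)
    if m is None:
        return None
    header, body = m.group(1), m.group(2)
    if b"/FlateDecode" in header:
        # decompressobj tolerates the end-of-line bytes left before "endstream"
        body = zlib.decompressobj().decompress(body)
    return body
```

Reading the file with `open("sample.pdf", "rb").read()`, an empty list from `find_tounicode_refs` would mean no font declares a ToUnicode map. Otherwise `read_stream(data, 12)` should give you the CMap text for the object shown above, so you can see which codes it actually maps.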

This stream doesn't seem to be long enough, though, and most of the entries in the font's Widths array are zero (or the characters just aren't included). When I ran the single-line sample you provided through docca.io, I got:

<< /Type /Font
/Subtype /TrueType
/Name /F1
/BaseFont /DOCCAA+NotoSansDevanagari
/Encoding /MacRomanEncoding
/FontDescriptor 7 0 R
/FirstChar 32
/LastChar 62
/Widths [260 551 551 551 551 551 551 551 551 551 551 762 591 634 742 570 642 520 555 568 571 598 409 678 556 531 259 488 488 488 379]
/ToUnicode 8 0 R
>>

8 0 obj
<< /Length 347
/Filter /FlateDecode
/Length1 667 >>
stream
  ...
endstream
endobj

So it has a much longer character map, even though it contains far fewer characters.
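To put numbers on that comparison, here is a small Python sketch. It checks each Widths array against its FirstChar/LastChar range and counts how many character codes a decoded ToUnicode CMap covers. The helper names `width_stats` and `count_cmap_mappings` are hypothetical, and the CMap counter assumes the usual `beginbfchar` / `beginbfrange` sections with hex codes in angle brackets.

```python
import re

HEX = r"<([0-9A-Fa-f]+)>"


def width_stats(first_char, last_char, widths):
    """Return (expected entries, actual entries, nonzero entries)."""
    expected = last_char - first_char + 1
    nonzero = sum(1 for w in widths if w != 0)
    return expected, len(widths), nonzero


def count_cmap_mappings(cmap_text):
    """Count the source codes covered by bfchar and bfrange sections."""
    total = 0
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.DOTALL):
        # each bfchar entry is a <source> <destination> pair
        total += len(re.findall(HEX + r"\s*" + HEX, section))
    for section in re.findall(r"beginbfrange(.*?)endbfrange", cmap_text, re.DOTALL):
        # collapse [<a> <b> ...] destination arrays so every entry is a triple
        flat = re.sub(r"\[.*?\]", "<00>", section, flags=re.DOTALL)
        for lo, hi, _dst in re.findall(HEX + r"\s*" + HEX + r"\s*" + HEX, flat):
            total += int(hi, 16) - int(lo, 16) + 1
    return total


# Widths copied from the two font objects quoted in this answer
wkhtmltopdf_widths = [259, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 268, 0, 0, 0, 0, 550]
docca_widths = [260, 551, 551, 551, 551, 551, 551, 551, 551, 551, 551, 762, 591,
                634, 742, 570, 642, 520, 555, 568, 571, 598, 409, 678, 556, 531,
                259, 488, 488, 488, 379]

print(width_stats(32, 51, wkhtmltopdf_widths))
print(width_stats(32, 62, docca_widths))
```

For the wkhtmltopdf font only three of the 20 width entries are nonzero, while all 31 entries are set in the docca.io one. Running `count_cmap_mappings` on the two decoded ToUnicode streams would show whether the CMaps differ in the same way. I haven't decoded them here, so that part is still a guess.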

Out of interest, has this rendered correctly? It looks a little different from your sample text to me, but I don't read Devanagari. (Screenshot: PDF rendered in Chrome.)