7
votes

I'm trying to mine some text from a bunch of PDFs and a few of them have embedded CID fonts in the output:

(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:177)(cid:3)(cid:71)(cid:72)(cid:191)(cid:81)(cid:72)(cid:71)(cid:3)(cid:69)(cid:92)
(cid:3)(cid:56)(cid:49)(cid:3)(cid:43)(cid:68)(cid:69)(cid:76)(cid:87)(cid:68)(cid:87)
(cid:3)(cid:68)(cid:86)(cid:3)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:90)(cid:76)(cid:87)(cid:75)(cid:3)(cid:80)(cid:82)(cid:85)(cid:72)(cid:3)(cid:87)
(cid:75)(cid:68)(cid:81)(cid:3)(cid:20)(cid:19)(cid:3)

When I look at that exact snippet of text in the PDF, the letters are certainly convertible to ASCII:

[Image: screenshot of the corresponding portion of the PDF]

This suggests that a brute-force decoding would work (i.e. read a snippet of text that corresponds to a run of CID codes and build a mapping from it), but will that be reliable across lots of different PDFs? Is there a reliable mapping from these CID codes to ASCII characters, or is it highly dependent on the font in the PDF? How can I determine which ASCII character a CID code like (cid:72) corresponds to?

For what it's worth, I'm extracting the text using PDFMiner, which appears to be the only tool that actually reports the CID codes. If there is a better tool out there for converting PDFs to HTML or any other parsable text format, I'm open to other suggestions!

As an added bonus, this question appears to be related to a few other unanswered questions, so there is a rich bounty of reputation on the line here.

1
Hi, were you able to do this? I have a PDF file that spits out CID values that cannot be interpreted. Is there a way to translate them into human-readable text using Python? The Adobe documentation shows some PostScript stuff I can't understand. Is there an easier solution? - Spencer Trinh

1 Answer

6
votes

While you can probably do this by guesswork for the simple example here, to do it correctly you'll need two additional pieces of information:

1) The Registry-Ordering-Supplement (ROS) information for the font in question. This will usually be something like 'Adobe-Japan1-5' or some such and is an informational property stored in the font. The ROS determines how the CIDs are to be interpreted. A given CID in one font is not necessarily the same as a CID in another font, unless the ROSes are the same. That is to say: CID12345 in Adobe-Japan1-5 is not the same shape as CID12345 in Adobe-GB1-3!

2) Armed with the ROS info, select a compatible CMap and decode through that. Targeting ASCII is a bit short-sighted; I would go with Unicode, of which ASCII is a subset. You can find CMap files for the Adobe-defined ROSes at https://github.com/adobe-type-tools/cmap-resources
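
As a rough illustration of step 2, here is a minimal Python sketch that inverts code-to-CID entries written in CMap syntax into a CID-to-character table. The `SAMPLE_CMAP` excerpt and the `cid_to_unicode` helper are made up for this example (the values are not copied from a real Adobe file). It also assumes a Unicode CMap whose codes are single BMP code points; a real file from the repository would additionally need surrogate-pair and `usecmap` handling.

```python
import re

# Hypothetical excerpt in CMap syntax. The numbers are illustrative only,
# not taken from an actual Adobe CMap resource.
SAMPLE_CMAP = """
2 begincidrange
<0020> <007e> 1
<00a0> <00a3> 200
endcidrange
1 begincidchar
<2013> 300
endcidchar
"""

HEX = r"<([0-9a-fA-F]+)>"


def cid_to_unicode(cmap_text):
    """Invert code->CID entries into a CID->character dict.

    Assumption: every code is one Unicode code point (no surrogate pairs).
    If several codes share a CID, the first one seen wins.
    """
    mapping = {}
    # Range entries: <first code> <last code> <first CID>
    for block in re.findall(r"begincidrange(.*?)endcidrange", cmap_text, re.S):
        for lo, hi, cid in re.findall(HEX + r"\s*" + HEX + r"\s+(\d+)", block):
            start, end, first_cid = int(lo, 16), int(hi, 16), int(cid)
            for step in range(end - start + 1):
                mapping.setdefault(first_cid + step, chr(start + step))
    # Single entries: <code> <CID>
    for block in re.findall(r"begincidchar(.*?)endcidchar", cmap_text, re.S):
        for code, cid in re.findall(HEX + r"\s+(\d+)", block):
            mapping.setdefault(int(cid), chr(int(code, 16)))
    return mapping


table = cid_to_unicode(SAMPLE_CMAP)
print(repr(table[34]))  # CID 34 in the made-up excerpt
```

With a table like this, decoding is a dictionary lookup per CID, but only for fonts whose ROS matches the CMap the table was built from.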

More information on CID and CMaps direct from the inventors is available at http://www.adobe.com/content/dam/Adobe/en/devnet/font/pdfs/5014.CIDFont_Spec.pdf
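
For the specific sample in the question, the guesswork route would look roughly like the Python sketch below. Everything in it is an assumption drawn from that one snippet: the offset of 29 is inferred by eye (for example, (cid:72) reading as "e"), and the overrides for 177 and 191 are guesses at an en dash and an "fi" ligature based on context. The names `decode_cids`, `GUESSED_OFFSET` and `OVERRIDES` are hypothetical, and none of this is guaranteed to hold for another font or another PDF.

```python
import re

CID_TOKEN = re.compile(r"\(cid:(\d+)\)")

# Assumption: offset observed by hand in this one sample, not a general rule.
GUESSED_OFFSET = 29

# Assumption: guessed from context for CIDs that the offset does not explain.
OVERRIDES = {177: "\u2013", 191: "\ufb01"}


def decode_cids(text, offset=GUESSED_OFFSET, overrides=OVERRIDES):
    """Replace each (cid:N) token using the overrides, else chr(N + offset)."""
    def replace(match):
        cid = int(match.group(1))
        if cid in overrides:
            return overrides[cid]
        return chr(cid + offset)
    return CID_TOKEN.sub(replace, text)


sample = ("(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)"
          "(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)")
print(repr(decode_cids(sample)))  # -> 'metacities '
```

If the decoded text for a different document comes out as nonsense, that is a sign the font uses a different CID assignment and the ROS/CMap route is needed.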