I'm trying to mine some text from a bunch of PDFs, and for a few of them the extracted output consists of raw CID codes (apparently from embedded CID fonts) instead of characters:
(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:177)(cid:3)(cid:71)(cid:72)(cid:191)(cid:81)(cid:72)(cid:71)(cid:3)(cid:69)(cid:92)
(cid:3)(cid:56)(cid:49)(cid:3)(cid:43)(cid:68)(cid:69)(cid:76)(cid:87)(cid:68)(cid:87)
(cid:3)(cid:68)(cid:86)(cid:3)(cid:70)(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)
(cid:90)(cid:76)(cid:87)(cid:75)(cid:3)(cid:80)(cid:82)(cid:85)(cid:72)(cid:3)(cid:87)
(cid:75)(cid:68)(cid:81)(cid:3)(cid:20)(cid:19)(cid:3)
When I look at that exact snippet of text in the PDF, the letters are certainly convertible to ASCII:
This suggests that a brute-force decoding would work (i.e. read a snippet of text that corresponds to a bunch of CID codes and build a mapping that way), but will this be reliable across lots of different PDFs? Is there a reliable mapping from these CID codes to ASCII characters, or will it be highly dependent on the font in the PDF? How can I determine which ASCII character a CID code like (cid:72) corresponds to?
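To make the brute-force idea concrete, here is a minimal sketch of what such a mapping might look like for the snippet above. It rests on assumptions drawn only from this one sample: most codes appear to sit at a constant offset of 29 below their ASCII value (e.g. 72 + 29 = 101, which is "e"), while 177 and 191 do not fit that pattern and are guessed from context to be an en dash and an "fi" ligature. The names `decode_cids` and `OVERRIDES` are hypothetical, and nothing here implies the same offset holds for another font or another PDF.

```python
import re

# Matches tokens such as "(cid:72)" in the extracted text.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")

# Hypothetical overrides, guessed from context in this one sample only.
OVERRIDES = {177: "\u2013", 191: "fi"}


def decode_cids(text, offset=29, overrides=OVERRIDES):
    """Replace (cid:N) tokens using a fixed offset plus a few overrides.

    The default offset of 29 is an assumption observed in this sample;
    tokens that do not land in printable ASCII are left untouched.
    """
    def replace(match):
        cid = int(match.group(1))
        if cid in overrides:
            return overrides[cid]
        code = cid + offset
        if 32 <= code < 127:
            return chr(code)
        return match.group(0)  # unknown code: keep the raw token

    return CID_TOKEN.sub(replace, text)


sample = (
    "(cid:80)(cid:72)(cid:87)(cid:68)(cid:70)"
    "(cid:76)(cid:87)(cid:76)(cid:72)(cid:86)(cid:3)"
)
print(repr(decode_cids(sample)))
```

Codes that fall outside printable ASCII are deliberately left as raw tokens, so a wrong offset for a different font shows up as undecoded or obviously garbled text rather than passing silently.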
For what it's worth, I'm extracting the text using PDFMiner, which appears to be the only tool that actually reports the CID codes. If there is a better tool out there for converting PDFs to HTML or any other parsable text format, I'm open to suggestions!
As an added bonus, this question appears to be related to a few other unanswered questions, so there is a rich bounty of reputation on the line here: