Generating PDF from scratch, how are glyphs mapped to character codes?

Question

I want to generate a Portable Document Format (PDF) file with a program of my own. I am going to experiment with an original typesetting program, and in the course of development I want to avoid external tools and fonts as far as possible. So it would be ideal to avoid XeTeX, LuaTeX, and other engines, and to store the glyph information internally in my program or library. But where should the character codes be specified in the PDF, so that the viewer knows which characters are being copied or searched?

To generate glyphs, my naive approach is to store, in a local library, raster images or Bézier curve parameters that correspond to the characters. According to the PDF Reference, that seems quite possible. I do not care about kerning, ligatures, or other aesthetic virtues for my present purpose, or at least those can be dealt with later.

Initially, I thought I might generate PostScript and use Ghostscript to convert it to PDF. But it is pointed out here that PostScript does not support Unicode, which I will certainly use. My options are then reduced to generating PDF directly from scratch.

My confusion is this: although my brute-force approach may render correctly, I guess the viewer would be unable to copy or search text in the resulting PDF, since I would not have specified the character codes anywhere.

In PDF Reference p.122, we see that there are several different objects. What seems relevant are text objects, path objects, and image objects.

Is it possible to associate an image object with its character code? As I recall, there are some scanned PDFs, for example the freely previewable parts of scanned Google Books, in which you can copy strings correctly. What method or field specifies that? In the various tables that follow in the PDF Reference, however, I find no suitable slot for a Unicode code point.

Similarly, it is not clear how to associate a path object with its character code. If this can be done, the envisioned project would be easiest, since I could just extract the Bézier curve parameters from some open source fonts (I believe that can be done) and translate them myself into the format PDF allows.

If neither image objects nor path objects can hold character codes, I conclude that a text object is (obviously) more suitable for representing a glyph together with its character code. Maybe a more correct way would be to embed a custom font, synthesized at runtime, in the PDF. This is mentioned briefly on p.364, sec. 5.8, "Embedded Font Programs". That does seem rather difficult and seems to require tremendous research. I would like you to recommend some tutorials on embedding fonts, as they are not easy to find. In fact, I find that even example PDF files are scarce, as most of them seem to come as LZ-compressed binary files (I guess). Indeed, I tried compiling a "Hello world" PDF in a non-Computer-Modern font and opening it with a text editor, and all I saw was blanks, control characters, and mojibake-like strings.

In summary, how do I (if possible) represent a glyph by a text object, image object, or path object so that its character code can be known? For concreteness, can you generate a PDF in which a circle is shown, but when you copy it, you copy the character "A"?

As you want to create some program, I'd propose you simply look for existing PDF libraries for your runtime environment of choice. That would remove any transformation steps losing quality in one or the other way. — mkl
Do PostScript and Ghostscript result in a loss of quality? Since luser droog points out that PostScript allows a Unicode encoding, PostScript seems a good starting point, and the PDF spec is more low-level than I thought. — Violapterin
Well, it depends on what kind of pdfs you want to be able to create. I in particular don't know how much influence you can have on the structure elements for tagged pdfs and on annotations and attachments when going the postscript way. — mkl

luser droog · Accepted Answer · 2019-09-15T16:50:46

The association between the curves and the character code is the font. There are several tables involved that do the mappings. The font has an Encoding vector, which is indexed by the character code and yields a glyph name. For text to be copied out of the document reliably, there should also be a ToUnicode CMap, which maps character codes to Unicode code points.
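The two lookups described above can be sketched in a few lines of Python. This is only a toy model, not how any particular viewer is implemented: the table names `encoding`, `char_procs`, and `to_unicode` and the one-glyph font are hypothetical, chosen to mirror the circle-that-copies-as-"A" case from the question.

```python
# Toy model of a one-glyph font. All names and values here are illustrative
# assumptions, not taken from any real font or library.

# Encoding vector: character code -> glyph name (used for drawing).
encoding = {0x01: "circle"}

# Glyph procedures: glyph name -> outline (here just a placeholder string).
char_procs = {"circle": "<path operators that draw a circle>"}

# ToUnicode mapping: character code -> Unicode text (used for copy/search).
to_unicode = {0x01: "A"}


def render(codes: bytes) -> list:
    """Drawing path: each code selects a glyph name, which selects an outline."""
    return [char_procs[encoding[code]] for code in codes]


def extract_text(codes: bytes) -> str:
    """Copy/search path: each code is looked up directly in the ToUnicode table.

    Codes without an entry become U+FFFD, the Unicode replacement character.
    """
    return "".join(to_unicode.get(code, "\ufffd") for code in codes)


if __name__ == "__main__":
    shown = b"\x01"             # the string operand of a text-showing operator
    print(render(shown))        # what gets painted
    print(extract_text(shown))  # what gets copied
```

The point of the sketch is that drawing and text extraction are separate lookups keyed by the same character code, so the shape painted and the text copied need not resemble each other.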

If you study a simple example of a PostScript Type 3 font, that should be very beneficial in understanding a PDF font. I have a short one in this calendar program.

To answer the bold question, if you convert gridcal.ps to PDF, copying the moon glyph results in the character 1, because it is in the ASCII position for 1 in the Encoding vector. Some of the other glyphs, notably sun, mars and venus, are recognized by Ghostscript, which produces a mapping to the Unicode character. This is very clever, but probably not sufficiently extensive to rely upon (indeed, moon, mercury, jupiter and saturn are not recognized).
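For the concrete case in the question, here is a minimal sketch in Python that writes such a PDF by hand, without going through PostScript. It assumes a Type 3 font with a single glyph named `circle` at character code 1, plus a ToUnicode CMap mapping that code to U+0041. The helper names (`stream`, `circle_glyph`, `build_pdf`), the page size, and the object numbering are all choices made for this illustration. A viewer that honours ToUnicode should copy the circle as "A", but that has not been checked across viewers.

```python
def stream(data: bytes) -> bytes:
    """Wrap raw bytes as the body of an uncompressed PDF stream object."""
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"


def circle_glyph(c: float = 500, r: float = 450) -> bytes:
    """Type 3 glyph procedure: a filled circle built from four Bezier arcs."""
    k = 0.5523 * r  # control-point offset approximating a quarter circle
    ops = [
        "1000 0 0 0 1000 1000 d1",  # advance width, then glyph bounding box
        f"{c + r:.1f} {c:.1f} m",
        f"{c + r:.1f} {c + k:.1f} {c + k:.1f} {c + r:.1f} {c:.1f} {c + r:.1f} c",
        f"{c - k:.1f} {c + r:.1f} {c - r:.1f} {c + k:.1f} {c - r:.1f} {c:.1f} c",
        f"{c - r:.1f} {c - k:.1f} {c - k:.1f} {c - r:.1f} {c:.1f} {c - r:.1f} c",
        f"{c + k:.1f} {c - r:.1f} {c + r:.1f} {c - k:.1f} {c + r:.1f} {c:.1f} c",
        "f",
    ]
    return "\n".join(ops).encode("ascii")


# ToUnicode CMap: character code <01> is reported as U+0041 ("A").
TO_UNICODE = b"""/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<01> <0041>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end"""


def build_pdf() -> bytes:
    """Assemble a one-page PDF whose only text is character code 1."""
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        stream(b"BT /F1 100 Tf 50 50 Td <01> Tj ET"),  # show code 1 once
        b"<< /Type /Font /Subtype /Type3 /FontBBox [0 0 1000 1000] "
        b"/FontMatrix [0.001 0 0 0.001 0 0] /CharProcs << /circle 6 0 R >> "
        b"/Encoding << /Type /Encoding /Differences [1 /circle] >> "
        b"/FirstChar 1 /LastChar 1 /Widths [1000] /ToUnicode 7 0 R >>",
        stream(circle_glyph()),
        stream(TO_UNICODE),
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for number, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "N 0 obj" for the xref table
        out += b"%d 0 obj\n" % number + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"  # each xref entry is exactly 20 bytes
    for offset in offsets:
        out += b"%010d 00000 n \n" % offset
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n" % (
        len(objects) + 1, xref_pos)
    return bytes(out)


if __name__ == "__main__":
    with open("circle_as_A.pdf", "wb") as f:  # output file name is arbitrary
        f.write(build_pdf())
```

Nothing is compressed, so the result can be opened in a text editor and read alongside the PDF Reference. The Encoding `Differences` array ties code 1 to the `circle` procedure, and the `beginbfchar` entry ties the same code to "A".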
