I want to generate a Portable Document Format (PDF) file with an original program of mine. I am going to experiment with an original typesetting program, and in the course of development I want to avoid external tools and fonts as far as possible. So it would be ideal to avoid XeTeX, LuaTeX, and other engines. I also want to store the glyph information internally in my program or library. But where should the character code be specified in the PDF, so that the viewer knows which characters are involved when text is copied or searched?
To generate glyphs, my naive approach is to save, in a local library, raster images or Bézier curve parameters that correspond to the characters. According to the PDF Reference, that seems quite possible. I do not care about kerning, ligatures, or other aesthetic virtues for my present purpose, or at least those can be dealt with later.
Initially, I thought I might generate PostScript and use Ghostscript to convert it to PDF. But it is pointed out here that PostScript does not support Unicode, which I will certainly use. My options are then reduced to generating PDF directly from scratch.
My confusion is this: though my brute-force approach may render correctly, I guess the viewer would be unable to copy or search text in the resulting PDF, since I would have specified the character codes nowhere.
On p. 122 of the PDF Reference, we see that there are several different kinds of objects. The relevant ones seem to be text objects, path objects, and image objects.
Is it possible to associate an image object with its character code? As I recall, there are some scanned PDFs, for example the freely previewable parts of scanned Google Books, in which you can copy strings correctly. What is the method or field that specifies this? In the various tables that follow in the PDF Reference, however, I find no suitable slot for a Unicode code point.
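One common arrangement in OCR'd scans, offered here as an assumption about such files rather than something stated above, is that no code is attached to the image at all: the page content paints the image and then draws ordinary text over it in text rendering mode 3 (invisible), so that the viewer selects the hidden text. A minimal sketch of such a content stream follows; the resource names `Im1` and `F1`, the sizes, and both function names are hypothetical, and the font and image objects would still have to be defined elsewhere in the file.

```python
def escape_pdf_string(text: str) -> bytes:
    """Escape backslashes and parentheses for a PDF literal string."""
    raw = text.encode("latin-1")  # assumption: a simple one-byte encoding
    return (raw.replace(b"\\", b"\\\\")
               .replace(b"(", b"\\(")
               .replace(b")", b"\\)"))


def scanned_page_content(text: str, image_name: str = "Im1",
                         font_name: str = "F1", width: int = 200,
                         height: int = 200, font_size: int = 100) -> bytes:
    """Content stream: visible image, then the same text drawn invisibly."""
    # Scale the unit square to the image size and paint the image XObject.
    image_part = b"q %d 0 0 %d 0 0 cm /%s Do Q\n" % (
        width, height, image_name.encode("ascii"))
    # "3 Tr" selects text rendering mode 3: neither fill nor stroke.
    text_part = b"BT 3 Tr /%s %d Tf 0 0 Td (%s) Tj ET\n" % (
        font_name.encode("ascii"), font_size, escape_pdf_string(text))
    return image_part + text_part
```

In this sketch the image and the text are unrelated as far as the file is concerned; they merely occupy the same place on the page.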
Similarly, it is not clear how to associate a path object with its character code. If this can be done, the envisioned project would be easiest, since I would just extract the Bézier curve parameters of some open-source fonts (I believe that can be done) and translate them myself into the format PDF allows.
If neither image objects nor path objects can hold character codes, I conclude that a text object is (obviously) more suitable for representing a glyph together with its character code. Maybe a more correct way would be to embed in the PDF a custom font, synthesized at runtime. This is mentioned only briefly, in prose, on p. 364, sec. 5.8, "Embedded Font Programs". That does seem rather difficult and would require tremendous research. I would like you to recommend some tutorials on embedding fonts, as they are not easy to find. In fact, I find that example PDF files are themselves already scarce, as most of them seem to come as LZ-compressed binary files (I guess). Indeed, I tried compiling a "Hello world" PDF in a non-Computer-Modern font and opening it with a text editor, and all I saw was blanks, control characters, and mojibake-like strings.
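The unreadable parts of such a file are typically streams compressed with the Flate (zlib/deflate) filter, so a few lines of Python can make an existing PDF readable for study. The following is only a rough sketch: the function name `inflate_streams` is made up, it ignores the `/Length` and `/Filter` entries, and it will not handle other filters, encryption, or predictors.

```python
import re
import zlib

# Crude pattern: everything between "stream" and "endstream".
STREAM_RE = re.compile(rb"stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list:
    """Return each stream body, inflated when it is zlib data."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that sit
            # between the compressed data and "endstream".
            results.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            results.append(raw)  # not Flate data: keep it unchanged
    return results


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as handle:
        for body in inflate_streams(handle.read()):
            print(body[:200])
```

Run on a small file, this tends to reveal the page content operators and any embedded CMap, which are plain text once inflated.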
In summary, how do I (if possible) represent a glyph by a text object, an image object, or a path object so that its character code can be known? For concreteness, can you generate a PDF in which a circle is shown, but when you copy it, you copy the character "A"?
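One candidate for the circle test, sketched under assumptions, is a Type 3 font: its glyphs are ordinary path operators stored in the PDF itself, and a `/ToUnicode` CMap on the font tells the viewer which Unicode value each character code stands for. The script below writes such a file by hand. The object layout, the glyph name `circle`, the function names, and the output file name are all illustrative choices, and whether copying actually yields "A" depends on the viewer.

```python
def circle_glyph(cx=500.0, cy=500.0, r=400.0) -> bytes:
    """Glyph procedure: a filled circle made of four cubic Bezier arcs."""
    c = 0.5523 * r  # usual control-point offset for a quarter circle
    arcs = [
        (cx + r, cy + c, cx + c, cy + r, cx, cy + r),
        (cx - c, cy + r, cx - r, cy + c, cx - r, cy),
        (cx - r, cy - c, cx - c, cy - r, cx, cy - r),
        (cx + c, cy - r, cx + r, cy - c, cx + r, cy),
    ]
    lines = ["1000 0 0 0 1000 1000 d1",  # advance width and bounding box
             f"{cx + r:.2f} {cy:.2f} m"]
    lines += [" ".join(f"{v:.2f}" for v in arc) + " c" for arc in arcs]
    lines.append("f")
    return "\n".join(lines).encode("ascii")


# Maps the one-byte character code 0x41 to U+0041 ("A").
TO_UNICODE = b"""/CIDInit /ProcSet findresource begin
12 dict begin
begincmap
/CIDSystemInfo << /Registry (Adobe) /Ordering (UCS) /Supplement 0 >> def
/CMapName /Adobe-Identity-UCS def
/CMapType 2 def
1 begincodespacerange
<00> <FF>
endcodespacerange
1 beginbfchar
<41> <0041>
endbfchar
endcmap
CMapName currentdict /CMap defineresource pop
end
end"""


def stream(data: bytes) -> bytes:
    """Wrap raw bytes as an uncompressed PDF stream object body."""
    return b"<< /Length %d >>\nstream\n" % len(data) + data + b"\nendstream"


def build_pdf() -> bytes:
    objects = [
        b"<< /Type /Catalog /Pages 2 0 R >>",
        b"<< /Type /Pages /Kids [3 0 R] /Count 1 >>",
        b"<< /Type /Page /Parent 2 0 R /MediaBox [0 0 200 200] "
        b"/Resources << /Font << /F1 5 0 R >> >> /Contents 4 0 R >>",
        stream(b"BT /F1 100 Tf 50 50 Td (A) Tj ET"),  # shows code 0x41
        b"<< /Type /Font /Subtype /Type3 /FontBBox [0 0 1000 1000] "
        b"/FontMatrix [0.001 0 0 0.001 0 0] "
        b"/CharProcs << /circle 6 0 R >> "
        b"/Encoding << /Type /Encoding /Differences [65 /circle] >> "
        b"/FirstChar 65 /LastChar 65 /Widths [1000] /ToUnicode 7 0 R >>",
        stream(circle_glyph()),
        stream(TO_UNICODE),
    ]
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(objects, start=1):
        offsets.append(len(out))  # byte offset of "num 0 obj"
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    xref_pos = len(out)
    out += b"xref\n0 %d\n" % (len(objects) + 1)
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off  # each entry is exactly 20 bytes
    out += (b"trailer\n<< /Size %d /Root 1 0 R >>\nstartxref\n%d\n%%%%EOF\n"
            % (len(objects) + 1, xref_pos))
    return bytes(out)


if __name__ == "__main__":
    with open("circle_as_A.pdf", "wb") as handle:
        handle.write(build_pdf())
```

Leaving every stream uncompressed keeps the result readable in a text editor, which seems useful while experimenting. The page draws the string `(A)`, the `/Differences` array routes code 65 to the `circle` procedure, and the CMap is the only place where Unicode enters.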
luser droog points out that PostScript allows a Unicode encoding, so it seems PostScript is a good starting point, and the PDF spec is more low-level than I thought. - Violapterin