Without embedded fonts, is PDF limited to only 4281 characters (of AGL)? How to display more glyphs?

Question

The Adobe Glyph List (AGL) is a mapping of 4,281 glyph names to one or more Unicode characters.

As far as I understand, these are PDF names such as /Adieresis, each of which identifies a Unicode character (here U+00C4). If my understanding is correct, those 4,281 names can be used to specify a mapping, as done here for the font named /F1 in a page's /Resources dictionary:

<<
/Type /Page
/Resources <<
  /Font <<
    /F1 <<
      /Type /Font
      /Subtype /Type1
      /BaseFont /Times-Roman
      /Encoding <<
        % codes 1 and 2 now select the glyphs Adieresis and adieresis
        /Differences [ 1 /Adieresis /adieresis ]
      >>
    >>
  >>
>>
>>
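
The effect of such a /Differences array can be sketched in Python. This is only an illustration under stated assumptions: `AGL_EXCERPT` is a tiny hand-picked excerpt of the glyph list, `base_encoding` is a one-entry stand-in for a real base encoding, and `apply_differences` and `decode` are hypothetical helper names.

```python
# Tiny hand-picked excerpt of the Adobe Glyph List (glyph name -> Unicode).
AGL_EXCERPT = {"A": "\u0041", "Adieresis": "\u00C4", "adieresis": "\u00E4"}


def apply_differences(base, differences):
    """Return a copy of `base` (code -> glyph name) patched by a Differences array.

    An integer starts a run at that code; each glyph name that follows is
    assigned to consecutive codes.
    """
    table = dict(base)
    code = None
    for item in differences:
        if isinstance(item, int):
            code = item
            continue
        if code is None or not 0 <= code <= 255:
            raise ValueError("glyph name without a valid single-byte code")
        table[code] = item
        code += 1
    return table


def decode(data, table):
    """Map each byte to its glyph name, then to Unicode via the AGL excerpt."""
    return "".join(AGL_EXCERPT[table[byte]] for byte in data)


# Stand-in base encoding with a single entry (assumption for illustration).
base_encoding = {0x41: "A"}
f1_encoding = apply_differences(base_encoding, [1, "Adieresis", "adieresis"])
text = decode(b"\x01\x02A", f1_encoding)
print(text)
```

Because each code is a single byte, such a table never holds more than 256 entries at a time, whichever AGL names are patched in.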

The key issue, which I cannot wrap my head around, is that via the /Differences array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/built-in/standard set of PDF fonts, wouldn't I?

Basically, what I am asking is whether it is correct that displaying text containing any character not among those 4,281 AGL characters is impossible without embedding those glyphs into the produced PDF.

I am also confused by the /ToUnicode feature in PDF, which allows associating the glyphs/CMaps of embedded fonts with the Unicode characters those glyphs should represent (so somebody did think about Unicode), yet I cannot find a way to use Unicode code points or a reasonably workable encoding (e.g. UTF-8) with the built-in fonts in PDF.

So is my assumption correct that, without going to the length of generating a font to embed within the PDF file, the text can only ever come from the set of those 4,281 characters?

To support all 65,536 code points of Unicode's Basic Multilingual Plane, it would be necessary to generate a font containing the glyphs used in the text, since apart from those 4,281 AGL glyphs there seems to be no way to refer to those Unicode characters, correct?

Motivation

It would be nice to have a way in PDF that is the equivalent of HTML5's <meta charset="utf-8">, allowing text to be encoded in one simple, compatible Unicode encoding without having to deal with complicated things such as CIDs, GIDs, PostScript glyph names, etc.

Even if you want to restrict yourself to non-embedded font programs, there still are numerous named font encodings for which you don't need the name of each glyph, in particular for CJK. — mkl
@mkl With regard to the number of glyphs contained in an encoding: don't StandardEncoding, MacRomanEncoding, WinAnsiEncoding and BaseEncoding use 1 byte per glyph, and hence reference even fewer than 4,281 characters? With respect to the question, those encodings seem at best to be modifiable to include any of those 4,281 characters. I will, however, research the CJK ones more closely. (If only they had used UTF-8 Unicode, like sensible people ;) — humanityANDpeace
You should not restrict yourself to simple fonts. Also look at Composite Fonts. Here the Encoding shall be "the name of a predefined CMap, or a stream containing a CMap that maps character codes to font numbers and CIDs. If the descendant is a Type 2 CIDFont whose associated TrueType font program is not embedded in the PDF file, the Encoding entry shall be a predefined CMap name (see 9.7.4.2, "Glyph Selection in CIDFonts")." And among the predefined CMaps there are numerous CJK ones. — mkl
Great question, but curious why covering the Unicode BMP using a non-embedded font is important for you? You need a single font that covers the entire BMP? Assuming this is not the case, why not just embed the font? Embedding the font ensures the PDF looks the same everywhere, regardless of device+OS+viewer. — Ryan
@Ryan With the average glyph taking 50-60 bytes of data and there being 2^16 glyphs in the BMP, a font covering all of them would add some 3 MB+ to each PDF, whereas I expected the 14 "standard" fonts to be a way to avoid that while more or less guaranteeing a consistent appearance. An actual font covering all BMP glyphs was 8 MB (though it might compress somewhat). — humanityANDpeace
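
The size estimate in the last comment can be checked with a few lines of Python. This is only the back-of-the-envelope arithmetic from that comment: the 50-60 bytes per glyph is the commenter's figure, and treating every BMP code point as one glyph is an upper bound, since not all of them are assigned characters. `estimated_font_bytes` is a hypothetical helper name.

```python
BMP_CODE_POINTS = 2 ** 16  # 65,536 code points in the Basic Multilingual Plane


def estimated_font_bytes(bytes_per_glyph, glyph_count=BMP_CODE_POINTS):
    """Rough outline-data size, assuming one glyph per code point."""
    return bytes_per_glyph * glyph_count


low = estimated_font_bytes(50)
high = estimated_font_bytes(60)
print(f"{low:,} to {high:,} bytes")
```

That is roughly 3.3 to 3.9 MB before compression, in line with the "3 MB+" quoted in the comment.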

Ryan · Accepted Answer · 2019-08-06T07:11:02

Without embedded fonts, is PDF limited to only 4281 characters (of AGL)?

No. Though you should embed fonts to help ensure that the PDF looks the same everywhere.

Basically, what I am asking is whether it is correct that displaying text containing any character not among those 4,281 AGL characters is impossible without embedding those glyphs into the produced PDF.

It is possible, yes, though ideally you would stick with a "standard" encoding, such as one of the Orderings. See "Predefined CMaps" in the PDF specification for these.
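
As a rough sketch of what using a predefined CMap looks like, the Python below builds a Type 0 font dictionary that names the predefined CMap UniJIS-UTF16-H and encodes a string as the UTF-16BE hex operand that this CMap expects. The helper names, the font name HeiseiMin-W3 and the object reference `7 0 R` are illustrative assumptions, and the descendant CIDFont dictionary with its font descriptor is not shown. Whether a viewer can render the result depends on it having a suitable font available.

```python
def type0_font_dict(base_font, cmap_name, descendant_ref):
    """Build the text of a composite (Type 0) font dictionary.

    `descendant_ref` is an indirect reference to the CIDFont dictionary,
    which must be written separately (not shown in this sketch).
    """
    return (
        f"<< /Type /Font /Subtype /Type0 /BaseFont /{base_font} "
        f"/Encoding /{cmap_name} /DescendantFonts [ {descendant_ref} ] >>"
    )


def utf16_hex_operand(text):
    """Encode text as a PDF hex string for a UTF-16BE predefined CMap."""
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


# Font name and object reference are placeholders for illustration only.
font_dict = type0_font_dict("HeiseiMin-W3", "UniJIS-UTF16-H", "7 0 R")
operand = utf16_hex_operand("\u65e5\u672c\u8a9e")  # the three characters of "Nihongo"
print(font_dict)
print(f"{operand} Tj")
```

With such a CMap the character codes in the content stream are plain UTF-16BE, which is about as close as PDF gets to the "one simple encoding" the question asks for, although each predefined CMap is tied to one character collection rather than to all of Unicode.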

If you start making changes to the encoding, such as using Differences, you make run-time font substitution much more difficult for the PDF processing program.

Regarding /ToUnicode: that is just for text extraction and has nothing to do with rendering. If you stick with a standard encoding as recommended above, it is not needed.
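
For completeness, here is a minimal sketch of the kind of data a /ToUnicode CMap carries: pairs of character code and UTF-16BE Unicode value in a bfchar section. It covers only that one section for single-byte codes, not a complete CMap stream (header and code space range are omitted), and `bfchar_section` is a hypothetical helper name.

```python
def bfchar_section(code_to_text):
    """Format single-byte code -> Unicode text pairs as a bfchar section."""
    lines = [f"{len(code_to_text)} beginbfchar"]
    for code, text in sorted(code_to_text.items()):
        # Source code as one hex byte, destination as UTF-16BE hex.
        unicode_hex = text.encode("utf-16-be").hex().upper()
        lines.append(f"<{code:02X}> <{unicode_hex}>")
    lines.append("endbfchar")
    return "\n".join(lines)


# Codes 1 and 2 as remapped by the /Differences array in the question.
section = bfchar_section({1: "\u00C4", 2: "\u00E4"})
print(section)
```

A viewer consults this mapping when text is copied or searched; it plays no part in choosing which glyph is drawn.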

