0
votes

Adobe Glyph List (AGL) is described as

is a mapping of 4,281 glyph names to one or more Unicode characters.

As far as I understand, these are PDF names such as /Adieresis that specify the respective Unicode character (here U+00C4). If my understanding is correct, those 4,281 names can be used to specify a mapping, as done here for the font named /F1 in a page's /Resources dictionary:

<<
  /Type /Page
  /Resources <<
    /Font <<
      /F1 <<
        /Type /Font
        /Subtype /Type1
        /BaseFont /Times-Roman
        /Encoding <<
          /Differences [ 1 /Adieresis /adieresis ]
        >>
      >>
    >>
  >>
>>

The key issue, which I cannot get to wrap my head around is that via the /Differences Array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/builtin/standard set of PDF fonts, wouldn't I?

Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?

I am also confused that PDF has a /ToUnicode feature, which associates the glyphs/CMaps of embedded fonts with the Unicode characters those glyphs should represent (hence there was some thinking about "unicode"), yet I cannot find a way to use any reasonable Unicode code points or a halfway working encoding (e.g. UTF-8) with the built-in fonts in PDF.

So is my assumption correct that, without going to the length of generating a font to embed within a PDF file, the text can only ever come from the set of those 4,281 characters?

In order to support all 65,536 code points of Unicode's Basic Multilingual Plane, it would be required to generate a font containing the glyphs used in the text, since apart from those 4,281 AGL glyphs there seems to be no way to refer to those Unicode characters, correct?

Motivation

It would be nice to have something in PDF equivalent to HTML5's <meta charset="utf-8">, allowing text to be encoded in one simple, compatible Unicode encoding without having to deal with complicated things such as CID/GID/PostScript glyph names etc.

3
Even if you want to restrict yourself to non-embedded font programs, there still are numerous named font encodings for which you don't need the name of each glyph, in particular for CJK. - mkl
@mkl with regard to the number of glyphs contained in the encoding: is it not the case that StandardEncoding, MacRomanEncoding, WinAnsiEncoding and BaseEncoding use 1 byte per glyph and hence reference even fewer than 4281 characters? Indeed those encodings, with respect to the question, seem at best to be modifiable to include any of those 4281 characters. I will however research the CJK ones better. (If only they had used UTF-8 Unicode, like sensible people ;) - humanityANDpeace
You should not restrict yourself to simple fonts. Also look at Composite Fonts. Here the Encoding shall be The name of a predefined CMap, or a stream containing a CMap that maps character codes to font numbers and CIDs. If the descendant is a Type 2 CIDFont whose associated TrueType font program is not embedded in the PDF file, the Encoding entry shall be a predefined CMap name (see 9.7.4.2, "Glyph Selection in CIDFonts"). And among the predefined CMap there are numerous CJK ones. - mkl
Great question, but curious why covering the Unicode BMP using a non-embedded font is important for you? You need a single font that covers the entire BMP? Assuming this is not the case, why not just embed the font? Embedding the font ensures the PDF looks the same everywhere, regardless of device+OS+viewer. - Ryan
@Ryan. with the average glyph being composed of 50-60 bytes of data and there being 2^16 glyphs in BMP a file to cover all would add some 3MB+ to each pdf, while I expected the 14 "standard" fonts to be a way to avoid that, while being sort of guaranteed a consistent appearance. Actual font covering all BMP glyphs was 8MB (but it might be able to compress it somewhat) - humanityANDpeace

3 Answers

1
votes

Without embedded fonts, is PDF limited to only 4281 characters (of AGL)?

No. Though you should embed fonts to help ensure that the PDF looks the same everywhere.

Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?

It is possible, yes, though you would ideally stick with a "standard" encoding, such as one of the Orderings. See the "Predefined CMaps" in the PDF specification for these.

If you start making changes to the encoding, such as using Differences, then you are making run time font substitution for the PDF processing program much more difficult.

Regarding /ToUnicode: that is just for text extraction and has nothing to do with rendering. If you stick with a standard encoding as recommended above, it is not needed.

1
votes

This answer first discusses the use of non-AGL names in differences arrays and the more encompassing encodings of composite fonts. Then it discusses which fonts a viewer actually does have to have available. Finally it considers all this in light of the clarifications accompanying your bounty offer.

AGL names and Differences arrays

First let's consider the focal point of your original question,

The key issue, which I cannot get to wrap my head around is that via the /Differences Array and the predefined AGL names I would only be able to use those 4,281 glyphs/characters from the base/builtin/standard set of PDF fonts, wouldn't I?

Basically what I am asking is whether it is correct that to display text containing any character not included in those 4,281 AGL characters, would be impossible without embedding those glyphs into the produced pdf?

i.e. your assumption is that only those 4,281 AGL glyph names can be used in the Differences array of the encoding entry of a simple font.

This is not the case; you can also use arbitrary names not found in the AGL. E.g. using this font

7 0 obj
<<
/Type /Font
/Subtype /TrueType
/BaseFont /Arial
/FirstChar 32
/LastChar 32
/Widths [500]
/FontDescriptor 8 0 R
/Encoding 9 0 R
>>
endobj
8 0 obj
<<
/Type /FontDescriptor
/FontName /Arial
/FontFamily (Arial)
/Flags 32
/FontBBox [-665.0 -325.0 2000.0 1040.0]
/ItalicAngle 0
/Ascent 1040
/Descent -325
/CapHeight 716
/StemV 88
/XHeight 519
>>
endobj
9 0 obj
<<
/Type /Encoding
/BaseEncoding /WinAnsiEncoding
/Differences [32 /uniAB55]
>>
endobj

the instruction

( ) Tj

shows you a ꭕ ('LATIN SMALL LETTER CHI WITH LOW LEFT SERIF', U+AB55, which if I saw correctly is not on the AGL) on a system with Arial (ArialMT.ttf) installed.

Thus, to display an arbitrary glyph, you merely need a font that is available to the PDF viewer in question, that you know contains that glyph, and in which the glyph has a name you know. The name doesn't have to be an AGL name, it can be arbitrary!
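To make the mechanics concrete, here is a minimal Python sketch (an illustration added here, not part of any PDF library) of how a /Differences array is read: a number sets the current character code, and each following name is assigned to consecutive codes. The function names expand_differences and uni_name_to_char are made up for this example. The second one handles only the simple four-digit "uniXXXX" form of glyph names and assumes the viewer actually honours that naming convention for the font in question.

```python
from typing import Dict, List, Optional, Union

HEX_DIGITS = "0123456789ABCDEF"


def expand_differences(differences: List[Union[int, str]]) -> Dict[int, str]:
    """Expand a /Differences array into a {character code: glyph name} map.

    An integer sets the current code; each following name is assigned to
    the current code, which then advances by one.
    """
    mapping: Dict[int, str] = {}
    code: Optional[int] = None
    for item in differences:
        if isinstance(item, int):
            code = item
        else:
            if code is None:
                raise ValueError("a /Differences array must start with a code")
            mapping[code] = item.lstrip("/")
            code += 1
    return mapping


def uni_name_to_char(glyph_name: str) -> Optional[str]:
    """Decode a simple 'uniXXXX' glyph name (four uppercase hex digits).

    Returns None for any name that does not have exactly this form,
    e.g. ordinary AGL names such as 'Adieresis'.
    """
    hex_part = glyph_name[3:]
    if (
        glyph_name.startswith("uni")
        and len(hex_part) == 4
        and all(c in HEX_DIGITS for c in hex_part)
    ):
        return chr(int(hex_part, 16))
    return None


if __name__ == "__main__":
    # The encoding from object 9 above: code 32 is remapped to /uniAB55.
    for code, name in expand_differences([32, "/uniAB55"]).items():
        print(code, name, uni_name_to_char(name))
```

With the array from the question, [1, "/Adieresis", "/adieresis"], the same function assigns Adieresis to code 1 and adieresis to code 2.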

Encodings of composite fonts

Furthermore, you often aren't even required to enumerate the characters you need as long as your required characters are in the same named encoding for composite fonts!

Here the Encoding shall be

The name of a predefined CMap, or a stream containing a CMap that maps character codes to font numbers and CIDs. If the descendant is a Type 2 CIDFont whose associated TrueType font program is not embedded in the PDF file, the Encoding entry shall be a predefined CMap name (see 9.7.4.2, "Glyph Selection in CIDFonts").

And among the predefined CMaps there are numerous CJK ones. As long as the viewer in question has access to a matching font, you can use a composite font with such an encoding to get access to a lot of CJK glyphs.
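As a rough sketch of what this means for a generator that starts from Unicode text: the predefined CMaps whose names end in -UTF16-H (for example UniGB-UTF16-H or UniCNS-UTF16-H) take UTF-16BE character codes, so the content stream string is simply the text re-encoded as UTF-16BE. The helper names below are hypothetical, and the sketch assumes a composite (Type0) font with such an /Encoding has already been selected with Tf and that the viewer can find a matching font.

```python
def to_utf16be_hex_string(text: str) -> str:
    """Return a PDF hexadecimal string (<...>) holding text as UTF-16BE,
    the code format used by the predefined Uni*-UTF16-H CMaps."""
    return "<" + text.encode("utf-16-be").hex().upper() + ">"


def show_text_operator(text: str) -> str:
    """Build a Tj instruction for text, assuming a composite font with a
    Uni*-UTF16-H encoding is the current font."""
    return to_utf16be_hex_string(text) + " Tj"


if __name__ == "__main__":
    print(show_text_operator("中文"))
```

A hexadecimal string is used here only because it avoids escaping parentheses and backslashes in a literal string; the input can come straight from decoded UTF-8 data.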

Which fonts does a viewer have to have available?

Thus, if the viewer in question has appropriate fonts available, you don't need to embed font programs to display any glyph. But which fonts does a viewer have available?

Usually a viewer will allow access to all fonts registered with the operating system it is running on, but strictly speaking it only has to have very few fonts accessible: PDF processors supporting PDF 1.0 to PDF 1.7 files only need to know the so-called standard 14 fonts, and pure PDF 2.0 processors need to know none.

Annex D of the specification clarifies the character ranges to support:

All characters listed in D.2, "Latin character set and encodings" shall be supported for the Times, Helvetica, and Courier font families, as listed in 9.6.2.2, "Standard Type 1 fonts (standard 14 fonts) (PDF 1.0-1.7)" by a PDF processor that supports PDF 1.0 to 1.7.

D.4, "Symbol set and encoding" and D.5, "ZapfDingbats set and encoding" describe the character sets and built-in encodings for the Symbol and ZapfDingbats (ITC Zapf Dingbats) font programs, which belong to the standard 14 predefined fonts.

D.2 essentially is a table describing the StandardEncoding, MacRomanEncoding, WinAnsiEncoding, and PDFDocEncoding. These all are very similar single byte encodings.

D.4 and D.5 contain a single table each describing additional single byte encodings.

Thus, all you can actually expect from a PDF 1.x viewer are these less than 1000 characters!

(You wondered about this in comments to this answer to another question of yours.)
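For a quick pre-flight check in a generator, one can approximate "is this text safe for the Latin standard 14 fonts?" by testing whether it fits a single-byte encoding. The sketch below uses Python's cp1252 codec as a stand-in for WinAnsiEncoding. That is an assumption: the two are closely related but not identical, and the codec also accepts control characters that have no glyph, so treat the result as a hint rather than a guarantee. The function name fits_winansi is made up for this example.

```python
def fits_winansi(text: str) -> bool:
    """Return True if every character of text can be encoded with
    Python's cp1252 codec (used here as an approximation of
    WinAnsiEncoding, not an exact match)."""
    try:
        text.encode("cp1252")
    except UnicodeEncodeError:
        return False
    return True


if __name__ == "__main__":
    for sample in ("Ärger", "\uab55", "中文"):
        print(repr(sample), fits_winansi(sample))
```

Text that fails this check needs one of the other routes discussed above: a customized encoding, a composite font, or an embedded font program.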

Concerning your clarifications

In your text accompanying your bounty offer you expressed a desire for

being enabled to create a "no frills" program that is able to generate pdf files, where the input data are UTF-8 unicode strings. "No frills" being a reference to the fact that such a software would ideally be able to skip handling font program data (such as creating a subset font program for inclusion into the pdf).

As explained above, you can do so, either by customized encodings of a number of simple fonts or by the more encompassing named encodings of composite fonts. If you know that the target PDF viewer has these fonts available, that is!

sketch a way that actually would allow to have characters from at least the Adobe-GB1 charset as referenced via "UniCNS-UTF16-H" to be rendered in pdf-viewers, while the pdf file not having any font program embedded for that achieving this.

"UniCNS-UTF16-H" just happens to be one of the predefined encodings allowable for composite fonts. Thus, you can use a composite font with this encoding without embedding the font program as long as the viewer has the appropriate fonts accessible. As far as Adobe Reader is concerned, this usually amounts to having the Extended Asian Language Pack installed.

the limitations to use anything else the WinAnsiEncoding, MacRomanEncoding, MacExpertEncoding with those 14 standard fonts.

As explained above you can merely count on less than 1000 glyphs being available for sure in an arbitrary PDF 1.x viewer. In a pure PDF 2.0 viewer you actually cannot count on even that!


The specification quotes above are from ISO 32000-2; similar requirements can already be found in ISO 32000-1.

0
votes

There is no 4,281-glyph limit inherent in PDF. I think you are a bit confused: you don't have to embed fonts in a PDF. Besides the Standard 14 fonts, which all PDF viewers should be able to handle, PDF software is going to look for fonts installed on the system when they are not embedded, so it's not as if you lose the ability to display glyphs at all when you have no embedded fonts.

You would define a different encoding with the Differences array if the base encoding doesn't reflect what is in the font.

ToUnicode comes into play for text extraction vs text showing.