Handle many Unicode characters with PDFBox

Question

I am writing a Java function which takes a String as a parameter and produces a PDF as output with PDFBox.

Everything works fine as long as I use Latin characters. However, I don't know in advance what the input will be, and it might contain English as well as Chinese or Japanese characters.

In the case of non-Latin characters, here is the error I get:

Exception in thread "main" java.lang.IllegalArgumentException: U+3053 ('kohiragana') is not available in this font Helvetica encoding: WinAnsiEncoding
at org.apache.pdfbox.pdmodel.font.PDType1Font.encode(PDType1Font.java:426)
at org.apache.pdfbox.pdmodel.font.PDFont.encode(PDFont.java:324)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showTextInternal(PDPageContentStream.java:509)
at org.apache.pdfbox.pdmodel.PDPageContentStream.showText(PDPageContentStream.java:471)
at com.mylib.pdf.PDFBuilder.generatePdfFromString(PDFBuilder.java:122)
at com.mylib.pdf.PDFBuilder.main(PDFBuilder.java:111)

If I understand correctly, I have to use a specific font for Japanese, another one for Chinese and so on, because the one that I am using (Helvetica) doesn't cover all the required Unicode characters.

I could also use a font that handles all these Unicode characters, such as Arial Unicode. However, this font is under a specific license, so I cannot use it, and I haven't found another one.

I found some projects that aim to overcome this issue, like the Google Noto project. However, this project provides multiple font files, so I would have to choose, at runtime, the correct file to load depending on the input I have.

So I am facing two options, one of which I don't know how to implement properly:

1. Keep searching for a font that handles almost every Unicode character (where is this grail I am desperately seeking?!)
2. Try to detect which language is used and select a font depending on it. Although I don't (yet) know how to do that, I don't find it to be a clean implementation, since the mapping between the input and the font file would have to be hardcoded for every possible case.

Is there another solution?
Am I completely off track?

Thanks in advance for your help and guidance!

Here is the code I use to generate the PDF:

public static void main(String[] args) throws IOException {
    String latinText = "This is latin text";
    String japaneseText = "これは日本語です";

    // This works fine
    generatePdfFromString(latinText);

    // This throws an IllegalArgumentException
    generatePdfFromString(japaneseText);
}

private static OutputStream generatePdfFromString(String content) throws IOException {
    PDPage page = new PDPage();

    try (PDDocument doc = new PDDocument();
         PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
        doc.addPage(page);
        contentStream.setFont(PDType1Font.HELVETICA, 12);

        // Or load a specific font from a file
        // contentStream.setFont(PDType0Font.load(doc, new File("/fontPath.ttf")), 12);

        contentStream.beginText();
        contentStream.showText(content);
        contentStream.endText();
        contentStream.close();
        OutputStream os = new ByteArrayOutputStream();
        doc.save(os);
        return os;
    }
}

mkl · Accepted Answer · 2018-07-30T14:37:31

A better solution than waiting for a font or guessing a text's language is to have a multitude of fonts and to select the correct font on a glyph-by-glyph basis.

You already found the Google Noto Fonts, which are a good base collection of fonts for this task.

Unfortunately, though, Google publishes the Noto CJK fonts only as OpenType fonts (.otf), not as TrueType fonts (.ttf), a policy that isn't likely to change; cf. Noto fonts issue 249 and others. PDFBox, on the other hand, does not support OpenType fonts and isn't actively working on OpenType support either; cf. PDFBOX-2482.

Thus, one has to convert the OpenType font to TrueType somehow. I simply took the file shared by djmilch in his blog post FREE FONT NOTO SANS CJK IN TTF.

Font selection per character

So you essentially need a method which checks your text character by character and dissects it into chunks, each of which can be drawn with a single font.
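As a minimal sketch of the "character by character" part, assuming nothing beyond the JDK: iterating by Unicode code point rather than by char keeps surrogate pairs (characters outside the Basic Multilingual Plane) together, so a font is never asked about half a character. The class name CodePointSplitter is made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox.
final class CodePointSplitter {

    /** Returns one String per Unicode code point; surrogate pairs stay in one piece. */
    static List<String> split(String text) {
        List<String> result = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            // 1 for BMP characters, 2 for supplementary characters
            int charCount = Character.charCount(codePoint);
            result.add(text.substring(i, i + charCount));
            i += charCount;
        }
        return result;
    }
}
```

Each returned piece is the smallest unit it makes sense to test against a font.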

Unfortunately, I don't see a better way to ask a PDFBox PDFont whether it knows a glyph for a given character than to actually try to encode the character and treat an IllegalArgumentException as a "no".
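The "try to encode and treat an exception as a no" idea can be isolated into a small probe. The sketch below avoids PDFBox on purpose so that it runs on its own: the Encoder interface and the EncodeProbe class are hypothetical stand-ins, not PDFBox types, and Encoder merely mirrors the shape of PDFont.encode(String).

```java
import java.io.IOException;

// Hypothetical helper, not part of PDFBox.
final class EncodeProbe {

    /** Stand-in for a font: encode() is expected to throw if the text has no glyph. */
    interface Encoder {
        byte[] encode(String text) throws IOException;
    }

    /** Returns true if encoding succeeds, false if the encoder rejects the text. */
    static boolean canEncode(Encoder encoder, String text) {
        try {
            encoder.encode(text);
            return true;
        } catch (IllegalArgumentException | IOException e) {
            // The encoder has no representation for this text.
            return false;
        }
    }
}
```

With PDFBox on the classpath, a call such as EncodeProbe.canEncode(font::encode, "こ") should fit this shape, but treat that as an assumption to check against your PDFBox version.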

I, therefore, implemented that functionality using the following helper class TextWithFont and method fontify:

class TextWithFont {
    final String text;
    final PDFont font;

    TextWithFont(String text, PDFont font) {
        this.text = text;
        this.font = font;
    }

    public void show(PDPageContentStream canvas, float fontSize) throws IOException {
        canvas.setFont(font, fontSize);
        canvas.showText(text);
    }
}

(AddTextWithDynamicFonts inner class)

List<TextWithFont> fontify(List<PDFont> fonts, String text) throws IOException {
    List<TextWithFont> result = new ArrayList<>();
    if (text.length() > 0) {
        PDFont currentFont = null;
        int start = 0;
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int codeChars = Character.charCount(codePoint);
            String codePointString = text.substring(i, i + codeChars);
            boolean canEncode = false;
            for (PDFont font : fonts) {
                try {
                    font.encode(codePointString);
                    canEncode = true;
                    if (font != currentFont) {
                        if (currentFont != null) {
                            result.add(new TextWithFont(text.substring(start, i), currentFont));
                        }
                        currentFont = font;
                        start = i;
                    }
                    break;
                } catch (Exception e) {
                    // This font cannot encode the code point; try the next font.
                }
            }
            if (!canEncode) {
                throw new IOException("Cannot encode '" + codePointString + "'.");
            }
            i += codeChars;
        }
        result.add(new TextWithFont(text.substring(start, text.length()), currentFont));
    }
    return result;
}

(AddTextWithDynamicFonts method)

Example use

Using the method and the class above like this

String latinText = "This is latin text";
String japaneseText = "これは日本語です";
String mixedText = "Tこhれiはs日 本i語sで すlatin text";

// Files.write opens and closes the file itself, so no stream is left open.
Files.write(Paths.get("Cccompany-Latin-Improved.pdf"), generatePdfFromStringImproved(latinText).toByteArray());
Files.write(Paths.get("Cccompany-Japanese-Improved.pdf"), generatePdfFromStringImproved(japaneseText).toByteArray());
Files.write(Paths.get("Cccompany-Mixed-Improved.pdf"), generatePdfFromStringImproved(mixedText).toByteArray());

(AddTextWithDynamicFonts test testAddLikeCccompanyImproved)

ByteArrayOutputStream generatePdfFromStringImproved(String content) throws IOException {
    try (   PDDocument doc = new PDDocument();
            InputStream notoSansRegularResource = AddTextWithDynamicFonts.class.getResourceAsStream("NotoSans-Regular.ttf");
            InputStream notoSansCjkRegularResource = AddTextWithDynamicFonts.class.getResourceAsStream("NotoSansCJKtc-Regular.ttf")   ) {
        PDType0Font notoSansRegular = PDType0Font.load(doc, notoSansRegularResource);
        PDType0Font notoSansCjkRegular = PDType0Font.load(doc, notoSansCjkRegularResource);
        List<PDFont> fonts = Arrays.asList(notoSansRegular, notoSansCjkRegular);

        List<TextWithFont> fontifiedContent = fontify(fonts, content);

        PDPage page = new PDPage();
        doc.addPage(page);
        try (   PDPageContentStream contentStream = new PDPageContentStream(doc, page)) {
            contentStream.beginText();
            for (TextWithFont textWithFont : fontifiedContent) {
                textWithFont.show(contentStream, 12);
            }
            contentStream.endText();
        }
        ByteArrayOutputStream os = new ByteArrayOutputStream();
        doc.save(os);
        return os;
    }
}

(AddTextWithDynamicFonts helper method)

I get

[screenshot] for latinText = "This is latin text"
[screenshot] for japaneseText = "これは日本語です"
and [screenshot] for mixedText = "Tこhれiはs日本i語sですlatin text"

Some asides

I retrieved the fonts as Java resources, but you can use any kind of InputStream for them.
The font selection mechanism above can quite easily be combined with the line breaking mechanism shown in this answer and the justification extension thereof in this answer.
