nroff/groff does not properly convert utf-8 encoded file

Question

I have a UTF-8 encoded roff file that I want to convert to a man page with

$ nroff -mandoc inittab.5

However, characters such as [äöüÄÖÜ] are not displayed properly; it seems that nroff assumes ISO 8859-1 encoding (I get [Ã¤Ã¶Ã¼ÃÃÃ] instead). Calling nroff with the -Tutf8 flag does not change the behaviour, and the locale environment variables are (I assume properly) set to

LANG=de_DE.utf8
LC_CTYPE="de_DE.utf8"
LC_NUMERIC="de_DE.utf8"
LC_TIME="de_DE.utf8"
LC_COLLATE="de_DE.utf8"
LC_MONETARY="de_DE.utf8"
LC_MESSAGES="de_DE.utf8"
LC_PAPER="de_DE.utf8"
LC_NAME="de_DE.utf8"
LC_ADDRESS="de_DE.utf8"
LC_TELEPHONE="de_DE.utf8"
LC_MEASUREMENT="de_DE.utf8"
LC_IDENTIFICATION="de_DE.utf8"
LC_ALL=

Since nroff is only a wrapper script that eventually calls groff, I checked the call to the latter, which is:

$ groff -Tutf8 -mandoc inittab.5

Comparing the byte encodings of the characters in the source file and the output file, I get the following conversions:

character  src file  output file
---------  --------  -----------
ä          C3 A4     C3 83 C2 A4
ö          C3 B6     C3 83 C2 B6
ü          C3 BC     C3 83 C2 BC
Ä          C3 84     C3 83
Ö          C3 96     C3 83
Ü          C3 9C     C3 83
ß          C3 9F     C3 83

This behaviour seems very weird to me: why do I get an additional C3 83, and why is the original byte sequence truncated altogether for the capital umlauts and ß?

Why is this and how can I make nroff/groff properly convert my utf-8 encoded file?

EDIT: I am using GNU nroff (groff) version 1.22.2

When you run say less inittab.5 do you see proper characters? By the way the question is off topic for this site, you may have better luck at unix/linux stackexchange. — n. 1.8e9-where's-my-share m.
Evidently nroff thinks its input is Latin-1 and tries to transcode it to UTF-8. Try running with -Tlatin1 to avoid transcoding. — n. 1.8e9-where's-my-share m.
It looks like groff doesn't support UTF-8 input at all. gnu.org/software/groff/manual/html_node/Input-Encodings.html — n. 1.8e9-where's-my-share m.
Ok, that makes sense. How come most of my Gentoo programs come with UTF-8 encoded man pages then? I could convert them to Latin-1, but that would omit other characters. Are you aware of an nroff alternative that supports UTF-8 input? — Simon Fromme

ToasterKing · Accepted Answer · 2018-12-05T06:24:37

Unlike some other troff implementations (namely Plan 9 and Heirloom troff), groff does not read UTF-8 input directly. However, correct UTF-8 output can be achieved using the preconv(1) preprocessor, which converts the UTF-8 characters in a file to groff's native escape sequences.

Take for example this groff_ms(7) document:

.TL
StackOverflow Test Document
.AU
ToasterKing
.PP
I like going to the café down the street

äöüÄÖÜ

Using groff normally, we get:

                StackOverflow Test Document


                        ToasterKing


     I like going to the cafÃ© down the street

Ã¤Ã¶Ã¼ÃÃÃ
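
The garbling above matches what happens when UTF-8 bytes are read as Latin-1 characters and then written out as UTF-8 again. The following Python sketch reproduces the byte table from the question. It is not groff's actual code: the helper names misread_as_latin1 and hexbytes and the drop_c1 switch are made up for illustration. It also assumes that groff discards input bytes in the range 0x80-0x9F as invalid, which would explain why the capital umlauts and ß lose their second byte.

```python
def misread_as_latin1(text, drop_c1=True):
    """Encode text as UTF-8, misread each byte as a Latin-1 character,
    and re-encode the result as UTF-8.

    Hypothetical helper that imitates the observed behaviour; not groff code.
    """
    raw = text.encode("utf-8")
    if drop_c1:
        # Assumption: bytes 0x80-0x9F are discarded as invalid input.
        raw = bytes(b for b in raw if not 0x80 <= b <= 0x9F)
    # Every byte maps to exactly one Latin-1 character, so this never fails.
    return raw.decode("latin-1").encode("utf-8")


def hexbytes(data):
    """Format a bytes object as space-separated uppercase hex pairs."""
    return " ".join(f"{b:02X}" for b in data)


if __name__ == "__main__":
    for ch in "äöüÄÖÜß":
        src = hexbytes(ch.encode("utf-8"))
        out = hexbytes(misread_as_latin1(ch))
        print(ch, src, "->", out)
```

With drop_c1=False, the capital umlauts would come out as four bytes (for example C3 83 C2 84 for Ä), so the truncation in the question appears to come from the discarded byte rather than from the transcoding itself.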

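But when the file is run through preconv first (preconv so.ms | groff -Tutf8 -ms) or groff is called with -k, which runs preconv automatically (groff -k -Tutf8 -ms so.ms), we get: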

                StackOverflow Test Document


                        ToasterKing


     I like going to the café down the street

äöüÄÖÜ

Looking at the output of preconv, you can see how it transforms characters into escape sequences:

.lf 1 so.ms
.TL
StackOverflow Test Document
.AU
ToasterKing
.PP
I like going to the caf\[u00E9] down the street

\[u00E4]\[u00F6]\[u00FC]\[u00C4]\[u00D6]\[u00DC]
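
To make the transformation concrete, here is a rough Python imitation of the escaping step. It is only a sketch of the idea, not a replacement for preconv: the function name to_groff_escapes is invented for this example, and the real tool also detects the input encoding and emits the .lf line shown above.

```python
def to_groff_escapes(text):
    """Replace every non-ASCII character with a \\[uXXXX] escape.

    Rough imitation of the escapes seen in preconv's output; not the real tool.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            # Plain ASCII passes through unchanged.
            out.append(ch)
        else:
            # At least four uppercase hex digits, as in the output above.
            out.append("\\[u%04X]" % cp)
    return "".join(out)


if __name__ == "__main__":
    print(to_groff_escapes("I like going to the café down the street"))
    print(to_groff_escapes("äöüÄÖÜ"))
```

Because the result is plain ASCII, groff's Latin-1 input assumption no longer matters, and the -Tutf8 output device can render each escape as the intended character.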
