3
votes

I'll be implementing this for Node.js (server-side JavaScript), but this question is about the general approach to solving this problem.

There are many platforms that support translation for international applications.

For example, Zend's Translation Adapter works like this:

printf($translate->_("Today is the %1\$s") . "\n", date("d.m.Y"));

Android's system uses a strings.xml file for every language and works with the same concept as Zend's.

These work for most Western languages. However, many non-Western languages require a different word order, and some are read right-to-left instead of left-to-right.

Thus, the specified order defined in the above translate call may be invalid for a "foreign" language.

This brings me to my question: how does one design a translation system/adapter that is appropriate for any language?


2 Answers

5
votes

It is actually very hard to answer this question directly. There are a lot of use cases here. If I were to design such a system, I would keep these things in mind:

1. A sentence might need to be reordered after translation (you already brought this up). That is why we use numbered placeholders like {1}, {2} and some means of Formatting the Message.
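
A minimal sketch of the numbered placeholders described above. `formatIndexed` is a hypothetical helper written for this illustration (not part of Node.js or any library), and the two templates are made up; the second one only swaps the placeholder order the way a translation might.

```javascript
// Hypothetical helper: replaces {1}, {2}, ... with the matching argument (1-based).
function formatIndexed(template, ...args) {
  return template.replace(/\{(\d+)\}/g, (match, index) => {
    const value = args[Number(index) - 1];
    // Leave unknown placeholders untouched so a missing argument is easy to spot.
    return value === undefined ? match : String(value);
  });
}

// Illustrative templates: same arguments, different placeholder order.
const sourceTemplate = '{1} sent a message to {2}';
const reorderedTemplate = 'A message for {2} was sent by {1}';

console.log(formatIndexed(sourceTemplate, 'Anna', 'Paweł'));
console.log(formatIndexed(reorderedTemplate, 'Anna', 'Paweł'));
```

The calling code passes the arguments in one fixed order; only the template decides where each one ends up.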

2. Quite a few languages have more than one plural form. That is, if a message contains a number, it is translated differently depending on the quantity. For example:

English: 1 virus has been found | 2 viruses have been found | 5 viruses have been found

Polish: Znaleziono 1 wirusa | Znaleziono 2 wirusy | Znaleziono 5 wirusów

That is not easy to handle, but I really like the way GetText does this (there is some expression which will decide what form to use, as well as support for multiple forms).
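
To make the GetText idea concrete, here is a rough sketch: each language supplies an expression that maps the number n to the index of the plural form to use. The catalog layout, the `{n}` placeholder and the `translatePlural` name are assumptions for this example; the Polish expression follows the rule commonly used in GetText Plural-Forms headers.

```javascript
// One rule per language: n -> index of the plural form to use.
const pluralRules = {
  en: (n) => (n !== 1 ? 1 : 0),
  pl: (n) =>
    n === 1
      ? 0
      : n % 10 >= 2 && n % 10 <= 4 && (n % 100 < 10 || n % 100 >= 20)
        ? 1
        : 2,
};

// Hypothetical catalog: every entry lists all plural forms of one message.
const pluralCatalog = {
  en: { virusFound: ['{n} virus has been found', '{n} viruses have been found'] },
  pl: { virusFound: ['Znaleziono {n} wirusa', 'Znaleziono {n} wirusy', 'Znaleziono {n} wirusów'] },
};

function translatePlural(locale, key, n) {
  const forms = pluralCatalog[locale][key];
  const index = pluralRules[locale](n);
  return forms[index].replace('{n}', String(n));
}

for (const n of [1, 2, 5, 22]) {
  console.log(translatePlural('en', 'virusFound', n), '|', translatePlural('pl', 'virusFound', n));
}
```

Note that the Polish rule is not a simple "one / few / many by size" split: 22 takes the same form as 2, while 12 takes the same form as 5.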

3. Users of such a library might want named placeholders (see previous questions in the I18n tags), like "This is a message for ${name} in ${location}", and use them for example like this:

var formatted = 'This is a message for ${name} in ${location}'.format('location=Warsaw', 'name=Paweł');

While this poses some i18n issues, I am pretty sure it could be done in JavaScript, although the way you pass named parameters (aka arguments) might need to be different.
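
One way the named parameters could be passed in JavaScript is as a plain object instead of 'key=value' strings. This is only a sketch; `formatNamed` is a hypothetical helper, not an existing String method.

```javascript
// Hypothetical helper: replaces ${name} with params.name.
// The templates are ordinary single-quoted strings, not template literals,
// so ${...} is not interpolated by the language itself.
function formatNamed(template, params) {
  return template.replace(/\$\{(\w+)\}/g, (match, name) =>
    Object.prototype.hasOwnProperty.call(params, name) ? String(params[name]) : match
  );
}

const formatted = formatNamed('This is a message for ${name} in ${location}', {
  location: 'Warsaw',
  name: 'Paweł',
});

console.log(formatted);
```

Because the values are looked up by name, the order of the keys in the object does not matter, and a translator can move the placeholders freely.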

4. Java tends to format Numbers as well as Dates for a specific locale in the MessageFormat.format() method. This is not ideal behavior, and it poses a few problems, especially in JavaScript. The first thing you need to know is the current user's locale. Once you know it, is the rest easy? Well, no. There are quite a few possible date formats: Java enumerates them as full, long, medium, short and default. Unfortunately, there is no distinction during formatting; AFAIR short would always be used. Of course, one could pass a custom pattern to the placeholder, something like this (AFAIR): {0,date,yyyy-MM-dd}. This poses another issue: the translators would always have to provide the pattern, which is error-prone. Instead, I would format with the default pattern (if no additional info is given) and allow passing pattern names: {0,date,long}.

For numbers, it could be anything: a currency, a percentage or a simple numeric value. You would also need to support that distinction; some examples: {0,currency,symbol:$,long}, {0,percentage}, {0,number,long}. It is not easy to guess what I mean by long, but for large numbers you might want to use grouping separators (1,000,000.00$); let's call that the long format. At other times you would like to print a number like this: 1234. Not an easy task.
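
A rough sketch of typed placeholders with named styles, built on the standard `Intl` API that ships with Node.js. Everything else is an assumption made for the example: the `formatValue` and `formatMessage` names, the choice of medium as the default date style, and a simplified currency syntax `{0,currency,USD}` (an ISO 4217 code) in place of the `symbol:$` form above. The `dateStyle` option needs a reasonably recent Node.js version.

```javascript
// Named styles map to Intl options, so translators write {0,date,long}
// instead of a raw pattern such as yyyy-MM-dd.
const DATE_STYLES = ['full', 'long', 'medium', 'short'];

function formatValue(locale, value, type, style) {
  switch (type) {
    case 'date':
      return new Intl.DateTimeFormat(locale, {
        dateStyle: DATE_STYLES.includes(style) ? style : 'medium', // assumed default
        timeZone: 'UTC', // fixed only to keep the example deterministic
      }).format(value);
    case 'number':
      // "long" is the name used above for output with grouping separators.
      return new Intl.NumberFormat(locale, { useGrouping: style === 'long' }).format(value);
    case 'percentage':
      return new Intl.NumberFormat(locale, { style: 'percent' }).format(value);
    case 'currency':
      // Simplification: the style slot carries the currency code.
      return new Intl.NumberFormat(locale, { style: 'currency', currency: style }).format(value);
    default:
      return String(value);
  }
}

// Placeholders are 0-based here, as in the {0,date,long} examples above.
function formatMessage(locale, template, ...args) {
  return template.replace(/\{(\d+)(?:,(\w+))?(?:,(\w+))?\}/g, (match, index, type, style) => {
    const i = Number(index);
    return i < args.length ? formatValue(locale, args[i], type, style) : match;
  });
}

console.log(formatMessage('en-US', 'Scanned {0,number,long} files on {1,date,long}',
  1234567, new Date(Date.UTC(2020, 0, 2))));
console.log(formatMessage('en-US', '{0,percentage} done, cost {1,currency,USD}', 0.25, 1234.5));
```

The translator only picks a type and a style name; the actual pattern comes from the locale data, which avoids the error-prone hand-written patterns mentioned above.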

5. .NET has the concepts of User Interface Culture (CurrentUICulture) and Formatting Culture (CurrentCulture). The first is used to determine the appropriate language for User Interface messages, whereas the second is used for formatting (numbers, dates, currencies, etc.).
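
The same split could be modeled in JavaScript by carrying two locale settings instead of one. The sketch below assumes a hypothetical `context` object and a tiny made-up message catalog; it is not how .NET implements the feature, only an illustration of keeping the two concerns apart.

```javascript
// Hypothetical context object keeping the two settings separate.
const context = {
  uiLocale: 'pl',        // selects the message catalog (UI language)
  formatLocale: 'en-US', // selects number, date and currency formatting
};

// Made-up catalog for illustration.
const messages = {
  en: { total: 'Total: {amount}' },
  pl: { total: 'Razem: {amount}' },
};

function renderTotal(ctx, amount) {
  const template = messages[ctx.uiLocale].total;
  const formattedAmount = new Intl.NumberFormat(ctx.formatLocale, {
    minimumFractionDigits: 2,
  }).format(amount);
  return template.replace('{amount}', formattedAmount);
}

console.log(renderTotal(context, 1234.5));
```

Here the label comes from the Polish catalog while the number keeps en-US grouping and decimal separators, which is the kind of combination a single "locale" setting cannot express.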

6. Different languages tend to use different collation orders; heck, even the same language could use two (or more) different ones. I am not sure if it fits the scope, but it is at least good to be aware of.
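
As a small illustration, the standard `Intl.Collator` lets the same list be sorted under different locale rules. `sortForLocale` is a hypothetical helper. The actual order depends on the locale data bundled with the Node.js build, so no output is asserted here; with full ICU data, Swedish would typically place "ä" after "z", and the `co-phonebk` extension requests German phonebook ordering, a second collation for the same language.

```javascript
// Hypothetical helper: returns a sorted copy using the locale's collation rules.
function sortForLocale(words, locale) {
  const collator = new Intl.Collator(locale);
  return [...words].sort(collator.compare);
}

const letters = ['z', 'ä', 'a'];
const names = ['Muller', 'Müller', 'Mueller'];

console.log('de:', sortForLocale(letters, 'de'));
console.log('sv:', sortForLocale(letters, 'sv'));
console.log('de (dictionary):', sortForLocale(names, 'de'));
console.log('de (phonebook): ', sortForLocale(names, 'de-u-co-phonebk'));
```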

7. Support for different character encodings might be required (and probably will be). However, you might want to limit the encoding of resource files to, say, UTF-8. It won't cover all possible characters (see GB18030 for example), but it is close.

... ?

Well, I am sure I forgot something major, as the task you are approaching is monumental. And I don't know much about Node.js (as in what is currently supported).

Edit

8. Of course I forgot to mention that as software evolves, only a few User Interface messages change, so there is a need to merge in the old translations (this is called Leveraging in L10n terms). Usually some kind of Translation Memory (TM) software is used (for example POEdit, the GetText file format editor, has such features built in). TM software usually supports only certain file formats, so it would be a good idea to stick with an existing format rather than creating your own. This could mean dropping some features off the list...
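
To illustrate what leveraging means at its simplest, here is a sketch of a merge step. It assumes translations are stored as a plain key-to-text object, and `mergeCatalog` is a hypothetical function; real TM tools do much more (fuzzy matching, for instance).

```javascript
// Hypothetical merge step: carry existing translations over to the new list of keys.
function mergeCatalog(oldTranslations, newKeys) {
  const merged = {};
  const untranslated = [];
  for (const key of newKeys) {
    if (Object.prototype.hasOwnProperty.call(oldTranslations, key)) {
      merged[key] = oldTranslations[key]; // leveraged from the previous version
    } else {
      merged[key] = ''; // new message, still to be translated
      untranslated.push(key);
    }
  }
  // Keys that existed before but are no longer used by the software.
  const obsolete = Object.keys(oldTranslations).filter((key) => !newKeys.includes(key));
  return { merged, untranslated, obsolete };
}

const previous = { 'scan.start': 'Rozpocznij skanowanie', 'scan.stop': 'Zatrzymaj' };
const result = mergeCatalog(previous, ['scan.start', 'scan.pause']);
console.log(result);
```

Only `scan.pause` needs a translator's attention; `scan.start` is reused as-is and `scan.stop` is reported as obsolete.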

3
votes

Your design should allow for...

Reordering of parameters

As you have identified, translators may need to reorder parameters to suit different grammars. So, whatever system you use, you need to either make the parameters named or give them an index.

Formatters

I guess you could leave these to the developer to transform before substituting them, but somewhere people are going to want to do locale-sensitive formatting of numbers, currencies, dates and times. You may want to stretch that to pluralization, but that's a can of worms you may not want to open.

Unique keys

The lookup keys need to be unique. Using the untranslated string as a key is risky as translations of identical source strings may differ depending on their context.
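
A minimal sketch of context-qualified keys: the same source string "Open" gets two entries because it is a verb in one place and an adjective in another. The context names, the separator character and the `translate` function are assumptions made up for this example.

```javascript
// A control character is used as separator because it is unlikely to occur in UI text.
const SEPARATOR = '\u0004';

function makeKey(context, text) {
  return context + SEPARATOR + text;
}

// Hypothetical Polish catalog keyed by context + source text.
const catalogPl = new Map([
  [makeKey('menu.action', 'Open'), 'Otwórz'],  // verb: open the file
  [makeKey('door.status', 'Open'), 'Otwarte'], // adjective: the door is open
]);

function translate(context, text) {
  const key = makeKey(context, text);
  // Fall back to the source text when no translation exists.
  return catalogPl.has(key) ? catalogPl.get(key) : text;
}

console.log(translate('menu.action', 'Open'));
console.log(translate('door.status', 'Open'));
```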

Tools

Letting translators loose with "plain text files" is likely to cause trouble. You'll ideally want some mechanism to handle encodings, add translation comments from specialists, recover translations between versions and validate resultant strings to ensure the substitution parameters match the source strings.
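
The last check, validating that substitution parameters match, is simple to sketch. The example assumes `{name}` or `{1}` style placeholders; `placeholdersOf` and `validateTranslation` are hypothetical names.

```javascript
// Collects the placeholders used in a string, e.g. ['{1}', '{2}'].
function placeholdersOf(text) {
  return (text.match(/\{\w+\}/g) || []).sort();
}

// Compares the placeholders of a translation with those of its source string.
function validateTranslation(source, translation) {
  const expected = placeholdersOf(source);
  const actual = placeholdersOf(translation);
  const missing = expected.filter((p) => !actual.includes(p));
  const unexpected = actual.filter((p) => !expected.includes(p));
  return { ok: missing.length === 0 && unexpected.length === 0, missing, unexpected };
}

// Reordering is fine; a mistyped index is reported.
console.log(validateTranslation('{1} sent a message to {2}', 'Wiadomość dla {2} od {1}'));
console.log(validateTranslation('{1} sent a message to {2}', 'Wiadomość dla {3} od {1}'));
```

A check like this can run when translations are imported, so a broken placeholder is caught before it reaches users.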


I'd look at the ICU, .NET and Java APIs for inspiration.