1
votes

Is there a more or less standard way to transliterate the Polish alphabet using only the original ASCII (US-ASCII) characters?

This question can be broken into two related, more precise questions:

  1. How can the 32 letters of the Polish alphabet be transliterated with only the 26 letters of the basic Latin alphabet, while staying as understandable as possible to a Polish reader?
  2. Is there a reversible way to transliterate any Polish text with US-ASCII characters?

I can see that most Polish websites just remove the diacritics in their URLs. For example:

Świętosław Milczący    →  Swietoslaw Milczacy
Dzierżykraj Łaźniński  →  Dzierzykraj Lazninski
Józef Soćko            →  Jozef Socko

This is hardly reversible, but is it the most readable transliteration for Polish readers?

In some other cases, a more complicated ad hoc transliteration might be used, like Wałęsa → Wawensa. Are there any standard rules for this latter kind of transformation?

P.S. Just to clarify, I'm interested in transliteration rules (like ł → w, ę → en), not the implementation. Something like this table.

2
Can you add apostrophes, e.g. S'wie'tosl'aw? - i486
@i486, for the second question you can. - Andriy Makukha
I'm voting to close this question as off-topic because it is not about programming. - High Performance Mark
@HighPerformanceMark, I consider this to be an algorithms question. My goal is to transliterate Polish letters for a website's URIs so that a URI can be converted back into proper Polish. - Andriy Makukha

2 Answers

1
votes

Ad. 1. The Polish alphabet consists of just two groups of letters: plain Latin letters and Latin letters with diacritics. Therefore the only transliteration in common use is to remove the diacritics from the latter group, for example:

ą --> a
ć --> c
ż --> z
ź --> z
...

This is the most readable transliteration.

Ad. 2. Definitely no.
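
A minimal sketch of both points, in case it helps: the stripping from Ad. 1 written as a translation table, and a small check showing why that stripping alone cannot be undone. The names `STRIP` and `strip_diacritics` and the sample word list are my own illustration, and the table assumes the input uses precomposed characters (Unicode NFC).

```python
from collections import defaultdict

# Translation table for the nine Polish letters with diacritics, both cases.
# Assumption: the text uses precomposed characters (NFC), one code point each.
STRIP = str.maketrans("ąćęłńóśźżĄĆĘŁŃÓŚŹŻ", "acelnoszzACELNOSZZ")


def strip_diacritics(text):
    """Replace each Polish diacritic letter with its base Latin letter."""
    return text.translate(STRIP)


# Illustrative sample: distinct Polish words that differ only in diacritics.
words = ["laska", "łaska", "sad", "sąd", "los", "łoś"]

# Group the originals by their stripped form; any group with more than one
# entry is a collision that no decoder could resolve from the ASCII alone.
collisions = defaultdict(list)
for word in words:
    collisions[strip_diacritics(word)].append(word)

for ascii_form, originals in collisions.items():
    print(ascii_form, "<-", originals)
```

Each stripped form here maps back to two different words, and ź and ż both collapse to z, so extra information has to be stored somewhere if the original spelling must be recovered.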

1
votes

You could encode the presence of diacritics as a ternary number (one digit per character) and store it next to the plain ASCII transliteration to make it reversible.

URLs often contain some additional IDs anyway, including this one: 48686148/how-to-transliterate-polish-alphabet-with-us-ascii

Here is an example implementation:

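# Polish letter -> (ASCII base letter, diacritic digit).
# Digit 0 = no diacritic, 1 = diacritic, 2 = second variant (only ż, to tell it from ź).
trans_table = {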
    'A': ('A', 0),   'a': ('a', 0),
    'Ą': ('A', 1),   'ą': ('a', 1),
    'B': ('B', 0),   'b': ('b', 0),
    'C': ('C', 0),   'c': ('c', 0),
    'Ć': ('C', 1),   'ć': ('c', 1),
    'D': ('D', 0),   'd': ('d', 0),
    'E': ('E', 0),   'e': ('e', 0),
    'Ę': ('E', 1),   'ę': ('e', 1),
    'F': ('F', 0),   'f': ('f', 0),
    'G': ('G', 0),   'g': ('g', 0),
    'H': ('H', 0),   'h': ('h', 0),
    'I': ('I', 0),   'i': ('i', 0),
    'J': ('J', 0),   'j': ('j', 0),
    'K': ('K', 0),   'k': ('k', 0),
    'L': ('L', 0),   'l': ('l', 0),
    'Ł': ('L', 1),   'ł': ('l', 1),
    'M': ('M', 0),   'm': ('m', 0),
    'N': ('N', 0),   'n': ('n', 0),
    'Ń': ('N', 1),   'ń': ('n', 1),
    'O': ('O', 0),   'o': ('o', 0),
    'Ó': ('O', 1),   'ó': ('o', 1),
    'P': ('P', 0),   'p': ('p', 0),
    'R': ('R', 0),   'r': ('r', 0),
    'S': ('S', 0),   's': ('s', 0),
    'Ś': ('S', 1),   'ś': ('s', 1),
    'T': ('T', 0),   't': ('t', 0),
    'U': ('U', 0),   'u': ('u', 0),
    'W': ('W', 0),   'w': ('w', 0),
    'Y': ('Y', 0),   'y': ('y', 0),
    'Z': ('Z', 0),   'z': ('z', 0),
    'Ź': ('Z', 1),   'ź': ('z', 1),
    'Ż': ('Z', 2),   'ż': ('z', 2),
}



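def pol2ascii(text):
    """Strip diacritics and append them as a hex-encoded ternary number."""
    plain = []
    diacritics = []
    for c in text:
        # Characters missing from the table pass through with digit 0.
        ascii_char, diacritic = trans_table.get(c, (c, 0))
        plain.append(ascii_char)
        diacritics.append(str(diacritic))

    # Reverse so the first character is the least significant ternary digit;
    # the leading '1' is a sentinel digit placed above all the real ones.
    ternary = '1' + ''.join(reversed(diacritics))
    return ''.join(plain) + '_' + hex(int(ternary, 3))[2:]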

# Inverse lookup: (ASCII letter, diacritic digit) -> Polish letter.
reverse_trans_table = {
    pair: pol_char for pol_char, pair in trans_table.items()
}

def ascii2pol(text):
    """Inverse of pol2ascii: restore diacritics from the hex suffix."""
    plain, diacritics = text.rsplit('_', 1)
    diacritics = int(diacritics, base=16)
    res = []

    for c in plain:
        # Peel off one ternary digit per character, least significant first.
        diacritics, diacritic = divmod(diacritics, 3)
        pol_char = reverse_trans_table.get((c, diacritic), c)
        res.append(pol_char)

    return ''.join(res)


TESTS = '''
Świętosław Milczący
Dzierżykraj Łaźniński
Józef Soćko
'''

for line in TESTS.strip().splitlines():
    plain = pol2ascii(line)
    original = ascii2pol(plain)
    print(original, plain)
    assert original == line
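
To make the arithmetic easier to follow, here is a self-contained trace of the same encoding for a single name. It is only a sketch: `MARKED` and `digits_for` are names I made up for this illustration, and the compact table lists just the letters that carry a diacritic, relying on `str.lower()` to cover capitals.

```python
# Only letters with a diacritic need an entry; everything else gets digit 0.
# Digit 2 is used for ż so it can be told apart from ź (both become z).
MARKED = {
    'ą': ('a', 1), 'ć': ('c', 1), 'ę': ('e', 1), 'ł': ('l', 1),
    'ń': ('n', 1), 'ó': ('o', 1), 'ś': ('s', 1), 'ź': ('z', 1),
    'ż': ('z', 2),
}


def digits_for(text):
    """Return one ternary digit per character (0 = no diacritic)."""
    return [MARKED.get(c.lower(), (c, 0))[1] for c in text]


name = 'Józef Soćko'
digits = digits_for(name)

# Reverse so the first character becomes the least significant digit,
# then prepend the sentinel '1', mirroring pol2ascii.
ternary = '1' + ''.join(str(d) for d in reversed(digits))
value = int(ternary, 3)
suffix = format(value, 'x')

print(digits)   # [0, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0]
print(ternary)  # 100100000010
print(value)    # 183711  (3**11 + 3**8 + 3**1)
print(suffix)   # 2cd9f
```

So the name would be stored as Jozef Socko_2cd9f. The suffix grows by roughly one hex digit for every two or three characters of text, which seems acceptable for names and titles in a URL.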