Data structure for multi-language dictionary?

Question

One-line summary: suggest optimal (lookup-speed/compactness) data structure(s) for a multi-lingual dictionary representing primarily Indo-European languages (list at bottom).

Say you want to build some data structure(s) to implement a multi-language dictionary for let's say the top-N (N~40) European languages on the internet, ranking choice of language by number of webpages (rough list of languages given at bottom of this question). The aim is to store the working vocabulary of each language (i.e. 25,000 words for English etc.) Proper nouns excluded. Not sure whether we store plurals, verb conjugations, prefixes etc., or add language-specific rules on how these are formed from noun singulars or verb stems. Also your choice on how we encode and handle accents, diphthongs and language-specific special characters e.g. maybe where possible we transliterate things (e.g. Romanize German ß as 'ss', then add a rule to convert it). Obviously if you choose to use 40-100 characters and a trie, there are way too many branches and most of them are empty.

Task definition: Whatever data structure(s) you use, you must do both of the following:

The main operation in lookup is to quickly get an indication 'Yes this is a valid word in languages A,B and F but not C,D or E'. So, if N=40 languages, your structure quickly returns 40 Booleans.
The secondary operation is to return some pointer/object for that word (and all its variants) for each language (or null if it was invalid). This pointer/object could be user-defined e.g. the Part-of-Speech and dictionary definition/thesaurus similes/list of translations into the other languages/... It could be language-specific or language-independent e.g. a shared definition of pizza)

And the main metric for efficiency is a tradeoff of a) compactness (across all N languages) and b) lookup speed. Insertion time not important. The compactness constraint excludes memory-wasteful approaches like "keep a separate hash for each word" or "keep a separate for each language, and each word within that language".

So:

What are the possible data structures, how do they rank on the lookup speed/compactness curve?
Do you have a unified structure for all N languages, or partition e.g. the Germanic languages into one sub-structure, Slavic into another etc? or just N separate structures (which would allow you to Huffman-encode )?
What representation do you use for characters, accents and language-specific special characters?
Ideally, give link to algorithm or code, esp. Python or else C. -

(I checked SO and there have been related questions but not this exact question. Certainly not looking for a SQL database. One 2000 paper which might be useful: "Estimation of English and non-English Language Use on the WWW" - Grefenstette & Nioche. And one list of multi-language dictionaries) Resources: two online multi-language dictionaries are Interglot (en/ge/nl/fr/sp/se) and LookWayUp (en<->fr/ge/sp/nl/pt).

Languages to include:

Probably mainly Indo-European languages for simplicity: English, French, Spanish, German, Italian, Swedish + Albanian, Czech, Danish, Dutch, Estonian, Finnish, Hungarian, Icelandic, Latvian, Lithuanian, Norwegian, Polish, Portuguese, Romanian, Russian, Serbo Croat, Slovak, Slovenian + Breton, Catalan, Corsican, Esperanto, Gaelic, Welsh

Probably include Russian, Slavic, Turkish, exclude Arabic, Hebrew, Iranian, Indian etc. Maybe include Malay family too. Tell me what's achievable.

Do you handle localised words? (Localized is the WRONG spelling!) — Arafangion
I don't know, got any numbers on how they increase the vocabulary size? Are we talking about variants of the same word (e.g. US -ize vs British -ise), or adding entirely new words? I'm only thinking about a working vocabulary (e.g. 25,000 words in English), not a full lexicon. Let's say: somewhat, up to the point that it doesn't detract from the primary task being constructing a multi-lingual structure for N languages. — smci
Variations, and is the kind of thing that Mac OS X seems to support by default. — Arafangion
@Arafangion: Neither localiSed nor localiZed is wrong because they are locali*ed. — Alexey Frunze
@Alex: whatevs. I'm really interested in quality answers to this question... someone out there must have some suggestion. — smci

templatetypedef templatetypedef · Accepted Answer · 2011-08-15T08:33:43

I'm not sure whether or not this will work for your particular problem, but here's one idea to think about.

A data structure that's often used for fast, compact representations of language is a minimum-state DFA for the language. You could construct this by creating a trie for the language (which is itself an automaton for recognizing strings in the language), then using of the canonical algorithms for constructing a minimum-state DFA for the language. This may require an enormous amount of preprocessing time, but once you've constructed the automaton you'll have extremely fast lookup of words. You would just start at the start state and follow the labeled transitions for each of the letters. Each state could encode (perhaps) a 40-bit value encoding for each language whether or not the state corresponds to a word in that language.

Because different languages use different alphabets, it might a good idea to separate out each language by alphabet so that you can minimize the size of the transition table for the automaton. After all, if you have words using the Latin and Cyrillic alphabets, the state transitions for states representing Greek words would probably all be to the dead state on Latin letters, while the transitions for Greek characters for Latin words would also be to the dead state. Having multiple automata for each of these alphabets thus could eliminate a huge amount of wasted space.

Data structure for multi-language dictionary?

5 Answers