6
votes

We have a requirement to transliterate Arabic text to Latin characters (without diacritical marks) and display the result to users.

We are currently using IBM ICU4J for this, but the API does not transliterate Arabic text into properly readable Latin characters. See the example below:

Example

  • Arabic text :

    صدام حسين التكريتي

  • Google's transliteration output

    : Sadaam Hussein al-tikriti

  • ICU4J's transliteration output

    : ṣdạm ḥsyn ạltkryty

How can we improve the transliterated output of the ICU4J library?

ICU4J gives us the option to write our own rules, but we are currently stuck: no one on our team knows Arabic, and we have been unable to find a proper standard to follow.

1
Is there any reason why you can't use Google's transliteration API? Since Arabic script omits most of the vowels, you can't do a purely rule-based transliteration from Arabic to Latin. You will have to look up each Arabic word in a dictionary, probably together with context knowledge, to distinguish words that are written identically in Arabic script but have different transliterations. - jarnbjo
@jarnbjo Thanks for your interest. Google's transliteration API is not free, and we want to use something that is open source. - Kamlesh Sharma

1 Answer

1
votes

It took me about 4 hours of research to find another way to tackle this problem. In the end I went back to ICU4J and found a solution for the diacritics. You can run the code below and see the step you were missing.

package com.webom.crypt;

import org.apache.commons.lang3.StringEscapeUtils;

import com.ibm.icu.text.Transliterator;

public class Test {

    // Plain ICU transform: keeps diacritics such as the dot below in "ṣ".
    public static final String ARABIC_TO_LATIN = "Arabic-Latin";
    // Compound ID: transliterate, decompose (NFD), remove combining marks, recompose (NFC).
    public static final String ARABIC_TO_LATIN_NO_ACCENTS = "Arabic-Latin; nfd; [:nonspacing mark:] remove; nfc";

    public static void main(String[] args) {
        String arabicString = "صدام حسين التكريتي";

        // Print the input as Java Unicode escapes to confirm it was read correctly.
        String unicodeCodes = StringEscapeUtils.escapeJava(arabicString);
        System.out.println("Unicode codes:" + unicodeCodes);

        // Your way: diacritics are kept.
        Transliterator arabicToLatin = Transliterator.getInstance(ARABIC_TO_LATIN);
        String result1 = arabicToLatin.transliterate(arabicString);
        System.out.println("ARABIC to Latin:" + result1);

        // My way: diacritics are stripped after transliteration.
        Transliterator arabicToLatinNoAccents = Transliterator.getInstance(ARABIC_TO_LATIN_NO_ACCENTS);
        String result2 = arabicToLatinNoAccents.transliterate(arabicString);
        System.out.println("ARABIC to Latin (no accents):" + result2);
    }
}

Try it out and verify for yourself. The output you receive should be exactly as shown below.

Unicode codes:\u0635\u062F\u0627\u0645 \u062D\u0633\u064A\u0646 \u0627\u0644\u062A\u0643\u0631\u064A\u062A\u064A

ARABIC to Latin:ṣdạm ḥsyn ạltkryty

ARABIC to Latin (no accents):sdam hsyn altkryty
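
To see what the "nfd; [:nonspacing mark:] remove; nfc" part of the ID does, here is a rough sketch of the same post-processing step using only the JDK's java.text.Normalizer, with no ICU4J involved. The class name MarkStripper and the method stripMarks are made up for illustration, and the regex \p{Mn} is assumed to be close enough to ICU's [:nonspacing mark:] set for this input. Note that this step only removes the dots and other marks; it does not add the vowels that are missing from the Arabic spelling.

```java
import java.text.Normalizer;

// Hypothetical helper (not part of ICU4J): mimics the
// "nfd; [:nonspacing mark:] remove; nfc" stage with JDK classes only.
final class MarkStripper {

    static String stripMarks(String text) {
        // NFD splits a precomposed letter such as "ṣ" into "s" + U+0323 (combining dot below).
        String decomposed = Normalizer.normalize(text, Normalizer.Form.NFD);
        // \p{Mn} matches nonspacing marks, i.e. the combining diacritics.
        String stripped = decomposed.replaceAll("\\p{Mn}+", "");
        // NFC recomposes whatever is left into canonical composed form.
        return Normalizer.normalize(stripped, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Input is the plain "Arabic-Latin" output from above, written with escapes.
        String icuOutput = "\u1E63d\u1EA1m \u1E25syn \u1EA1ltkryty";
        System.out.println(stripMarks(icuOutput)); // prints "sdam hsyn altkryty"
    }
}
```

Doing it inside the compound transliterator ID, as in the answer, keeps everything in one ICU4J call; a separate step like this may only be worth it if the text reaches you already transliterated.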