I have a MySQL table containing Arabic strings, but the table's character set is latin1. I want to convert it to utf8 so that the text is displayed properly.

I have gone through this: http://www.bothernomore.com/2008/12/16/character-encoding-hell/

But it did not work for Arabic characters. I have also seen a post here: Latin1 to UTF8 conversion. The comment says:

latin1 doesn't have support for Arabic characters. How can your text be stored as latin1?

Does this mean that I cannot convert it to any character set that could display the Arabic characters?

@Ansari I know this workaround, but the problem is that I cannot edit the dump file since it is too big to open in any text editor. - sam
Use a utility like sed or something to do it programmatically. - Ansari

1 Answer

The Latin1 (ISO 8859-1) code set is for western European languages and simply does not have Arabic characters. You need ISO 8859-6 to get Arabic characters. Now, you could have code points in the range 0x00..0xFF that are valid Arabic characters in 8859-6 and appear as European accented characters in 8859-1, and you could arrange to map the 8859-6 values to UTF8. The lower half of the range of 8859-6 is the same as 8859-1; this is true for all 8859-x code sets, in fact, and 'half' is really 5/8ths since the code points 0x80..0x9F are control codes.

The characters defined in 8859-6 that are not the same as in 8859-1 start at 0xA0. There are lots of gaps in the 8859-6 code set.

A0 U+00A0 NO-BREAK SPACE
A4 U+00A4 CURRENCY SIGN
AC U+060C ARABIC COMMA
AD U+00AD SOFT HYPHEN

BB U+061B ARABIC SEMICOLON
BF U+061F ARABIC QUESTION MARK

C1 U+0621 ARABIC LETTER HAMZA
C2 U+0622 ARABIC LETTER ALEF WITH MADDA ABOVE
C3 U+0623 ARABIC LETTER ALEF WITH HAMZA ABOVE
C4 U+0624 ARABIC LETTER WAW WITH HAMZA ABOVE
C5 U+0625 ARABIC LETTER ALEF WITH HAMZA BELOW
C6 U+0626 ARABIC LETTER YEH WITH HAMZA ABOVE
C7 U+0627 ARABIC LETTER ALEF
C8 U+0628 ARABIC LETTER BEH
C9 U+0629 ARABIC LETTER TEH MARBUTA
CA U+062A ARABIC LETTER TEH
CB U+062B ARABIC LETTER THEH
CC U+062C ARABIC LETTER JEEM
CD U+062D ARABIC LETTER HAH
CE U+062E ARABIC LETTER KHAH
CF U+062F ARABIC LETTER DAL

D0 U+0630 ARABIC LETTER THAL
D1 U+0631 ARABIC LETTER REH
D2 U+0632 ARABIC LETTER ZAIN
D3 U+0633 ARABIC LETTER SEEN
D4 U+0634 ARABIC LETTER SHEEN
D5 U+0635 ARABIC LETTER SAD
D6 U+0636 ARABIC LETTER DAD
D7 U+0637 ARABIC LETTER TAH
D8 U+0638 ARABIC LETTER ZAH
D9 U+0639 ARABIC LETTER AIN
DA U+063A ARABIC LETTER GHAIN

E0 U+0640 ARABIC TATWEEL
E1 U+0641 ARABIC LETTER FEH
E2 U+0642 ARABIC LETTER QAF
E3 U+0643 ARABIC LETTER KAF
E4 U+0644 ARABIC LETTER LAM
E5 U+0645 ARABIC LETTER MEEM
E6 U+0646 ARABIC LETTER NOON
E7 U+0647 ARABIC LETTER HEH
E8 U+0648 ARABIC LETTER WAW
E9 U+0649 ARABIC LETTER ALEF MAKSURA
EA U+064A ARABIC LETTER YEH
EB U+064B ARABIC FATHATAN
EC U+064C ARABIC DAMMATAN
ED U+064D ARABIC KASRATAN
EE U+064E ARABIC FATHA
EF U+064F ARABIC DAMMA

F0 U+0650 ARABIC KASRA
F1 U+0651 ARABIC SHADDA
F2 U+0652 ARABIC SUKUN

Any character in the range 0xA0..0xFF not listed above is not a valid Arabic character in 8859-6.
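
To make the table concrete, here is a small Python sketch showing how the same bytes read differently under the two code sets. It relies on Python's built-in `latin-1` and `iso8859_6` codecs; the sample bytes are chosen purely for illustration and are not taken from the question's data.

```python
# The same three bytes, interpreted under two different single-byte code sets.
raw = bytes([0xE3, 0xCA, 0xC8])  # sample bytes chosen for illustration

as_latin1 = raw.decode("latin-1")    # how a latin1 column would interpret them
as_arabic = raw.decode("iso8859_6")  # how an 8859-6 author would have meant them


def code_points(text: str) -> list:
    """Return the Unicode code points of text in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]


# Latin1 maps each byte to the code point of the same value (accented letters);
# 8859-6 maps the same bytes to KAF, TEH, BEH as listed in the table above.
print(code_points(as_latin1))
print(code_points(as_arabic))
```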

The iconv program can probably handle the conversion of 8859-6 to UTF-8; I have a program that can do it too, and this is one data file for that program. (It converts any single-byte code set, SBCS, to UTF8, given a suitable table.)
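
As a rough sketch of that conversion in Python, assuming the stored bytes really are 8859-6 that were merely labelled as Latin1 (an assumption you would need to check against your own data): the function name `latin1_mojibake_to_utf8` is made up for this example.

```python
def latin1_mojibake_to_utf8(text: str) -> bytes:
    """Reinterpret text that was read as Latin1 but was really ISO 8859-6.

    Assumption: every character in `text` came from a single byte that the
    original author meant as 8859-6. That is a guess about the data.
    """
    raw = text.encode("latin-1")      # recover the original bytes
    arabic = raw.decode("iso8859_6")  # reinterpret them as 8859-6
    return arabic.encode("utf-8")     # re-encode as UTF-8


# These three characters are how the bytes E3 CA C8 look when read as Latin1.
garbled = "\u00e3\u00ca\u00c8"
utf8_bytes = latin1_mojibake_to_utf8(garbled)
print(utf8_bytes.hex())  # two UTF-8 bytes per Arabic letter
```

If the data contains a byte that 8859-6 does not define, the `decode` step will raise an error rather than guess, which is a useful signal that the 8859-6 interpretation is wrong for that row. On the command line, `iconv -f ISO-8859-6 -t UTF-8` should do the same job on a file of raw bytes, provided your iconv build supports that code set.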

See: http://czyborra.com/charsets/iso8859.html#ISO-8859-6 for 8859-6 specifically and http://czyborra.com/charsets/iso8859.html generally for information about ISO 8859-x code sets. It also has links to other pages discussing different code sets.


Does it mean I cannot convert it to any character set which could display the Arabic characters?

No; you could convert it, but it definitely means you have to understand what the hell you mean by 'Arabic characters in Latin1' because the statement doesn't mean anything on its own — it is a contradiction in terms.

I've put a plausible spin on your statement that gives a meaningful interpretation of the data you've got, but I can't guarantee that it is the correct interpretation.

You'll have to know how the data was entered, what it is supposed to mean, and decide how to translate it. If your data was entered by someone using 8859-6 but it was stored into a column (table, database) that assumed it was 8859-1, you could extract the values, translate to UTF8 and insert the UTF8 data into a database that expects UTF8. (Actually, since 8859-1 will accept any arbitrary sequence of bytes, you can stuff the UTF8 into an 8859-1 column, noting that there'll be two bytes for each Arabic character. It won't be meaningful as 8859-1, but it will be accurate as long as you don't truncate anything. If you do truncate the string, some of the time you'll break in the middle of a UTF8 character, and then anything that interprets the data as UTF8 will be unhappy with you.)
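
The truncation hazard is easy to demonstrate. In this Python sketch, `truncate_utf8` is a hypothetical helper (not something MySQL provides) that cuts a UTF-8 byte string at a byte limit without splitting a character; the sample word is for illustration only.

```python
def truncate_utf8(data: bytes, max_bytes: int) -> bytes:
    """Cut UTF-8 bytes to at most max_bytes without splitting a character."""
    if len(data) <= max_bytes:
        return data
    end = max_bytes
    # Continuation bytes look like 10xxxxxx; back up to a character start.
    while end > 0 and (data[end] & 0xC0) == 0x80:
        end -= 1
    return data[:end]


word = "\u0643\u062a\u0628".encode("utf-8")  # three Arabic letters, 6 bytes

naive = word[:3]  # a plain byte cut lands in the middle of the second letter
try:
    naive.decode("utf-8")
    naive_ok = True
except UnicodeDecodeError:
    naive_ok = False

safe = truncate_utf8(word, 3)  # backs up to 2 bytes, i.e. one whole letter
print(naive_ok, len(safe))
```

The naive cut fails to decode, while the helper gives up one byte of the budget to keep the result valid UTF-8.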