I have data in a MySQL table with the utf-8 charset. A PySpark script loads the MySQL data and writes it as a Parquet file to an S3 bucket. When fetching the data from MySQL, I get it in the following format:

'الشرقية'

When I then encoded it as UTF-8, I got the following byte string:

'\xc3\x98\xc2\xa7\xc3\x99\xe2\x80\x9e\xc3\x98\xc2\xb4\xc3\x98\xc2\xb1\xc3\x99\xe2\x80\x9a\xc3\x99\xc5\xa0\xc3\x98\xc2\xa9'

After that, I decoded it using the mac_arabic encoding and got the following text:

'أ»آ'أôقÄûأ»آ٤أ»آ١أôقÄöأôإ أ»آ)'

Is there a way to recover the Arabic text from any one of these strings?

Below is the code:

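from pyspark.sql import SQLContext

# sc is the existing SparkContext (e.g. the one the pyspark shell provides)
sqlContext = SQLContext(sc)

# Load the whole MySQL table into a DataFrame over JDBC
df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost/db_name",
    driver="com.mysql.jdbc.Driver",
    dbtable="table",
    user="root",
    password="root"
).load()

df.show()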

The columns in the table are configured with: CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci DEFAULT NULL

The database is configured with: ENGINE=InnoDB AUTO_INCREMENT=42627 DEFAULT CHARSET=latin1

Thanks in advance.

How do you load and write the first string? Show us the code you used. If your initial string is UTF-8 encoded, you have to decode it with the same encoding, i.e. UTF-8. - mehdix
@MehdiSadeghi I am using sqlContext.read.format("jdbc").options().load() to load the whole table into a DataFrame. The first text is the value I get from the MySQL table after running that command. - Raghav salotra
Please add a piece of working code to your question, including your database configuration. How did you add the string to the database in the first place? How are you configuring sqlContext? How are you printing? Which client are you using? By the way, your first string is gibberish; no encoding or decoding will convert it back into Arabic. - mehdix
@MehdiSadeghi I have added the code. Please check. - Raghav salotra
@MehdiSadeghi Is there any other information you need? - Raghav salotra

1 Answer

The version of the JDBC driver on your platform is apparently not using UTF-8 by default, so the UTF-8 bytes stored in the table get decoded with the wrong charset. Try passing the encoding to the driver explicitly in the connection URL:

df = sqlContext.read.format("jdbc").options(
    url="jdbc:mysql://localhost/db_name?characterEncoding=utf8",
    driver="com.mysql.jdbc.Driver",
    dbtable="table",
    user="root",
    password="root").load()
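
As for recovering the text from the strings already shown in the question, here is a minimal Python 3 sketch. It assumes the garbling came from UTF-8 bytes being decoded as cp1252 (the usual meaning of MySQL's latin1) somewhere on the connection. The helper name repair_mojibake is hypothetical and not part of any library, and the byte string is copied from the question.

```python
# Byte string from the question: the garbled text after encoding it as UTF-8.
double_encoded = (
    b"\xc3\x98\xc2\xa7\xc3\x99\xe2\x80\x9e\xc3\x98\xc2\xb4\xc3\x98\xc2\xb1"
    b"\xc3\x99\xe2\x80\x9a\xc3\x99\xc5\xa0\xc3\x98\xc2\xa9"
)


def repair_mojibake(text, wrong_codec="cp1252"):
    """Undo one round of 'UTF-8 bytes decoded with wrong_codec'.

    Hypothetical helper; wrong_codec="cp1252" is an assumption about
    what the driver did, not something confirmed by the question.
    """
    # Re-encoding with the wrong codec restores the original UTF-8 bytes,
    # which can then be decoded correctly.
    return text.encode(wrong_codec).decode("utf-8")


# Decoding the bytes as UTF-8 gives back the garbled string from the question.
garbled = double_encoded.decode("utf-8")
fixed = repair_mojibake(garbled)

# ascii() keeps the output printable even on a non-UTF-8 terminal.
print(ascii(fixed))
```

If that assumption holds, fixed is the original Arabic word. This only patches values that have already been read; setting the encoding on the connection as shown above avoids the problem at the source, and a string containing characters that cp1252 cannot encode would make the helper raise UnicodeEncodeError.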