2
votes

I have a csv file containing some characters that lie outside Unicode BMP, for example the character ????. They are SMP characters, so they need to be stored in utf8mb4 charset and utf8mb4_general_ci collation in MySQL instead of utf8 charset and utf8_general_ci collation.

So here are my SQL queries.

MariaDB [tweets]> set names 'utf8mb4';
Query OK, 0 rows affected (0.01 sec)

MariaDB [tweets]> create table test (a text) collate utf8mb4_general_ci;
Query OK, 0 rows affected (0.06 sec)

MariaDB [tweets]> insert into test (a) values ('????');
Query OK, 1 row affected (0.03 sec)

MariaDB [tweets]> select * from test;
+------+
| a    |
+------+
| ????     |
+------+
1 row in set (0.00 sec)

No warnings. Everything is right. Now I want to load that csv file. For test, the file has only one line.

MariaDB [tweets]> load data local infile 't.csv' into table wzyboy character set utf8mb4 fields terminated by ',' enclosed by '"' lines terminated by '\n\n' (tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,timestamp,source,text,expanded_urls);
Query OK, 1 row affected, 7 warnings (0.01 sec)      
Records: 1  Deleted: 0  Skipped: 0  Warnings: 7

The warning message is:

| Warning | 1366 | Incorrect string value: '\xF0\x9F\x80\x80' for column 'text' at row 1      |

All my working environments (OS, Termianl, etc) use UTF-8. I have specified utf8mb4 in everyplace I could think up of, and if I manually INSERT INTO it works just fine. However, when I use LOAD DATA INFILE [...] CHARACTER SET utf8mb4 [...] it just fails with error "Incorrect string value".

1

1 Answers

1
votes

Problem solved.

It was a mistake. During the experiment, I just TRUNCATE TABLE but not re-create it. So the database and the table are both utf8mb4, but the columns are still utf8...