I have a csv file containing some characters that lie outside Unicode BMP, for example the character ????. They are SMP characters, so they need to be stored in utf8mb4 charset and utf8mb4_general_ci collation in MySQL instead of utf8 charset and utf8_general_ci collation.
So here are my SQL queries.
MariaDB [tweets]> set names 'utf8mb4';
Query OK, 0 rows affected (0.01 sec)
MariaDB [tweets]> create table test (a text) collate utf8mb4_general_ci;
Query OK, 0 rows affected (0.06 sec)
MariaDB [tweets]> insert into test (a) values ('????');
Query OK, 1 row affected (0.03 sec)
MariaDB [tweets]> select * from test;
+------+
| a |
+------+
| ???? |
+------+
1 row in set (0.00 sec)
No warnings. Everything is right. Now I want to load that csv file. For test, the file has only one line.
MariaDB [tweets]> load data local infile 't.csv' into table wzyboy character set utf8mb4 fields terminated by ',' enclosed by '"' lines terminated by '\n\n' (tweet_id,in_reply_to_status_id,in_reply_to_user_id,retweeted_status_id,retweeted_status_user_id,timestamp,source,text,expanded_urls); Query OK, 1 row affected, 7 warnings (0.01 sec) Records: 1 Deleted: 0 Skipped: 0 Warnings: 7
The warning message is:
| Warning | 1366 | Incorrect string value: '\xF0\x9F\x80\x80' for column 'text' at row 1 |
All my working environments (OS, Termianl, etc) use UTF-8. I have specified utf8mb4 in everyplace I could think up of, and if I manually INSERT INTO it works just fine. However, when I use LOAD DATA INFILE [...] CHARACTER SET utf8mb4 [...] it just fails with error "Incorrect string value".