I am using Google Cloud's import API to load CSV files from Google Cloud Storage into a Cloud SQL MySQL 5.7 database:
https://cloud.google.com/sql/docs/mysql/admin-api/v1beta4/instances/import
The import API generates a LOAD DATA INFILE statement for the MySQL database that assumes the 'utf8' character set. However, the data in my CSV files is encoded as 'utf8mb4', which is a superset of 'utf8'. This causes the load to fail whenever a string cannot be represented in 'utf8':
Exception: CloudSQL Exception: {'kind': 'sql#operation', 'selfLink': 'https://www.googleapis.com/sql/v1beta4/projects/***', 'targetProject': '***', 'targetId': '***', 'targetLink': 'https://www.googleapis.com/sql/v1beta4/projects/***', 'name': '0211c99e-0633-42f1-9ee1-069473308273', 'operationType': 'IMPORT', 'status': 'RUNNING', 'user': '***', 'insertTime': '2019-01-14T02:36:39.861Z', 'startTime': '2019-01-14T02:36:39.972Z', 'error': {'kind': 'sql#operationErrors', 'errors': [{'kind': 'sql#operationError', 'code': 'ERROR_RDBMS', 'message': "Import CSV error: Error 1300: Invalid utf8 character string: ''Afikanisitani|'Apekanikana|A Phu Han (Afghanistan)|A Phú Hãn '\n"}]}, 'importContext': {'kind': 'sql#importContext', 'uri': '***', 'database': '**', 'importUser': '', 'csvImportOptions': {'table': '***'}}}
Relevant piece: 'message': "Import CSV error: Error 1300: Invalid utf8 character string"
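As a side note, MySQL's legacy 'utf8' stores at most 3 bytes per character, so any character outside the Basic Multilingual Plane in the file requires 'utf8mb4'. A minimal sketch to flag such characters before importing (the file name is a placeholder):

# Sketch: flag CSV rows containing characters that MySQL's 3-byte
# 'utf8' cannot store (anything above U+FFFF needs utf8mb4).
with open('myCSVFile.csv', encoding='utf-8') as f:
    for lineno, line in enumerate(f, start=1):
        wide = [ch for ch in line if ord(ch) > 0xFFFF]
        if wide:
            print(f'line {lineno} needs utf8mb4: {wide!r}')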
Is there a way to specify the 'utf8mb4' CHARACTER SET through the import API (or any other character set, for that matter)?
I tried experimenting with adding 'character set': 'utf8mb4' to the 'csvImportOptions' dictionary, but it appears the import API only accepts the 'table' and 'columns' keys in that dict.
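For illustration, this is roughly what that experiment looked like; the extra key is my own guess at a spelling, not part of the documented schema, and the API does not recognize it:

# Experimental request body: csvImportOptions with a character-set key.
# 'character set' is NOT a documented key (only 'table' and 'columns'
# are), so the API does not honor it. All values are placeholders.
instances_import_request_body = {
    "importContext": {
        "kind": "sql#importContext",
        "fileType": "CSV",
        "uri": "gs://my-bucket/myCSVFile.csv",
        "database": "my-database",
        "csvImportOptions": {
            "table": "my_table",
            "character set": "utf8mb4",  # experimental; not accepted
        },
    }
}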
Note that if I run the LOAD DATA INFILE command directly from a MySQL client, I can import the CSV with no issues:
LOAD DATA INFILE 'myCSVFile.csv'
INTO TABLE `my_table`
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n'
IGNORE 1 ROWS;
For reference, here is the Python code I use to call the import API:

from pprint import pprint
from googleapiclient import discovery
from oauth2client.client import GoogleCredentials

credentials = GoogleCredentials.get_application_default()
service = discovery.build('sqladmin', 'v1beta4', credentials=credentials)

# Project ID of the project that contains the instance.
project = 'my-project'  # TODO: Update placeholder value.

# Cloud SQL instance ID. This does not include the project ID.
instance = 'my-instance'  # TODO: Update placeholder value.

# GCS URI of the CSV file, plus target database and table.
gcs_uri = 'gs://my-bucket/myCSVFile.csv'  # TODO: Update placeholder value.
database = 'my-database'  # TODO: Update placeholder value.
table = 'my_table'  # TODO: Update placeholder value.

instances_import_request_body = {
    "importContext": {
        "kind": "sql#importContext",
        "fileType": "CSV",
        "uri": gcs_uri,
        "database": database,
        "csvImportOptions": {
            "table": table
        }
    }
}

request = service.instances().import_(project=project, instance=instance,
                                      body=instances_import_request_body)
response = request.execute()
pprint(response)
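The import runs asynchronously, so the error above surfaces from the returned operation rather than from the call itself. A sketch of one way to poll the operation with the same discovery client (the poll interval is arbitrary):

import time

# Poll the long-running operation returned by instances().import_()
# until it reaches DONE, then check it for errors.
operation_id = response['name']
while True:
    op = service.operations().get(project=project,
                                  operation=operation_id).execute()
    if op.get('status') == 'DONE':
        break
    time.sleep(5)  # arbitrary poll interval

if 'error' in op:
    raise Exception('CloudSQL Exception: {}'.format(op))
pprint(op)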
ADDITIONAL DATA POINT: It is clear to me that the LOAD DATA INFILE query that Google's API generates defaults to the 'utf8' character set.
This fails with the same error message as the API (note CHARACTER SET utf8):
LOAD DATA INFILE 'problematic.csv'
INTO TABLE my_table
CHARACTER SET utf8
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\"';
ERROR 1300 (HY000): Invalid utf8 character string: ''Afikanisitani|'Apekanikana|A Phu Han (Afghanistan)|A Phú Hãn '
This works (with CHARACTER SET utf8mb4):
LOAD DATA INFILE 'problematic.csv'
INTO TABLE my_table
CHARACTER SET utf8mb4
FIELDS TERMINATED BY ','
OPTIONALLY ENCLOSED BY '\"'
ESCAPED BY '\"';
Query OK, 75641 rows affected (1.14 sec)
Records: 75641 Deleted: 0 Skipped: 0 Warnings: 0
The documentation here is incorrect: https://cloud.google.com/sql/docs/mysql/import-export/importing. It states that the import runs the following statement with CHARACTER SET 'utf8mb4', but as shown above the generated query actually uses 'utf8':
LOAD DATA INFILE ... CHARACTER SET 'utf8mb4'
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '\"'.
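In the meantime, the only workaround I can see is to pre-process the CSV so every character fits in 3-byte 'utf8' before uploading to GCS. A lossy sketch that replaces anything outside the BMP with '?' (file names are placeholders):

# Lossy stopgap: rewrite the CSV so every character fits MySQL's
# 3-byte 'utf8', replacing anything above U+FFFF with '?'.
with open('problematic.csv', encoding='utf-8') as src, \
     open('scrubbed.csv', 'w', encoding='utf-8') as dst:
    for line in src:
        dst.write(''.join(ch if ord(ch) <= 0xFFFF else '?'
                          for ch in line))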