16
votes

I tried several CSV formats (different escape characters, quotes and other settings) to export data from MySQL and import it into BigQuery, but I was not able to find a solution that works in every case.

Google Cloud SQL requires the following code for importing/exporting from/to MySQL. Although Cloud SQL is not BigQuery, it is a good starting point:

SELECT * INTO OUTFILE 'filename.csv' CHARACTER SET 'utf8' 
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY '' FROM table

At the moment I use the following command to import a compressed CSV into BigQuery:

bq --nosync load -F "," --null_marker "NULL" --format=csv PROJECT:DATASET.tableName gs://bucket/data.csv.gz table_schema.json

On the one hand, the bq command does not allow setting the escape character (" is escaped by another ", which seems to be a well-defined CSV format). On the other hand, using \" as the escape character for the MySQL export would write NULL values as "N, which does not work either:

CSV table references column position 34, but line starting at position:0 contains only 34 columns. (error code: invalid)
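
To illustrate the dilemma with a hypothetical row (one text column containing a double quote, followed by a NULL column), the two escape settings produce roughly:

ESCAPED BY '\\'  ->  "some \"text\"",\N     (the NULL matches \N, but bq expects "" rather than \" for embedded quotes)
ESCAPED BY '\"'  ->  "some ""text""","N     (embedded quotes become the accepted "" doubling, but NULL turns into "N)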

So my question is: how do I write a (table-independent) export statement for MySQL such that the generated file can be loaded into BigQuery? Which escape character should be used, and how should NULL values be set/handled?

Would setting --null_marker to "N work? – Tianzhou
@Tianzhou I tried setting --null_marker to "\"N", without success. – NaN

6 Answers

11
votes

I've been running into the same problem; here's my solution:

Exporting data from MySQL

First, export the data from MySQL this way:

SELECT * INTO OUTFILE 'filename.csv' CHARACTER SET 'utf8' 
FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY '' 
FROM <yourtable>

This is in reality a TSV file (tab-separated values), but you can import it as CSV anyway.

Import into Big Query

You should then be able to import it into BigQuery with the following parameters:

bq load --field_delimiter="\t" --null_marker="\N" --quote="" \
PROJECT:DATASET.tableName gs://bucket/data.csv.gz table_schema.json
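
If the export file still sits on the database host, it also has to be compressed and copied to Cloud Storage before the load; a minimal sketch (bucket name and paths are placeholders). bq load reads gzip-compressed CSVs directly, as in the question's command:

gzip filename.csv
gsutil cp filename.csv.gz gs://bucket/data.csv.gz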

Notes

  1. If any field in your MySQL database contains a tab character (\t), it will break your columns. To prevent that, you can apply the SQL function REPLACE(<column>, '\t', ' ') to the affected columns to convert the tabs to spaces (see the sketch after these notes).

  2. If you set the table schema in BigQuery's web interface, you won't need to specify it every time you load a CSV.
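
As a concrete sketch of note 1, the REPLACE can go straight into the export statement. This is only an illustration: dbname, mytable, id and description are made-up placeholders, and it assumes a local mysql client plus a server that is allowed to write to /tmp (see secure_file_priv):

mysql -u user -p dbname -e "
  SELECT id, REPLACE(description, '\t', ' ')
  INTO OUTFILE '/tmp/mytable.tsv' CHARACTER SET 'utf8'
  FIELDS TERMINATED BY '\t' OPTIONALLY ENCLOSED BY ''
  FROM mytable"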

I hope this works for you.

4
votes

You could try sqldump-to. It reads in any MySQL compatible dump stream and outputs newline delimited JSON for easy import into BigQuery.

The problem with CSV or TSV is escape characters; JSON doesn't really have that problem.

The tool also supports schema export, which will need to be edited afterwards with specific BigQuery data types per column, but it's a useful head start.

For example, use mysqldump to stream into sqldump-to:

mysqldump -u user -psecret dbname | sqldump-to --dir-output ./dbname --schema

You may need to modify the mysqldump command to match your particular MySQL configuration (e.g. remote servers).
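
For a remote server, for instance, it might look something like this (host, port, credentials and database name are placeholders; --single-transaction takes a consistent snapshot without locking InnoDB tables):

mysqldump -h db.example.com -P 3306 -u user -p --single-transaction dbname | sqldump-to --dir-output ./dbname --schema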

If you already have a dump file, the tool also supports multiple workers to better utilize your CPU.

Once sqldump-to has created your JSON files, simply use the bq command line tool to load into BigQuery:

bq load --source_format=NEWLINE_DELIMITED_JSON datasetname.tablename tablename.json tablename_schema.json

2
votes

Using the following SQL command seems to work for me, producing null values with \N:

SELECT * INTO OUTFILE '/tmp/foo.csv' CHARACTER SET 'utf8'  
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '\"' ESCAPED BY "\\" 
FROM table;

With this you should be able to import the data using --null_marker="\N". Can you give that a try and let me know if it doesn't work for you?
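
For reference, the matching load might then look roughly like the question's command with the NULL marker switched (this assumes /tmp/foo.csv has been copied to the bucket; project, dataset and schema names are the question's placeholders):

bq load --source_format=CSV --field_delimiter="," --null_marker="\N" \
PROJECT:DATASET.tableName gs://bucket/foo.csv table_schema.json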

2
votes

UPDATE 2019:

Try this as an alternative:

  • Load the MySQL backup files into a Cloud SQL instance.
  • Read the data in BigQuery straight out of MySQL, using a federated query (see the sketch below).
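
A minimal sketch of the second bullet, assuming a BigQuery connection to the Cloud SQL instance has already been created (the connection ID my-project.us.my-connection and the table name are placeholders):

bq query --nouse_legacy_sql '
SELECT * FROM EXTERNAL_QUERY(
  "my-project.us.my-connection",
  "SELECT * FROM my_table;")'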

For a longer how-to, see the BigQuery documentation on Cloud SQL federated queries.


You could use a tool like mysql2xxxx for maximum flexibility when exporting.

With mysql2csv you can run an arbitrary query, and the output process leverages FasterCSV, which will give you more options than the stock mysql ones.

1
votes

I had the very same problem importing from MySQL to BigQuery, and since my dataset contains several text columns, I couldn't use a standard separator like , ; or even \t without an encloser.

But with an encloser, I had either the escaped double-quote issue with the default \ escaper, or the NULL value issue with the " escaper, where NULL became "N instead of \N.

I was able to make it work by using the following steps and configuration. The trick is to use a control character as a safe separator, as I'm sure there is no such character in my data.

Step 1 : export from MySQL

configuration :

  • Field separator : control character 001
  • Encloser : '' ( none )

Here is the complete MySQL query. I'm using AWS RDS Aurora, so the syntax is a little different from standard MySQL (the file is written to S3):

SELECT * FROM my_table
INTO OUTFILE S3 's3://xxxxx/tmp/my_table/data'
CHARACTER SET UTF8MB4 
FIELDS TERMINATED BY x'01'
OPTIONALLY ENCLOSED BY ''
MANIFEST OFF 
OVERWRITE ON

Step 2 : Copy the dataset to cloud storage with gsutil

gsutil -m rsync s3://xxxxx/tmp/my_table/ gs://xxxxx/tmp/my_table/

Step 3 : Load data into BigQuery using the CLI

bq load --source_format=CSV --field_delimiter=^A --null_marker="\N" --quote="" project:base.my_table gs://xxxxx/tmp/my_table/* ./schema.json

Notes

  • ^A is the representation of the control character. You can type it on Windows with Alt+001, and in a Linux shell with Ctrl+V followed by Ctrl+A. It's actually just one character. A shell alternative is sketched after these notes.
  • We can't use the web interface to create the table, as we can't define a control character as the separator.
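
As an alternative to typing the literal ^A, bash's ANSI-C quoting can produce the same 0x01 byte, so the load from step 3 can also be written as (same placeholders as above):

bq load --source_format=CSV --field_delimiter=$'\x01' --null_marker="\N" --quote="" \
  project:base.my_table gs://xxxxx/tmp/my_table/* ./schema.json
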
0
votes

MySQL Table to BigQuery Import Script.md will export a MySQL table or a complete schema to BigQuery.

mysql_table_to_big_query.sh exports the table from MySQL to CSV and exports the schema to JSON and SQL files. The files are then uploaded to a folder in the Cloud Storage bucket and imported into BigQuery. A BigQuery dataset is created in the same project (if it does not already exist) with the name {SCHEMA_NAME}_{DATE}. If the table has a column of datatype DATE, the table is partitioned in BigQuery.

mysql_schema_to_big_query.sh extracts a list of all tables from the MySQL schema and calls mysql_table_to_big_query.sh for each one. The script creates CSV files, translating nulls as required, then transfers them to an existing Google Storage bucket and imports them into BigQuery.