0
votes

I have a bunch of big CSV files to be imported into BigQuery. Some of them contain rows with wrongly formatted data, some rows need to be modified (e.g. two columns of a row summed into a new one), etc. What is the best way to clean this up? Should I just upload all the dirty data into a table and then run a Cloud Function script on each row to sanitize it, modify it if needed, and store it into another "clean" table? Or should I do this with App Engine? Or maybe I should have App Engine process each CSV and insert it row by row into BigQuery?

Update:

I haven't found a way to run transformations/validations within GBQ without pulling the data into the app that invokes them. For example, an API server (Kubernetes Engine) is responsible for file uploads and then triggers the file import. But then what? The NodeJS SDK has CRUD operations, but they require pulling data out of GBQ, which is severely inefficient. Is there a way to trigger changes within GBQ itself?

To be specific, here are some examples of the CSV row transformations that I need:

  • convert a string like $123,50 into a number (integer, float, whatever), i.e. 123.5;
  • math operations on two or more columns of a single row, e.g. given price and number, compute total_price, which is price * number.
2
I see three questions here: 1) how to modify wrongly formatted data, 2) where it is better to do it - in the CSV before loading to GBQ or in GBQ after loading from GCS, and 3) how best to orchestrate this process for a big number of files. Which question exactly are you asking? I would recommend reading How to Ask - Mikhail Berlyant
#2. I don't have a big number of files, just 10 files, 20 to 100 MB each. But this procedure is repeatable, so it needs to be automated. - stkvtflw
Got it. I would definitely vote for doing this in GBQ, as it is the most flexible and fastest way: you just load your files as-is and then use BigQuery magic to transform your data. - Mikhail Berlyant

2 Answers

1
votes

Where is it better to do it - in the CSV before loading to GBQ or in GBQ after loading from GCS?

I recommend doing this in GBQ, as it is the most flexible and fastest way.
You just load your files as-is and then use BigQuery magic to transform your data.

As for automation - you can use any client of your preference to script the process.
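
In practice the script only submits statements; all of the work then happens inside BigQuery, so no data gets pulled back to the app. A minimal sketch of what such a statement could look like, assuming hypothetical tables project.dataset.staging (loaded as-is from the CSVs) and project.dataset.clean - the transformation itself is covered in detail below:

#standardSQL
-- Hypothetical tables: staging holds the raw CSV rows, clean is the sanitized copy
INSERT INTO `project.dataset.clean` (price, number, total_price)
SELECT
  CAST(REPLACE(REPLACE(price, ',', '.'), '$', '') AS FLOAT64),
  number,
  CAST(REPLACE(REPLACE(price, ',', '.'), '$', '') AS FLOAT64) * number
FROM `project.dataset.staging`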

convert a string like $123,50 into a number (integer, float, whatever), i.e. 123.5.
math operations on two or more columns of a single row, e.g. given price and number, compute total_price, which is price * number.

See the example below (for BigQuery Standard SQL):

#standardSQL
WITH `project.dataset.yourTable` AS (
  SELECT '$123,50' AS price, 5 AS number UNION ALL
  SELECT '$1,99', 99
)
SELECT 
  CAST(REPLACE(REPLACE(price, ',', '.'), '$', '') AS FLOAT64) AS price,
  CAST(REPLACE(REPLACE(price, ',', '.'), '$', '') AS FLOAT64) * number AS total_price
FROM `project.dataset.yourTable`

Or, to avoid the redundant CAST(REPLACE(REPLACE(price, ',', '.'), '$', '') AS FLOAT64), you can use the approach below:

#standardSQL
WITH `project.dataset.yourTable` AS (
  SELECT '$123,50' AS price, 5 AS number UNION ALL
  SELECT '$1,99', 99
)
SELECT 
  price,
  price * number AS total_price
FROM `project.dataset.yourTable`, 
UNNEST([CAST(REPLACE(REPLACE(price, ',', '.'), '$', '') AS FLOAT64)]) AS price  

In both cases the result is as below:

price   total_price  
123.5   617.5    
1.99    197.01    
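
For the rows with wrongly formatted data mentioned in the question, SAFE_CAST can be used instead of CAST: it returns NULL instead of failing the whole query, so bad rows can be filtered out (or routed to an errors table). A sketch on the same sample data, with one deliberately broken row added:

#standardSQL
WITH `project.dataset.yourTable` AS (
  SELECT '$123,50' AS price, 5 AS number UNION ALL
  SELECT 'oops', 7 UNION ALL  -- wrongly formatted row
  SELECT '$1,99', 99
)
SELECT
  clean_price AS price,
  clean_price * number AS total_price
FROM `project.dataset.yourTable`,
UNNEST([SAFE_CAST(REPLACE(REPLACE(price, ',', '.'), '$', '') AS FLOAT64)]) AS clean_price
WHERE clean_price IS NOT NULL  -- drop rows whose price could not be parsed
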
1
votes

Use Cloud Dataprep, https://cloud.google.com/dataprep

This product is designed for someone just like you. It provides an interface for building an ETL recipe and then uses Dataflow to execute the ETL, either on an ongoing basis or as a one-time run.