I have a bunch of big CSV files that need to be imported into BigQuery. Some of them contain rows with badly formatted data, and some rows need to be modified (e.g. computing a new value from two columns of the same row), etc. What is the best way to clean this up? Should I just load all the dirty data into a table and then run a Cloud Functions script on each row to sanitize it, modify it if needed, and store it in another "clean" table? Or should I do this with App Engine? Or maybe I should have App Engine process each CSV and insert the rows one by one into BigQuery?
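(To make the first option concrete, this is roughly what I mean by loading the dirty data as-is into a staging table first; all bucket/dataset/table names below are made up.)

```js
// Sketch only: names are placeholders for my real buckets/datasets/tables.
const { BigQuery } = require('@google-cloud/bigquery');
const { Storage } = require('@google-cloud/storage');

const bigquery = new BigQuery();
const storage = new Storage();

async function loadDirtyCsv() {
  // Load the raw CSV from Cloud Storage into a "dirty" staging table as-is.
  await bigquery
    .dataset('my_dataset')
    .table('orders_raw')
    .load(storage.bucket('my-bucket').file('orders.csv'), {
      sourceFormat: 'CSV',
      skipLeadingRows: 1,
      autodetect: true,
      writeDisposition: 'WRITE_APPEND',
    });
}
```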
Update:
I haven't found a way to run transformations/validations within GBQ without pulling the data into the app that invokes them. For example, an API server (running on Kubernetes Engine) is responsible for file uploads and then triggers the import of the files. But then what? The Node.js SDK has CRUD operations, but they require pulling the data out of GBQ, which is severely inefficient. Is there a way to trigger changes within GBQ itself?
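What I would like, ideally, is something along the lines of the sketch below: kick off a query job from the Node.js SDK and have the transformation itself happen inside BigQuery, with only the job status coming back to the app. The dataset/table names here are made up and I don't know whether this is the intended pattern:

```js
// Sketch only: dataset/table names are placeholders.
const { BigQuery } = require('@google-cloud/bigquery');
const bigquery = new BigQuery();

async function transformInsideBigQuery() {
  // The SELECT runs entirely inside BigQuery and writes into a "clean" table;
  // no rows are streamed back to the application.
  const [job] = await bigquery.createQueryJob({
    query: `
      CREATE OR REPLACE TABLE my_dataset.orders_clean AS
      SELECT *            -- transformations/validations would go here
      FROM my_dataset.orders_raw
    `,
  });
  await job.getQueryResults(); // just waits for the job to finish
}
```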
To be specific, here are some examples of the CSV row transformations I need (a sketch of the query I have in mind follows the list):
- converting a string like `$123,50` into a number, e.g. `123.5` (float or whatever);
- math operations on two or more columns from a single row, e.g. I have `price` and `number` and I need to get `total_price`, which is `price * number`.
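For those two cases, the query I'd plug into a job like the one above would be something like this (again just a sketch; table and column names are placeholders, and I'm not sure it's the idiomatic way to do it):

```js
// Sketch: clean the price string and compute total_price entirely inside
// BigQuery. `my_dataset.orders_raw` / `orders_clean` are placeholder names.
const query = `
  CREATE OR REPLACE TABLE my_dataset.orders_clean AS
  SELECT
    -- "$123,50" -> 123.5: strip the "$" and treat "," as the decimal separator
    SAFE_CAST(REPLACE(REPLACE(price, '$', ''), ',', '.') AS FLOAT64) AS price,
    number,
    -- total_price = price * number, computed from the cleaned-up price
    SAFE_CAST(REPLACE(REPLACE(price, '$', ''), ',', '.') AS FLOAT64) * number AS total_price
  FROM my_dataset.orders_raw
`;

// Run it with the same createQueryJob call as in the sketch above.
```

As far as I understand, `SAFE_CAST` would at least turn badly formatted values into NULLs instead of failing the whole job, and those rows could then be filtered out. Is this the right direction, or is there a better way to do this kind of cleanup?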