I'm not a developer so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing as what I'm looking for.
Anyway, I use Google Cloud to store data, and several times throughout the day data is uploaded into CSVs there. I use BigQuery to run jobs to populate BigQuery tables with the data there.
Because of reasons beyond my control, the CSVs have duplicate data. So what I want to do is basically create a daily ETL to append all new data to the existing tables, perhaps running at 1 am every day:
- Identify new files that have not been added (something like date = today - 1)
- Run a job on all the CSVs from step 1 to convert them to a temporary BigQuery table
- De-dupe the BigQuery table via SQL (I can do this in a variety of ways)
- Insert the de-duped temp table into the BigQuery table.
- Delete the temp table
So basically I'm stuck at square 1 - I don't know how to do any of this in an automated fashion. I know BigQuery has an API, and there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone else may have had experience with this and could give me some hints. Like I said, I'm not a developer so if there's a more simplistic way to accomplish this that would be easier for me to run with.
Thanks for any help anyone can provide!