0 votes

I'm not a developer, so please bear with me on this. I wasn't able to follow the PHP-based answer at Google BigQuery - Automating a Cron Job, so I don't know if that's even the same thing I'm looking for.

Anyway, I use Google Cloud Storage to store data, and several times throughout the day new CSVs are uploaded there. I then run BigQuery jobs to populate BigQuery tables with that data.

For reasons beyond my control, the CSVs contain duplicate data. So what I want to do is basically create a daily ETL job, perhaps running at 1 am every day, that appends all new data to the existing tables:

  1. Identify new files that have not been added (something like date = today - 1)
  2. Run a job on all the CSVs from step 1 to convert them to a temporary BigQuery table
  3. De-dupe the BigQuery table via SQL (I can do this in a variety of ways; see the example after this list)
  4. Insert the de-duped temp table into the main BigQuery table.
  5. Delete the temp table
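
Just to illustrate what I mean in step 3 (the dataset, table, and column names below are made up, and my real files have more columns), the de-dupe query could be as simple as grouping by every column:

    SELECT Timestamp, TransactionID
    FROM mydataset.temp_import
    GROUP BY Timestamp, TransactionID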

So basically I'm stuck at square one: I don't know how to do any of this in an automated fashion. I know BigQuery has an API, there's some documentation on cron jobs, and there's something called Cloud Dataflow, but before going down those rabbit holes I was hoping someone who has done this before could give me some hints. Like I said, I'm not a developer, so if there's a simpler way to accomplish this, that would be easier for me to run with.

Thanks for any help anyone can provide!

2
Not sure why this is downvoted; I'm working through the documentation from Google and it's not helping. – Sm Ldad

2 Answers

2 votes

There are a few ways to solve this, but I'd recommend something like this:

  1. Create a templated Dataflow pipeline that reads from GCS (source) and appends to BigQuery (sink).
  2. Your pipeline can remove duplicates directly itself. See here and here.
  3. Create a Cloud Function to monitor your GCS bucket.
  4. When a new file arrives, your Cloud Function is triggered automatically; it launches your Dataflow pipeline, which reads the new file, de-dupes it, and writes the results to BigQuery (see the example command after this list).
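
For example, once the template is staged in GCS, kicking it off for a new file boils down to a single request like the gcloud equivalent below (your Cloud Function would make the same request via the Dataflow REST API; the job name, template path, and parameter names are just placeholders that depend on how you write the pipeline):

    gcloud dataflow jobs run load-and-dedupe --gcs-location gs://mybucket/templates/dedupe_template --parameters inputFile=gs://mybucket/data/newfile.csv.gz,outputTable=mydataset.data
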
0 votes

So, no offense to Graham Polley, but I ended up using a different approach. Thanks to these pages (and a ton of random batch-file Google searching and trial and error):

  • how to get yesterday's date in a batch file
  • https://cloud.google.com/bigquery/bq-command-line-tool

    cscript //nologo C:\Desktop\yester.vbs > C:\Desktop\tempvar.txt &&

    set /p zvar=< C:\Desktop\tempvar.txt &&

    del C:\Desktop\tempvar.txt &&

    bq load --skip_leading_rows=1 data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 gs://mybucket/data/%%zvar:~0,4%%-%%zvar:~4,2%%-%%zvar:~6,2%%*.csv.gz Timestamp:TIMESTAMP,TransactionID:STRING &&

    bq query --destination_table=data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 "SELECT * FROM data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 GROUP BY 1,2" &&

    bq cp -a data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2 data.data &&

    bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_1 &&

    bq rm -f data.data_%%zvar:~0,4%%%%zvar:~4,2%%%%zvar:~6,2%%_2

A VBScript called yester.vbs prints yesterday's date in YYYYMMDD format. That value is saved into a variable, which is used to find yesterday's data files in GCS and load them into a table; a de-duped copy (created by grouping on all columns) is then appended to the main table, and the two intermediate tables are deleted.

The percent signs are doubled because the commands are saved in a .CMD file and run through Windows Task Scheduler.
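
In case the substring syntax looks cryptic: each %zvar:~start,length% chunk just pulls a piece out of that date string. For example, typed at a command prompt (so with single percent signs), and assuming yester.vbs had printed 20160715, the table name is built up like this:

    set zvar=20160715
    echo data.data_%zvar:~0,4%%zvar:~4,2%%zvar:~6,2%_1
    REM prints: data.data_20160715_1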