0 votes

I am trying to do a load operation in BigQuery from GCS files using load_job in Ruby.

The problem is that when I have multiple files in GCS targeting different tables, there's a chance some of the load jobs might fail due to validation or network issues, leaving inconsistent data in BigQuery. Say I want to load the last hour's data, which is stored in 5 files: if even 1 of these load jobs fails, I'll end up with bad data for analytics.

Is there a way I can batch all these load jobs into a single atomic request to BigQuery?
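For illustration, here is a minimal sketch of this kind of per-table load using the google-cloud-bigquery gem (the project, dataset, table, and file names are placeholders):

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project_id: "my-project"
dataset  = bigquery.dataset "analytics"

# One GCS file per destination table for the last hour (placeholder names).
files = {
  "events"   => "gs://my-bucket/last_hour/events.json",
  "sessions" => "gs://my-bucket/last_hour/sessions.json",
  "users"    => "gs://my-bucket/last_hour/users.json"
}

jobs = files.map do |table_id, uri|
  # Each call creates an independent BigQuery load job,
  # so one job can fail while the others succeed.
  dataset.load_job table_id, uri, format: "json", write: "append"
end

jobs.each do |job|
  job.wait_until_done!
  warn "Load failed: #{job.error}" if job.failed?
end
```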

1
Can you share some code showing how you are trying to do it? Are you able to catch failures and retry in case of error? What about using temporary tables to make sure the data is moved into BigQuery properly, and after that copying it to your final tables? – hlagos
@hlagos, even if I create temporary tables, these problems will still exist when sending the copy request, because the copies will also create different jobs in BQ. Or am I missing something with that approach? – Akhil Yadav
If your concern is about network errors while submitting to BQ, that shouldn't be the case once all your data is already inside BigQuery. I would expect it to be much more stable once all your data is in temp tables inside BigQuery and you perform the operations table to table. – hlagos
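To illustrate the temp-table idea from the comments, here is a minimal sketch (again assuming the google-cloud-bigquery gem; the staging and final table names are placeholders): load every file into a staging table first, and only start the table-to-table copies once all loads have succeeded, so a failed load never touches the final tables.

```ruby
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project_id: "my-project"
dataset  = bigquery.dataset "analytics"

# staging table => [final table, GCS file]; assumes the staging tables
# already exist with the right schema (placeholder names).
staging = {
  "events_staging"   => ["events",   "gs://my-bucket/last_hour/events.json"],
  "sessions_staging" => ["sessions", "gs://my-bucket/last_hour/sessions.json"]
}

# Step 1: load every file into its staging table.
load_jobs = staging.map do |staging_id, (_final_id, uri)|
  dataset.load_job staging_id, uri, format: "json", write: "truncate"
end
load_jobs.each(&:wait_until_done!)

if load_jobs.any?(&:failed?)
  # Nothing has touched the final tables yet, so the loads can simply be retried.
  raise "Staging load failed: #{load_jobs.select(&:failed?).map(&:error)}"
end

# Step 2: copy staging tables to the final tables (server-side, no re-upload).
copy_jobs = staging.map do |staging_id, (final_id, _uri)|
  dataset.table(staging_id).copy_job dataset.table(final_id), write: "append"
end
copy_jobs.each(&:wait_until_done!)
```

Note that the copy jobs are still separate BigQuery jobs, so this narrows the window for partial results rather than eliminating it, which is what the follow-up comment above is getting at.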

1 Answer

0 votes

Why don't you try a BQ Sink or streaming data into BQ? With Sinks you will be relying on the underlying BQ architecture, which is quite stable and consistent, to move data from text files to BQ tables. With streaming data, you will have more control over your transactions. You can then ensure that your data is moved correctly row by row.
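For the streaming option, a minimal sketch with the google-cloud-bigquery gem (the dataset, table, and rows are placeholders):

```ruby
require "time"
require "google/cloud/bigquery"

bigquery = Google::Cloud::Bigquery.new project_id: "my-project"
table    = bigquery.dataset("analytics").table("events")

rows = [
  { "event" => "click", "ts" => Time.now.utc.iso8601 },
  { "event" => "view",  "ts" => Time.now.utc.iso8601 }
]

# Streaming insert: each row is accepted or rejected individually,
# so failed rows can be inspected and retried on their own.
response = table.insert rows
unless response.success?
  response.insert_errors.each do |err|
    warn "Row #{err.index} failed: #{err.errors}"
  end
end
```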