
I am using DynamicDestinations (from BigQueryIO) to export data from one Cassandra table to multiple Google BigQuery tables. The process consists of several steps, including writing the prepared data to Google Cloud Storage (as JSON files) and then loading those files into BigQuery via load jobs.
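For context, a minimal sketch of such a write, assuming hypothetical routing logic (the `table_key` field, the project/dataset names, the schema, and the bucket path below are all placeholders):

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Collections;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.io.gcp.bigquery.DynamicDestinations;
import org.apache.beam.sdk.io.gcp.bigquery.TableDestination;
import org.apache.beam.sdk.options.ValueProvider.StaticValueProvider;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.ValueInSingleWindow;

static void writeToDynamicTables(PCollection<TableRow> rows) {
  rows.apply("WriteToBigQuery", BigQueryIO.<TableRow>write()
      .to(new DynamicDestinations<TableRow, String>() {
        @Override
        public String getDestination(ValueInSingleWindow<TableRow> element) {
          // Hypothetical routing: derive the destination key from the row.
          return (String) element.getValue().get("table_key");
        }

        @Override
        public TableDestination getTable(String destination) {
          // Assumed naming scheme for the target tables.
          return new TableDestination(
              "my-project:my_dataset." + destination,
              "Table for key " + destination);
        }

        @Override
        public TableSchema getSchema(String destination) {
          // Placeholder schema; the real one depends on the Cassandra data.
          return new TableSchema().setFields(Collections.singletonList(
              new TableFieldSchema().setName("table_key").setType("STRING")));
        }
      })
      .withFormatFunction(row -> row)
      // FILE_LOADS stages files in GCS, then runs BigQuery load jobs.
      .withMethod(BigQueryIO.Write.Method.FILE_LOADS)
      .withCustomGcsTempLocation(
          StaticValueProvider.of("gs://my_bucket/BigQueryWriteTemp")));
}
```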

The problem is that the export process ended with an out-of-memory error at the last step (loading the files from Google Cloud Storage into BigQuery). However, the prepared files with all of the data remain in GCS. There are three directories in the BigQueryWriteTemp location:

[screenshot: the three directories under the BigQueryWriteTemp location]

And there are a lot of files with non-obvious names:

[screenshot: temp files with opaque, generated names]

The questions are: What is the storage structure of these files? How can I match the files with the tables (table names) they were prepared for? How can I use the files to resume the export process from the load-jobs step? Can I use some piece of Beam code for that?


1 Answer


These files, if you're using Beam 2.3.0 or earlier, contain JSON data to be imported into BigQuery using its load job API. However:

  • This is an implementation detail that you cannot rely on in general. It is very likely to change in future versions of Beam (JSON is horribly inefficient).
  • It is not possible to match these files with the tables they are intended for - that mapping was stored in the internal state of the pipeline that failed.
  • There is also no way to know how much data was written to these files and how much wasn't. The files may contain only partial data: maybe your pipeline failed before creating some of the files, or after some of them were already loaded into BigQuery and deleted.
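For completeness: if you could determine which table a given file was intended for (which, per the points above, you generally cannot), a manual load job over such a newline-delimited JSON file would look roughly like this sketch using the google-cloud-bigquery Java client. The file path, dataset, and table names here are placeholders:

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.FormatOptions;
import com.google.cloud.bigquery.Job;
import com.google.cloud.bigquery.JobInfo;
import com.google.cloud.bigquery.LoadJobConfiguration;
import com.google.cloud.bigquery.TableId;

public class ManualLoad {
  public static void main(String[] args) throws InterruptedException {
    BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

    // Placeholder file and table; in practice the file-to-table mapping
    // was lost with the failed pipeline's internal state.
    String sourceUri = "gs://my_bucket/BigQueryWriteTemp/some_temp_file";
    TableId tableId = TableId.of("my_dataset", "my_table");

    LoadJobConfiguration config =
        LoadJobConfiguration.newBuilder(tableId, sourceUri)
            .setFormatOptions(FormatOptions.json())  // newline-delimited JSON
            .build();

    // Run the load job and surface any failure.
    Job job = bigquery.create(JobInfo.of(config)).waitFor();
    if (job.getStatus().getError() != null) {
      throw new RuntimeException(job.getStatus().getError().toString());
    }
  }
}
```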

Basically, you'll need to rerun the pipeline and fix the OOM issue so that it succeeds.

For debugging OOM issues, I suggest taking a heap dump. Dataflow can write heap dumps to GCS using --dumpHeapOnOOM --saveHeapDumpsToGcsPath=gs://my_bucket/. You can examine these dumps using any Java memory profiler, such as Eclipse MAT or YourKit. You can also post your code as a separate SO question and ask for advice on reducing its memory usage.
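As a sketch, the same heap-dump flags can also be set programmatically through Dataflow's DataflowPipelineDebugOptions (the bucket path is a placeholder):

```java
import org.apache.beam.runners.dataflow.options.DataflowPipelineDebugOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HeapDumpSetup {
  public static void main(String[] args) {
    DataflowPipelineDebugOptions options = PipelineOptionsFactory
        .fromArgs(args)
        .withValidation()
        .as(DataflowPipelineDebugOptions.class);

    // Equivalent to --dumpHeapOnOOM --saveHeapDumpsToGcsPath=gs://my_bucket/
    options.setDumpHeapOnOOM(true);
    options.setSaveHeapDumpsToGcsPath("gs://my_bucket/");

    Pipeline p = Pipeline.create(options);
    // ... build and run the pipeline as usual ...
  }
}
```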