Strategy for loading data into BigQuery and Google Cloud Storage from local disk

Question

I have two years of combined data, around 300 GB in size, on my local disk, which I extracted from Teradata. I have to load the same data into both Google Cloud Storage and a BigQuery table.

The final data in Google Cloud Storage should be segregated by day in compressed format (each day's data should be a single gz file). I also have to load the data into a day-partitioned BigQuery table, i.e. each day's data should be stored in one partition.

I first loaded the combined two years of data to Google Cloud Storage. Then I tried using Google Dataflow to segregate the data by day, using Dataflow's partitioning concept, and write it back to Google Cloud Storage (FYI, Dataflow partitioning is different from BigQuery partitioning). But Dataflow did not allow me to create 730 partitions (for two years), as it hit a 413 Request Entity Too Large error ("The size of serialized JSON representation of the pipeline exceeds the allowable limit").

So I ran the Dataflow job twice, once per year. Each run filtered one year's data and wrote it into separate files in Google Cloud Storage, but it could not compress them, as Dataflow currently cannot write compressed files.

Seeing the first approach fail, I thought of filtering one year's data from the combined data using partitioning in Dataflow as explained above, writing it directly to BigQuery, and then exporting it to Google Cloud Storage in compressed format. This process would have been repeated twice. But with this approach I could not write more than 45 days of data at once, as I repeatedly hit java.lang.OutOfMemoryError: Java heap space. So this strategy also failed.

Any help in figuring out a strategy for a date-wise segregated migration to Google Cloud Storage (in compressed format) and BigQuery would be greatly appreciated.

(I deleted an answer that was directed to a different question, apologies!) — Felipe Hoffa

Ben Chambers · Accepted Answer · 2016-08-11T18:05:23

Currently, partitioning the results is the best way to produce multiple output files/tables. What you're likely running into is the fact that each write allocates a buffer for the uploads, so if you have a partition followed by N writes, there are N buffers.
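To make the "N writes, N buffers" point concrete, here is a rough back-of-the-envelope sketch in plain Java. The buffer sizes are assumptions chosen for illustration only (they are not confirmed SDK defaults), and the total is a worst case that assumes every write is open at the same time on the same worker.

```java
final class UploadBufferEstimate {
    private static final long MIB = 1024L * 1024L;

    /** Worst-case bytes held by upload buffers if all writes are open at once. */
    static long totalBufferBytes(int concurrentWrites, long bufferBytesPerWrite) {
        return concurrentWrites * bufferBytesPerWrite;
    }

    public static void main(String[] args) {
        int numDays = 730;            // two years of daily outputs, as in the question
        long largeBuffer = 64 * MIB;  // ASSUMPTION: illustrative buffer size only
        long smallBuffer = 1 * MIB;   // ASSUMPTION: a reduced size one might configure

        System.out.println("Large buffers: "
            + totalBufferBytes(numDays, largeBuffer) / MIB + " MiB");
        System.out.println("Small buffers: "
            + totalBufferBytes(numDays, smallBuffer) / MIB + " MiB");
    }
}
```

Under these assumed numbers the total scales linearly with both the number of concurrent writes and the per-write buffer size, which is why either shrinking the buffers or limiting concurrency can help.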

There are two strategies for making this work.

1. You can reduce the size of the upload buffers using the uploadBufferSizeBytes option in GcsOptions. Note that this may slow down the uploads, since the buffers will need to be flushed more frequently.
2. You can apply a Reshuffle operation to each PCollection after the partition. This will limit the number of BigQuery sinks running concurrently, so fewer buffers will be allocated.

For example, you could do something like:

PCollection<Data> allData = ...;
PCollectionList<Data> partitions = allData.apply(Partition.of(...));

// Assuming the partitioning function has produced numDays partitions,
// and those can be mapped back to the day in some meaningful way:
for (int i = 0; i < numDays; i++) {
  String outputName = nameFor(i); // compute the output name
  partitions.get(i)
    .apply("Write_" + outputName, new ReshuffleAndWrite(outputName));
}
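The snippet leaves the mapping between partition index and day open. One possible way to fill it in is sketched below in plain Java. Everything here is an assumption for illustration: the start date, the dataset and table names, and the bodies of nameFor and tableNameFor are hypothetical. The `$yyyyMMdd` suffix is BigQuery's partition decorator syntax; whether the sink in your SDK version accepts it as a destination is something to verify.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;

final class DayPartitions {
    // ASSUMPTION: first day covered by the data set (hypothetical value).
    static final LocalDate START = LocalDate.of(2014, 1, 1);
    // BASIC_ISO_DATE renders a LocalDate as yyyyMMdd.
    static final DateTimeFormatter SUFFIX = DateTimeFormatter.BASIC_ISO_DATE;

    /** Index a partitioning function could return for a record dated 'day'. */
    static int partitionFor(LocalDate day, int numDays) {
        long index = ChronoUnit.DAYS.between(START, day);
        if (index < 0 || index >= numDays) {
            throw new IllegalArgumentException("Date out of range: " + day);
        }
        return (int) index;
    }

    /** Inverse mapping: partition index back to a yyyyMMdd day string. */
    static String nameFor(int i) {
        return START.plusDays(i).format(SUFFIX);
    }

    /** Hypothetical table spec that targets a single day partition. */
    static String tableNameFor(String outputName) {
        return "my_dataset.my_table$" + outputName;
    }

    public static void main(String[] args) {
        int i = partitionFor(LocalDate.of(2014, 3, 1), 730);
        System.out.println(i + " -> " + tableNameFor(nameFor(i)));
    }
}
```

Keeping the two mappings as exact inverses means the index returned by the partitioning function and the output name computed in the loop always refer to the same day.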

The loop above makes use of these two helper PTransforms:

private static class Reshuffle<T>
  extends PTransform<PCollection<T>, PCollection<T>> {
  @Override
  public PCollection<T> apply(PCollection<T> in) {
    return in
      .apply("Random Key", WithKeys.of(
          new SerializableFunction<T, Integer>() {
            @Override
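            public Integer apply(T value) {
              return ThreadLocalRandom.current().nextInt();
            }
          }))
      .apply("Shuffle", GroupByKey.<Integer, T>create())
      // GroupByKey yields KV<Integer, Iterable<T>>, so drop the keys
      // and then flatten the grouped values back into single elements.
      .apply("Remove Key", Values.<Iterable<T>>create())
      .apply("Flatten", Flatten.<T>iterables());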
  }
}

private static class ReshuffleAndWrite 
  extends PTransform<PCollection<Data>, PDone> {

  private final String outputName;
  public ReshuffleAndWrite(String outputName) {
    this.outputName = outputName;
  }

  @Override
  public PDone apply(PCollection<Data> in) {
    return in
      .apply("Reshuffle", new Reshuffle<Data>())
      .apply("Write", BigQueryIO.Write.to(tableNameFor(outputName))
        .withSchema(schema)
        .withWriteDisposition(WriteDisposition.WRITE_TRUNCATE));
  }
}
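For readers unfamiliar with the random-key pattern, the following plain-Java sketch mimics the three steps of the Reshuffle helper (assign a random key, group by key, drop the key) on an in-memory list. It is only an analogy: it runs on one machine and does not use the Dataflow SDK, where the grouping step is what is assumed to act as a boundary between the partition and the write. It does show that the elements themselves pass through unchanged, though not necessarily in the same order.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ThreadLocalRandom;

final class ReshuffleSketch {
    /** Returns the same elements as 'input', regrouped by a random key. */
    static <T> List<T> reshuffle(List<T> input) {
        // "Random Key" + "Shuffle": bucket each element under a random int key.
        Map<Integer, List<T>> grouped = new HashMap<>();
        for (T value : input) {
            int key = ThreadLocalRandom.current().nextInt();
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
        }
        // "Remove Key": discard the keys and flatten the groups again.
        List<T> output = new ArrayList<>();
        for (List<T> group : grouped.values()) {
            output.addAll(group);
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> rows = List.of("row-a", "row-b", "row-c", "row-d");
        List<String> shuffled = reshuffle(rows);
        System.out.println("in=" + rows.size() + " out=" + shuffled.size());
    }
}
```

Because the keys are random, nothing about the data's content influences the grouping; the key exists only to force the regrouping step.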
