0 votes

I used Google Dataflow to read an 11.57 GB file from Cloud Storage and write it to Google BigQuery. It took around 12 minutes with 30 workers.

I then compressed the same file (its size became 1.06 GB), read it again from Google Cloud Storage using Dataflow, and wrote it to BigQuery. This time it took around 31 minutes with the same 30 workers.

Both Dataflow jobs had the same pipeline options; the only difference was that the input file was uncompressed in the first job and compressed in the second.
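For context, the pipeline has roughly the following shape (a minimal sketch using the Apache Beam Java SDK on Dataflow; the bucket, table, and column names below are placeholders, not the actual ones):

    import com.google.api.services.bigquery.model.TableRow;
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.transforms.DoFn;
    import org.apache.beam.sdk.transforms.ParDo;

    public class GcsToBigQuery {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        p.apply("ReadFromGCS",
                // TextIO picks the codec from the file extension; a .gz file is
                // decompressed automatically but cannot be split across workers.
                TextIO.read().from("gs://my-bucket/input.csv.gz"))
         .apply("ParseToTableRow", ParDo.of(new DoFn<String, TableRow>() {
           @ProcessElement
           public void processElement(ProcessContext c) {
             // Illustrative parsing only: split the CSV line and map two columns.
             String[] fields = c.element().split(",");
             c.output(new TableRow().set("col1", fields[0]).set("col2", fields[1]));
           }
         }))
         .apply("WriteToBigQuery",
                BigQueryIO.writeTableRows()
                    .to("my-project:my_dataset.my_table")
                    .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER)
                    .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

        p.run().waitUntilFinish();
      }
    }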

It seems there is a huge drop in performance when Google Dataflow reads compressed files.

The throughput of the ParDo and BigQueryIO transforms drops by more than 50% when reading the compressed file.

It does not improve even when I increase the number of workers to 200; it still took 28 minutes to read the same compressed file and write it to BigQuery.

Is there a way to speed up the entire process when reading compressed files?

1
Guessing: is it compressed as one file? Does it have to be completely decompressed before processing can start? Could you compress it in 'chunks' so it can be decompressed in parallel? – Ryan Vincent
@Ryan: Yes, the file is compressed as one file and it has to remain a single compressed file. It cannot be compressed in chunks, so we cannot decompress it in parallel. – abhishek jha

1 Answer

1 vote

When reading compressed data, each file can only be processed by one worker, because a compressed file cannot be split at arbitrary offsets the way an uncompressed file can; when reading uncompressed data, the work can be parallelized across many workers. Since you have only one file, that explains the performance difference you are seeing.

The best options for speeding this up are to use uncompressed input or to split the data into multiple smaller compressed files. Alternatively, to reduce cost, you could run on fewer workers when reading the single compressed file, since the extra workers cannot help with a file that only one of them can read.
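For example, if the data can be pre-split into several compressed shards before upload, a sketch of that approach looks like this (the shard count, paths, and names are illustrative only, using the Beam Java SDK):

    // Sharded-input variant (sketch). Assumes the file was split and compressed
    // before upload, e.g. with:
    //   split -n l/30 input.csv shard- && gzip shard-*
    //   gsutil cp shard-*.gz gs://my-bucket/shards/
    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.options.PipelineOptionsFactory;
    import org.apache.beam.sdk.values.PCollection;

    public class ReadShardedGzip {
      public static void main(String[] args) {
        Pipeline p = Pipeline.create(
            PipelineOptionsFactory.fromArgs(args).withValidation().create());

        // Each .gz shard is still decompressed by a single worker, but with ~30
        // shards Dataflow can hand one shard to each of the 30 workers in parallel.
        PCollection<String> lines =
            p.apply("ReadShards", TextIO.read().from("gs://my-bucket/shards/shard-*.gz"));

        // ...then the same ParDo and BigQueryIO.writeTableRows() steps as in your current job.

        p.run().waitUntilFinish();
      }
    }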