I used Google Dataflow to read an 11.57 GB file from Cloud Storage and write its records to Google BigQuery. It took around 12 minutes with 30 workers.
I then compressed the same file (its size became 1.06 GB), read it from Cloud Storage with Google Dataflow again, and wrote it to BigQuery. This time it took around 31 minutes with the same 30 workers.
Both Dataflow jobs had the same pipeline options; the only difference was that the input file in the first job was uncompressed, while the input file in the second job was compressed.
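For reference, the pipeline shape is roughly the following. This is a minimal sketch using the Apache Beam Java SDK; the bucket path, table name, and schema are placeholders, and in the compressed run the input is simply the .gz file, which TextIO decompresses automatically based on the file extension:

```java
import com.google.api.services.bigquery.model.TableFieldSchema;
import com.google.api.services.bigquery.model.TableRow;
import com.google.api.services.bigquery.model.TableSchema;
import java.util.Arrays;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.io.gcp.bigquery.BigQueryIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;

public class GcsToBigQuery {
  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // Placeholder schema for the illustration.
    TableSchema schema = new TableSchema().setFields(Arrays.asList(
        new TableFieldSchema().setName("col1").setType("STRING"),
        new TableFieldSchema().setName("col2").setType("STRING")));

    p
        // For the uncompressed run the input is "gs://my-bucket/input.csv";
        // for the compressed run it is "gs://my-bucket/input.csv.gz".
        .apply("ReadFromGcs", TextIO.read().from("gs://my-bucket/input.csv.gz"))
        // ParDo that parses each line into a TableRow.
        .apply("ParseLine", ParDo.of(new DoFn<String, TableRow>() {
          @ProcessElement
          public void processElement(ProcessContext c) {
            String[] fields = c.element().split(",");
            c.output(new TableRow().set("col1", fields[0]).set("col2", fields[1]));
          }
        }))
        // BigQueryIO write to a placeholder table.
        .apply("WriteToBigQuery", BigQueryIO.writeTableRows()
            .to("my-project:my_dataset.my_table")
            .withSchema(schema)
            .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_IF_NEEDED)
            .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND));

    p.run();
  }
}
```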
There seems to be a huge drop in performance when Google Dataflow reads compressed files.
The throughput of the ParDo transform and the BigQueryIO transform drops by more than 50% when reading compressed files.
It does not seem to improve even when I increase the number of workers to 200; it still took 28 minutes to read the same compressed file and write to BigQuery.
Is there a way to speed up the entire process when reading compressed files?