Background: I have 30 days of data in 30 separate compressed files stored in Google Cloud Storage. I have to write them to 30 different partitions of the same BigQuery table. Each compressed file is around 750 MB.
I ran 2 experiments on the same data set on Google Dataflow today.
Experiment 1: I read each day's compressed file using TextIO, applied a simple ParDo transform to prepare TableRow objects, and wrote them directly to BigQuery using BigQueryIO. So basically 30 pairs of parallel, unconnected sources and sinks got created. But I found that at any point in time only 3 files were being read, transformed, and written to BigQuery, and the combined ParDo/BigQuery write throughput was around 6000-8000 elements/sec. Since only 3 of the 30 source/sink pairs were being processed at any time, the whole job was significantly slowed: in over 90 minutes only 7 out of 30 files had been written to their separate BigQuery partitions of the table. A rough sketch of this pipeline is shown below.
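For reference, this is roughly how such a pipeline could be wired up; a minimal sketch assuming the Dataflow Java SDK 1.x, with placeholder names (`my-bucket`, `my-project:my_dataset.my_table`, the `20160101`-style dates, and the trivial `FormatRowFn`) standing in for the real ones:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class PerDayPipeline {

  // Placeholder: turns one input line into a TableRow; real parsing/schema omitted.
  static class FormatRowFn extends DoFn<String, TableRow> {
    @Override
    public void processElement(ProcessContext c) {
      c.output(new TableRow().set("raw", c.element()));
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // 30 independent read -> transform -> write chains in one pipeline.
    for (int day = 1; day <= 30; day++) {
      String date = String.format("201601%02d", day);  // placeholder dates
      p.apply(TextIO.Read.named("Read" + date)
              // gzip decompression is inferred from the .gz extension (AUTO compression type)
              .from("gs://my-bucket/data-" + date + ".gz"))
       .apply(ParDo.named("Format" + date).of(new FormatRowFn()))
       .apply(BigQueryIO.Write.named("Write" + date)
              // "$" partition decorator targets that day's partition of the same table
              .to("my-project:my_dataset.my_table$" + date)
              .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
              .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
    }
    p.run();
  }
}
```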
Experiment 2: Here I first read each day's compressed file for all 30 days, applied the ParDo transform to each of the 30 resulting PCollections, and stored the 30 transformed PCollections in a PCollectionList object, so all 30 TextIO sources could be read in parallel. I then wrote each PCollection in the PCollectionList (each corresponding to one day's data) directly to BigQuery using BigQueryIO, so the 30 sinks were again supposed to be written in parallel (see the sketch after this paragraph). I found that out of the 30 parallel sources, again only 3 were actually being read and run through the ParDo transform, at a speed of around 20000 elements/sec. At the time of writing this question, when 1 hour had already elapsed, less than 50% of the compressed files had been read, and writing to the BigQuery table partitions had not even started.
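The second experiment only changes how the reads and writes are wired together; a sketch of that variation, reusing the pipeline `p`, the placeholder names, and the `FormatRowFn` from the sketch above:

```java
import java.util.ArrayList;
import java.util.List;

import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

// First pass: build all 30 per-day PCollections of TableRows.
List<PCollection<TableRow>> perDay = new ArrayList<>();
for (int day = 1; day <= 30; day++) {
  String date = String.format("201601%02d", day);  // placeholder dates
  perDay.add(
      p.apply(TextIO.Read.named("Read" + date)
              .from("gs://my-bucket/data-" + date + ".gz"))
       .apply(ParDo.named("Format" + date).of(new FormatRowFn())));
}
PCollectionList<TableRow> allDays = PCollectionList.of(perDay);

// Second pass: write each member of the PCollectionList to its own partition.
for (int i = 0; i < allDays.size(); i++) {
  String date = String.format("201601%02d", i + 1);
  allDays.get(i).apply(BigQueryIO.Write.named("Write" + date)
      .to("my-project:my_dataset.my_table$" + date)
      .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
      .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
}
p.run();
```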
These problems seem to occur only when Google Dataflow reads compressed files. I had already asked a question about its slow reading of compressed files (Relatively poor performance when reading compressed files vis a vis normal text files kept in google storage using google dataflow) and was told that parallelizing the work would make reading faster, since only 1 worker reads a given compressed file and multiple sources would let multiple workers read multiple files. But this also does not seem to be working.
Is there any way to speed up this whole process of reading from multiple compressed files and writing to separate partitions of the same BigQuery table in a single Dataflow job at the same time?