Background: I have 30 days of data in 30 separate compressed files stored in Google Cloud Storage. I have to write them to 30 different partitions of the same BigQuery table. Each compressed file is around 750 MB.
I ran 2 experiments on the same data set on Google Dataflow today.
Experiment 1: I read each day's compressed file using TextIO, applied a simple ParDo transform to prepare TableRow objects, and wrote them directly to BigQuery using BigQueryIO. So basically 30 pairs of parallel, unconnected sources and sinks got created. But I found that at any point in time only 3 files were being read, transformed, and written to BigQuery, and the combined ParDo/BigQuery write throughput was around 6000-8000 elements/sec. Since only 3 of the 30 source/sink pairs were being processed at any time, the whole job was significantly slowed: in over 90 minutes only 7 out of 30 files had been written to their separate BigQuery partitions of the table. A rough sketch of this pipeline is shown below.
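For reference, this is roughly how such a pipeline could be wired up; a minimal sketch assuming the Dataflow Java SDK 1.x, with placeholder names (`my-bucket`, `my-project:my_dataset.my_table`, the `20160101`-style dates, and the trivial `FormatRowFn`) standing in for the real ones:

```java
import com.google.api.services.bigquery.model.TableRow;
import com.google.cloud.dataflow.sdk.Pipeline;
import com.google.cloud.dataflow.sdk.io.BigQueryIO;
import com.google.cloud.dataflow.sdk.io.TextIO;
import com.google.cloud.dataflow.sdk.options.PipelineOptions;
import com.google.cloud.dataflow.sdk.options.PipelineOptionsFactory;
import com.google.cloud.dataflow.sdk.transforms.DoFn;
import com.google.cloud.dataflow.sdk.transforms.ParDo;

public class PerDayPipeline {

  // Placeholder: turns one input line into a TableRow; real parsing/schema omitted.
  static class FormatRowFn extends DoFn<String, TableRow> {
    @Override
    public void processElement(ProcessContext c) {
      c.output(new TableRow().set("raw", c.element()));
    }
  }

  public static void main(String[] args) {
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);

    // 30 independent read -> transform -> write chains in one pipeline.
    for (int day = 1; day <= 30; day++) {
      String date = String.format("201601%02d", day);  // placeholder dates
      p.apply(TextIO.Read.named("Read" + date)
              // gzip decompression is inferred from the .gz extension (AUTO compression type)
              .from("gs://my-bucket/data-" + date + ".gz"))
       .apply(ParDo.named("Format" + date).of(new FormatRowFn()))
       .apply(BigQueryIO.Write.named("Write" + date)
              // "$" partition decorator targets that day's partition of the same table
              .to("my-project:my_dataset.my_table$" + date)
              .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
              .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
    }
    p.run();
  }
}
```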
Experiment 2: Here I first read each day's compressed file for all 30 days, applied the ParDo transform to each of the 30 resulting PCollections, and stored the 30 transformed PCollections in a PCollectionList object, so all 30 TextIO sources could be read in parallel. I then wrote each PCollection in the PCollectionList (each corresponding to one day's data) directly to BigQuery using BigQueryIO, so the 30 sinks were again supposed to be written in parallel (see the sketch after this paragraph). I found that out of the 30 parallel sources, again only 3 were actually being read and run through the ParDo transform, at a speed of around 20000 elements/sec. At the time of writing this question, when 1 hour had already elapsed, less than 50% of the compressed files had been read, and writing to the BigQuery table partitions had not even started.
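The second experiment only changes how the reads and writes are wired together; a sketch of that variation, reusing the pipeline `p`, the placeholder names, and the `FormatRowFn` from the sketch above:

```java
import java.util.ArrayList;
import java.util.List;

import com.google.cloud.dataflow.sdk.values.PCollection;
import com.google.cloud.dataflow.sdk.values.PCollectionList;

// First pass: build all 30 per-day PCollections of TableRows.
List<PCollection<TableRow>> perDay = new ArrayList<>();
for (int day = 1; day <= 30; day++) {
  String date = String.format("201601%02d", day);  // placeholder dates
  perDay.add(
      p.apply(TextIO.Read.named("Read" + date)
              .from("gs://my-bucket/data-" + date + ".gz"))
       .apply(ParDo.named("Format" + date).of(new FormatRowFn())));
}
PCollectionList<TableRow> allDays = PCollectionList.of(perDay);

// Second pass: write each member of the PCollectionList to its own partition.
for (int i = 0; i < allDays.size(); i++) {
  String date = String.format("201601%02d", i + 1);
  allDays.get(i).apply(BigQueryIO.Write.named("Write" + date)
      .to("my-project:my_dataset.my_table$" + date)
      .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_APPEND)
      .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
}
p.run();
```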
These problems seem to occur only when Google Dataflow reads compressed files. I had already asked a question about its slow reading of compressed files (Relatively poor performance when reading compressed files vis a vis normal text files kept in google storage using google dataflow) and was told that parallelizing the work would make reading faster, since only 1 worker reads a given compressed file and multiple sources would let multiple workers read multiple files. But this also does not seem to be working.
Is there any way to speed up this whole process of reading from multiple compressed files and writing to separate partitions of the same BigQuery table in a single Dataflow job at the same time?