3
votes

We used the Google Dataflow service to batch load the same 10k JSON records coming from Kafka into Google Cloud Storage. Below is the breakdown of the files generated with Apache Beam's AvroIO, ParquetIO, and TextIO libraries respectively.

We assumed the Parquet files would have a smaller data footprint than Avro on GCP, since studies on HDP (Hortonworks) and CDH (Cloudera) showed similar results, as mentioned here: https://stackoverflow.com/a/31093105/4250322

However, the results on these 10k records indicated a smaller Avro size on GCS. Can this be assumed when choosing the data format? What other factors should we consider apart from the advantages mentioned here: https://cloud.google.com/blog/products/gcp/improve-bigquery-ingestion-times-10x-by-using-avro-source-format

We wish to choose the format that keeps the GCS storage cost, and the overall cost, to a minimum.


// Using ParquetIO to write Parquet output files (stagingLocation is the GCS output prefix)
pCollectionGenericRecords.apply("ParquetToGCS",
    FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(AVRO_SCHEMA))
        .to(stagingLocation));

// Using TextIO to write text output files
collection.apply(TextIO.write().to(stagingLocation));

// Using AvroIO to write Avro output files
pCollectionGenericRecords.apply("AvroToGCS",
    AvroIO.writeGenericRecords(AVRO_SCHEMA)
        .to(stagingLocation));

Update based on the suggestion:

Processing 0.6 million JSON records (259.48 MB) with the Dataflow service to Avro vs. Parquet format produced the following:

Avro output size = 52.8 MB

Parquet output size = 199.2 MB
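
One variable we did not explicitly control (an assumption on our side; these codec settings are not what our pipeline currently uses) is the compression codec each sink applies, which directly affects the output size. Pinning the same codec on both writes would make the comparison apples-to-apples, for example:

import org.apache.avro.file.CodecFactory;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

// Sketch: force Snappy on both sinks so differing codec defaults don't skew the sizes
pCollectionGenericRecords.apply("ParquetToGCS",
    FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(AVRO_SCHEMA)
            .withCompressionCodec(CompressionCodecName.SNAPPY))
        .to(stagingLocation));

pCollectionGenericRecords.apply("AvroToGCS",
    AvroIO.writeGenericRecords(AVRO_SCHEMA)
        .to(stagingLocation)
        .withCodec(CodecFactory.snappyCodec()));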

Doing a larger-scale test means running the Dataflow service and incurring costs. Is there an already available study we could leverage?


1 Answer

2
votes

You'd need bigger files to see the benefits of Parquet (you can expect those studies were done on files of around 256 MB). For streaming use cases, it's better to stick with Avro.
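
For example (a sketch on my side, reusing the pCollectionGenericRecords, AVRO_SCHEMA and stagingLocation names from your question), forcing a small number of shards gives you fewer, larger files, which is closer to the conditions those studies measured:

// Sketch: fewer shards => fewer, larger output files
pCollectionGenericRecords.apply("ParquetToGCS",
    FileIO.<GenericRecord>write()
        .via(ParquetIO.sink(AVRO_SCHEMA))
        .to(stagingLocation)
        .withNumShards(1));

pCollectionGenericRecords.apply("AvroToGCS",
    AvroIO.writeGenericRecords(AVRO_SCHEMA)
        .to(stagingLocation)
        .withNumShards(1));

Note that a single shard removes write parallelism, so this only makes sense for a size comparison, not for production throughput.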