Before talking about the Parquet side of the equation, one thing to consider is how the data will be used after you save it to Parquet.
If it's going to be read or processed often, consider what the access patterns are and partition the data accordingly.
One common pattern is partitioning by date, because most of our queries have a time range.
Partitioning your data appropriately will have a much bigger impact on the performance of reading that data after it is written than tuning the Parquet block size will.
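To make the date-partitioning idea concrete, here is a minimal sketch in plain Python of why it helps a time-range query. It assumes a Hive-style `date=YYYY-MM-DD` directory layout; the bucket name and the helper functions are hypothetical, made up for illustration only.

```python
from datetime import date, timedelta
from typing import List


def partition_path(base: str, day: date) -> str:
    """Hive-style partition directory for one day (assumed layout)."""
    return f"{base}/date={day.isoformat()}"


def partitions_for_range(base: str, start: date, end: date) -> List[str]:
    """Directories a reader has to touch for an inclusive [start, end] query."""
    days = (end - start).days + 1
    return [partition_path(base, start + timedelta(days=i)) for i in range(days)]


# Hypothetical bucket; a three-day query only needs three directories,
# no matter how many days of data exist in total.
paths = partitions_for_range(
    "s3://my-bucket/events", date(2024, 1, 30), date(2024, 2, 1)
)
for p in paths:
    print(p)
```

A reader that understands this layout can skip every other directory entirely, which is where most of the gain comes from.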
Now, onto Parquet: the rule of thumb is for the Parquet block size to be roughly the block size of the underlying file system. That matters when you're using HDFS, but it doesn't matter much when you're using S3.
Again, the main consideration for the Parquet block size is how you're reading the data.
Since a Parquet block basically has to be reconstructed in memory, the larger it is, the more memory each downstream reader needs. Larger blocks also mean you need fewer workers, so if your downstream workers have plenty of memory, you can use larger Parquet blocks, which will be slightly more efficient.
However, for better scalability, it's usually better to have several smaller objects, especially ones laid out according to some partitioning scheme, than one large object, which may become a performance bottleneck depending on your use case.
To sum it up:
- a larger Parquet block size means a slightly smaller file size (since compression works better on larger blocks) but a larger memory footprint when serializing/deserializing
- the optimal file size depends on your setup
- if you store 30GB with a 512MB Parquet block size, since Parquet is a splittable file format and Spark relies on Hadoop's getSplits(), the first step in your Spark job will have 60 tasks. They will use byte-range fetches to get different parts of the same S3 object in parallel. However, you'll get better performance if you break it down into several smaller (preferably partitioned) S3 objects, since they can be written in parallel (one large file has to be written sequentially) and will most likely also have better read performance when accessed by a large number of readers.
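The 60-task figure above is just the data size divided by the block size. As a rough sketch (the function name is made up, and it assumes one task per block-sized split, whereas the real count depends on your reader and its split settings):

```python
GB = 1024 ** 3
MB = 1024 ** 2


def estimated_tasks(total_bytes: int, block_bytes: int) -> int:
    """Rough task count, assuming one task per block-sized split."""
    if block_bytes <= 0:
        raise ValueError("block_bytes must be positive")
    # Integer ceiling division, so a partial last block still counts as a task.
    return (total_bytes + block_bytes - 1) // block_bytes


tasks = estimated_tasks(30 * GB, 512 * MB)
print(tasks)  # 60
```

Halving the block size to 256MB doubles the estimate to 120 tasks, which is the trade-off between parallelism and per-task memory described above.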