5 votes

I have around 100 GB of data per day which I write to S3 using Spark. The write format is Parquet. The application that writes this runs Spark 2.3.

The 100 GB of data is further partitioned, and the largest partition is 30 GB. For this case, let's just consider that 30 GB partition.

We are planning to migrate all of this data and rewrite it to S3 using Spark 2.4. Initially we didn't decide on a file size or block size when writing to S3. Now that we are going to rewrite everything, we want to take the optimal file size and Parquet block size into consideration.

  1. What is the optimal file size when writing Parquet to S3?
  2. Can we write one 30 GB file with a Parquet block size of 512 MB? How would reading work in that case?
  3. Same as #2, but with a Parquet block size of 1 GB?

1 Answer

11 votes

Before getting to the Parquet side of the equation, one thing to consider is how the data will be used after you save it to Parquet. If it's going to be read/processed often, you may want to look at the access patterns and partition it accordingly. One common pattern is partitioning by date, because most queries have a time range. Partitioning your data appropriately will have a much bigger impact on the performance of anything that uses that data after it is written.
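As a rough illustration, a minimal Spark (Scala) sketch of writing date-partitioned Parquet to S3; the bucket, paths, and the event_date column are assumptions for the example, not something from your setup:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("rewrite-partitioned").getOrCreate()

    // Hypothetical source path and column name -- adjust to your schema.
    val events = spark.read.parquet("s3a://my-bucket/raw/events/")

    events.write
      .partitionBy("event_date")   // partition on the column most queries filter by
      .mode("overwrite")
      .parquet("s3a://my-bucket/curated/events/")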

Now, on to Parquet: the rule of thumb is for the Parquet block size to be roughly the block size of the underlying file system. That matters when you're using HDFS, but it doesn't matter much when you're using S3.

Again, the consideration for the Parquet block size is how you're reading the data. Since a Parquet block essentially has to be reconstructed in memory, the larger it is, the more memory is needed downstream. You will also need fewer workers, so if your downstream workers have plenty of memory, you can use larger Parquet blocks, as that is slightly more efficient.
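If you do decide to tune it, one way (a sketch, assuming a SparkSession called spark and hypothetical paths) is to set the standard parquet-mr property parquet.block.size on the Hadoop configuration before writing:

    // 512 MB row groups (Parquet blocks); adjust to your downstream memory budget.
    spark.sparkContext.hadoopConfiguration
      .setInt("parquet.block.size", 512 * 1024 * 1024)

    val events = spark.read.parquet("s3a://my-bucket/raw/events/")
    events.write
      .mode("overwrite")
      .parquet("s3a://my-bucket/curated/events/")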

However, for better scalability it's usually better to have several smaller objects, especially following some partitioning scheme, rather than one large object, which may become a performance bottleneck depending on your use case.
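One way to get several smaller objects out of that 30 GB partition is to repartition before writing; the target of 60 output files of roughly 512 MB each, and the paths, are just assumptions for illustration:

    // Split the large partition into ~60 output files of roughly 512 MB each.
    val big = spark.read.parquet("s3a://my-bucket/raw/events/region=big/")

    big.repartition(60)
      .write
      .mode("overwrite")
      .parquet("s3a://my-bucket/curated/events/region=big/")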

To sum it up:

  • a larger Parquet block size means a slightly smaller file size (since compression works better on larger blocks) but a larger memory footprint when serializing/deserializing
  • the optimal file size depends on your setup
  • if you store 30 GB with a 512 MB Parquet block size, then since Parquet is a splittable file format and Spark relies on Hadoop's getSplits(), the first stage of your Spark job will have 60 tasks. They will use byte-range fetches to read different parts of the same S3 object in parallel. However, you'll likely get better performance by breaking it down into several smaller (preferably partitioned) S3 objects, since they can be written in parallel (one large file has to be written sequentially) and will most likely also read better when accessed by a large number of readers; a quick way to check the resulting task count is sketched after this list.
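As a sanity check on the split behaviour above, you can read the data back and look at the number of input partitions. Note that for DataFrame reads the split size is also capped by spark.sql.files.maxPartitionBytes (128 MB by default), so the task count you see may be higher than the row-group count; the path here is hypothetical:

    // How many read tasks will the first stage have for this layout?
    val readBack = spark.read.parquet("s3a://my-bucket/curated/events/")
    println(readBack.rdd.getNumPartitions)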