5 votes

I've been reading a few questions on this topic, as well as several forums, and they all seem to say that each of the .parquet files coming out of Spark should be either 64 MB or 1 GB in size. I still can't work out which scenarios call for which of those file sizes, or the reasons behind them, apart from HDFS splitting files into 64 MB blocks.

My current testing scenario is the following.

dataset
  .coalesce(n) // 'n' being 4 or 48 - reasons explained below
  .write
  .mode(SaveMode.Append)
  .partitionBy(CONSTANTS)
  .option("basepath", outputPath)
  .parquet(outputPath)

I'm currently handling a total of 2.5 GB to 3 GB of daily data, which will be split and saved into daily buckets per year. The reason 'n' is 4 or 48 is just for testing: since I know the size of my test set in advance, I try to get the file sizes as close to 64 MB or 1 GB as I can. I haven't implemented code to buffer the data until it reaches the exact size I need before saving.
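
For illustration only, here is a rough way to derive 'n' from an estimated input size and a target file size instead of hard-coding 4 or 48. The variable names and the 3 GB estimate are assumptions, and compressed Parquet output is usually smaller than the raw input, so treat the result as a starting point rather than an exact answer:

val targetFileBytes = 1024L * 1024 * 1024                 // aim for roughly 1 GB per file
val estimatedInputBytes = 3L * 1024 * 1024 * 1024         // assumed ~3 GB of daily data
val n = math.max(1, math.ceil(estimatedInputBytes.toDouble / targetFileBytes).toInt)
// then: dataset.coalesce(n) ... as in the snippet above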

So my question here is...

Should I take file size into account that much if I'm not planning to use HDFS, and will merely store and retrieve the data from S3?

And also, what would be the optimal size for daily datasets of around 10 GB maximum, if I am planning to use HDFS to store the resulting .parquet files?

Any other optimization tip would be really appreciated!


1 Answer

12 votes

You can control the split size of parquet files, provided you save them with a splittable compression like snappy. For the s3a connector, just set fs.s3a.block.size to a different number of bytes.
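
For example, this is one place such a setting could go; the 128 MB value below is only an assumption, not a recommendation from the answer:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.block.size", 128L * 1024 * 1024)  // split size reported by the s3a connector
  .getOrCreate()
// The same property can also be passed at submit time:
//   --conf spark.hadoop.fs.s3a.block.size=134217728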

Smaller split size

  • More workers can work on a file simultaneously; a speedup if you have idle workers.
  • More startup overhead: scheduling work, starting processing, committing tasks.
  • Creates more files from the output, unless you repartition.

Small files vs large files

Small files:

  • you get that small split whether or not you want it,
  • even if you use unsplittable compression.
  • takes longer to list files; listing directory trees on S3 is very slow.
  • impossible to ask for larger block sizes than the file length.
  • easier to save if your s3 client doesn't do incremental writes in blocks (Hadoop 2.8+ does if you set spark.hadoop.fs.s3a.fast.upload to true; see the sketch after this list).
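
A minimal sketch of enabling that incremental upload path on Hadoop 2.8+; the application name is illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("daily-parquet-to-s3")                        // illustrative name
  .config("spark.hadoop.fs.s3a.fast.upload", "true")     // incremental block uploads (Hadoop 2.8+)
  .getOrCreate()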

Personally, and this is opinion and somewhat benchmark driven (but not with your queries):

Writing

  • save to larger files.
  • with snappy.
  • prefer shallower + wider directory trees over deep and narrow ones (see the sketch after this list).
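
Put together, the writing advice might look roughly like this sketch; the repartition count, the "date" partition column, and outputPath are assumptions standing in for your own values:

dataset
  .repartition(4)                          // fewer, larger output files
  .write
  .mode(SaveMode.Append)
  .option("compression", "snappy")         // snappy is also Spark's default Parquet codec
  .partitionBy("date")                     // one shallow level rather than year/month/day
  .parquet(outputPath)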

Reading

  • play with different block sizes; treat 32-64 MB as a minimum.
  • On Hadoop 3.1, use the zero-rename committers. Otherwise, switch to the v2 commit algorithm.
  • if your FS connector supports it, make sure random IO is turned on (Hadoop 2.8+: set spark.hadoop.fs.s3a.experimental.fadvise to random); see the config sketch after this list.
  • save to larger files via .repartition().
  • Keep an eye on how much data you are collecting, as it is very easy to run up large bills from storing lots of old data.
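
A combined sketch of the read-side settings; the values are assumptions, and the v2 committer line is only the fallback mentioned above (the Hadoop 3.1 zero-rename committers need their own setup):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")   // random IO for columnar reads (Hadoop 2.8+)
  .config("spark.hadoop.fs.s3a.block.size", 64L * 1024 * 1024)    // treat 32-64 MB as a minimum split
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")  // v2 commit algorithm fallback
  .getOrCreate()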

See also Improving Spark Performance with S3/ADLS/WASB.