4 votes

I need to read data from binary files. The files are small, on the order of 1 MB each, so it's probably not efficient to use binaryFiles() and process them file by file (too much overhead).

I could join them into one big file and then use binaryRecords(), but the record size is only 512 bytes, so I'd like to concatenate several records together to produce chunks of tens of megabytes. The binary file format allows this.
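To be concrete, this is roughly the call I have in mind once the files are concatenated (the path and the number of records per chunk are placeholders, not something I have tested):

val chunkSize = 512 * 65536  // 65536 records of 512 bytes = 32 MB per chunk; placeholder value
val chunks = sc.binaryRecords("hdfs:///path/to/concatenated.bin", chunkSize)  // RDD[Array[Byte]], one chunk per element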

How can I achieve this? More generally: is this the right approach to the problem?

Thanks!


2 Answers

1 vote

As of Spark 2.1, binaryFiles() will coalesce multiple small input files into a partition (default is 128 MB per partition), so using binaryFiles() to read small files should be much more efficient now.

See also https://stackoverflow.com/a/51460293/215945 for more details about binaryFiles() and how to adjust the default 128 MB size (if desired).
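A minimal sketch of what that looks like, assuming the setup described in the linked answer (the 64 MB value and the path are placeholders, and the config key is the one discussed there, so double-check it against your Spark version):

// Pack many small files into ~64 MB partitions instead of the default 128 MB.
val conf = new org.apache.spark.SparkConf()
  .setAppName("read-small-binary-files")
  .set("spark.files.maxPartitionBytes", (64 * 1024 * 1024).toString)
val sc = new org.apache.spark.SparkContext(conf)
val files = sc.binaryFiles("hdfs:///path/to/files/*.bin")  // RDD[(path, PortableDataStream)]
val payloads = files.map { case (_, stream) => stream.toArray() }  // raw bytes of each small file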

-3 votes

I'm not sure, but something like this might help: parallelize the list of your file paths across N partitions (N being the number of your small files) and read each file inside mapPartitions. Note that binaryFiles() itself cannot be called there, because the SparkContext only exists on the driver, so the files are read with plain Java I/O, which assumes they sit on a filesystem visible to every executor:

val data = sc.parallelize(fileNames, N)  // fileNames: the paths of your N small files, one per partition
  .mapPartitions(_.map(p => java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(p))))