Spark 2.4+: the problem should be fixed; see @Rahul's comment below this answer.
Spark 2.1-2.3: the minPartitions argument of binaryFiles() is ignored. See SPARK-16575 and the commit changes to the function setMinPartitions(). Notice in the commit changes how minPartitions is no longer used anywhere in the function!
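To illustrate (the path and numbers below are hypothetical), on these versions the hint is simply dropped:

    # Spark 2.1-2.3: the minPartitions hint below is silently ignored (SPARK-16575)
    rdd = sc.binaryFiles("hdfs:///data/binaries/*", minPartitions=100)
    print(rdd.getNumPartitions())  # typically far fewer than 100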
If you are reading multiple binary files with binaryFiles(), the input files will be coalesced into partitions based on the following:
- spark.files.maxPartitionBytes, default 128 MB
- spark.files.openCostInBytes, default 4 MB
- spark.default.parallelism
- the total size of your input
The first three config items are described here; see the commit change above for the actual calculation.
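Based on my reading of that commit, the split size is computed roughly as follows. This is a sketch in plain Python rather than Spark's actual Scala code; the defaults match the list above:

    # Sketch of the Spark 2.1-2.3 split-size calculation in setMinPartitions()
    def max_split_size(total_input_bytes, num_files,
                       max_partition_bytes=128 * 1024 * 1024,  # spark.files.maxPartitionBytes
                       open_cost_in_bytes=4 * 1024 * 1024,     # spark.files.openCostInBytes
                       default_parallelism=8):                 # spark.default.parallelism
        # Each file is padded by the open cost before the sizes are totaled
        total_bytes = total_input_bytes + num_files * open_cost_in_bytes
        bytes_per_core = total_bytes // default_parallelism
        return min(max_partition_bytes, max(open_cost_in_bytes, bytes_per_core))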
I had a scenario where I wanted a maximum of 40 MB per input partition, and hence 40 MB per task, to increase parallelism while parsing. (Spark was putting 128 MB into each partition, slowing down my app.) I set spark.files.maxPartitionBytes to 40 MB before calling binaryFiles():
    from pyspark.sql import SparkSession

    spark = SparkSession \
        .builder \
        .config("spark.files.maxPartitionBytes", 40*1024*1024) \
        .getOrCreate()
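With that setting in place, a subsequent binaryFiles() call (hypothetical path below) coalesces the input into partitions of at most ~40 MB:

    rdd = spark.sparkContext.binaryFiles("hdfs:///data/binaries/*")
    print(rdd.getNumPartitions())  # more partitions than with the 128 MB default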
For only one input file, @user9864979's answer is correct: a single file cannot be split into multiple partitions using just binaryFiles().
When reading multiple files with Spark 1.6, the minPartitions argument does work, and you have to use it. If you don't, you'll hit the SPARK-16575 problem: all of your input files will be read into only two partitions!
You will find that Spark will normally give you fewer input partitions than you request. I had a scenario where I wanted one input partition for every two input binary files. I found that setting minPartitions to "the # of input files * 7 / 10" gave me roughly what I wanted.
I had another scenario where I wanted one input partition for each input file. I found that setting minPartitions to "the # of input files * 2" gave me what I wanted.
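Putting those two heuristics together (the file count and path below are hypothetical):

    num_files = 1000  # hypothetical count of input binary files

    # Roughly one input partition per two input files:
    rdd = sc.binaryFiles("hdfs:///data/binaries/*", minPartitions=num_files * 7 // 10)

    # Roughly one input partition per input file:
    rdd = sc.binaryFiles("hdfs:///data/binaries/*", minPartitions=num_files * 2)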
Spark 1.5 behavior of binaryFiles(): you get one partition for each input file.