
I am distributing some download tasks on a Spark cluster. The input comes from a source that cannot always be parallelized with Spark's usual methods like parallelize or textFile. Instead, I have a service providing me a bunch of download tasks (a URL plus the encapsulated logic to read and decrypt it) that I distribute using parallelize.

When there are a few thousand tasks, Spark distributes them equally across all the slaves, enabling the maximum level of parallelism. However, when there are only a few hundred tasks, Spark decides the dataset is tiny and can be computed on just a few slaves, to reduce communication time and increase data locality. But this is wrong in my case: each task can produce thousands of JSON records, and I want the downloads to be performed by as many machines as I have in my cluster.

I have two ideas for the moment:

  • using repartition to set the number of partitions to the number of cores
  • using repartition to set the number of partitions to the number of download tasks

I don't like the first one because I would have to pass the number of cores into a piece of code that currently does not need to know how many resources it has. I have only one Spark job running at a time, but in the future I could have more, so I would actually have to pass the number of cores divided by the number of parallel jobs I want to run on the cluster. I don't like the second one either, because partitioning into thousands of partitions when I have only 40 nodes seems awkward. A rough sketch of both options follows.
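To make the two options concrete, this is roughly what I mean (assuming `sc` is the SparkContext, `tasks` is the collection returned by the download service, and `numCores` is the core count I would have to pass in from outside):

```scala
val rdd = sc.parallelize(tasks)

// Option 1: one partition per core (requires knowing the cluster's core count)
val byCores = rdd.repartition(numCores)

// Option 2: one partition per download task
val byTasks = rdd.repartition(tasks.size)
```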

Is there a way to tell Spark to distribute the elements of an RDD as much as possible? If not, which of the two options is preferable?

You say you cannot use parallelize and also that you use parallelize. Do I understand that correctly? :) – Daniel Darabos
Ah, I think I understand! You mean you don't have the data up-front, just the URLs. So you cannot distribute the data by parallelize; instead you distribute the URLs using parallelize. Don't mind me... – Daniel Darabos
@DanielDarabos you got it right – Dici

1 Answer


If each download can produce a lot of records, and you don't have a lot of downloads (just a few thousand), I would recommend creating one partition per download.

The total overhead of scheduling a few thousand tasks is just a few seconds. We routinely have tens of thousands of partitions in production.

If you had several downloads in one partition, you could end up with very large partitions. If a partition cannot fit into the available memory twice over in its entirety, you are going to have issues with some operations: join and distinct, for example, build hash tables of the entire partition.


You shouldn't need to use repartition at all. parallelize takes a second parameter: the number of partitions you want. The list of URLs is not a large amount of data, but it's still better to create the RDD with the right number of partitions to begin with instead of shuffling it after creation.
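A minimal sketch of that idea, where `DownloadTask` and its `run()` method are hypothetical stand-ins for the encapsulated download-and-decrypt logic described in the question:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical task: a URL plus the logic to fetch and decrypt it.
// run() stands in for whatever actually performs the download and
// yields the JSON records; here it just returns a placeholder.
case class DownloadTask(url: String) {
  def run(): Seq[String] = Seq(s"""{"source":"$url"}""")
}

val sc = new SparkContext(new SparkConf().setAppName("downloads"))

// Provided by the download service in the question; hardcoded here for illustration.
val tasks: Seq[DownloadTask] = Seq(DownloadTask("http://example.com/a"))

// Passing tasks.size as the second argument to parallelize creates one
// partition per task, so every download can be scheduled independently
// and the work spreads over all available cores.
val records = sc.parallelize(tasks, tasks.size).flatMap(_.run())
```

With one task per partition there is nothing to repartition later, and the scheduler is free to place each download on any idle executor.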