I am distributing some download tasks on a Spark cluster. The input comes from a source which cannot always be parallelized with Spark's normal methods like `parallelize` or `textFile`. Instead, I have one service providing me a bunch of download tasks (URL + encapsulated logic to read and decrypt it) that I distribute using `parallelize`.
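For context, here is a minimal sketch of what this currently looks like, assuming an existing `SparkContext`; `DownloadTask`, `fetchTasksFromService`, and the output path are hypothetical stand-ins, not my actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical stand-in for "URL + encapsulated logic to read and decrypt it".
case class DownloadTask(url: String) {
  def fetchAndDecrypt(): Seq[String] = ??? // real download/decrypt logic goes here
}

object Downloads {
  def fetchTasksFromService(): Seq[DownloadTask] = ??? // hypothetical service call

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("downloads"))

    // The tasks come from the external service, not from textFile or a similar source.
    val tasks: Seq[DownloadTask] = fetchTasksFromService()

    // Distribute the tasks themselves; each one expands into many JSON records.
    val records = sc.parallelize(tasks).flatMap(_.fetchAndDecrypt())
    records.saveAsTextFile(args(0))
  }
}
```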
When there are a few thousand tasks, Spark distributes them equally across all the slaves, enabling the maximum level of parallelism. However, when there are only a few hundred tasks, Spark thinks the dataset is tiny and can be computed on just a few slaves to reduce communication time and increase data locality. But this is wrong in my case: each task can produce thousands of JSON records, and I want the downloads to be performed by as many machines as I have in my cluster.
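A quick way to observe this, assuming the `sc` and `tasks` from the sketch above (`getNumPartitions` and `mapPartitions` are standard RDD methods):

```scala
val rdd = sc.parallelize(tasks)
println(s"partitions: ${rdd.getNumPartitions}")

// Count the elements per partition to see how the few hundred tasks are spread out.
rdd.mapPartitions(it => Iterator(it.size)).collect().foreach(println)
```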
I have two ideas for the moment (see the sketch after this list):
- using `repartition` to set the number of partitions to the number of cores
- using `repartition` to set the number of partitions to the number of download tasks
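A sketch of the two options, assuming the same `sc` and `tasks`; `numCoresInCluster` is a value I would have to pass in from outside:

```scala
// Option 1: one partition per core in the cluster.
val byCores = sc.parallelize(tasks).repartition(numCoresInCluster)

// Option 2: one partition per download task.
val byTasks = sc.parallelize(tasks).repartition(tasks.size)
```

In both cases the slice count could also be given directly as the second argument of `parallelize` (e.g. `sc.parallelize(tasks, tasks.size)`), which avoids the extra shuffle that `repartition` triggers.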
I don't like the first one because I would have to pass the number of cores into a piece of code which currently does not need to know how many resources it has. I have only one Spark job running at a time, but in the future I could have more, so I would actually have to pass the number of cores divided by the number of parallel jobs I want to run on the cluster. I don't like the second one either, because partitioning into thousands of partitions when I have only 40 nodes seems awkward.
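Concretely, option 1 would force something like the following, where `numCoresInCluster` and `numParallelJobs` are deployment details (hypothetical names) that the job code should not have to know about:

```scala
// Deployment knowledge leaking into the job code, which is what I want to avoid.
val partitionsPerJob = numCoresInCluster / numParallelJobs
val rdd = sc.parallelize(tasks).repartition(partitionsPerJob)
```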
Is there a way to tell Spark to spread the elements of an RDD across as many machines as possible? If not, which of the two options is preferable?
Comments:
- "... and also that you use `parallelize`. Do I understand that correctly? :)" – Daniel Darabos
- "... instead you distribute the URLs using `parallelize`. Don't mind me..." – Daniel Darabos