Actually, you might not need 1000 jobs to read 1000 files from HDFS. You could try loading everything into a single RDD instead (the APIs do support reading multiple files and wildcards in paths). Once all the files are in a single RDD, focus on making sure the job has enough memory, cores, etc. assigned to it, and review your business logic to avoid costly operations like shuffles.
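As a rough sketch of what I mean (the path and app name here are placeholders, and I'm assuming plain text files; adjust to your layout):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical app name and input path -- adjust to your cluster and data.
val conf = new SparkConf().setAppName("read-many-files")
val sc = new SparkContext(conf)

// textFile accepts directories, comma-separated paths and wildcards,
// so one call can cover all 1000 files in a single RDD.
val all = sc.textFile("hdfs:///data/input/*.log")

// One job over all files, instead of 1000 separate jobs.
println(s"total lines: ${all.count()}")
```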
But if you insist that you need to spawn 1000 jobs, one for each file, you should look at --executor-memory and --executor-cores (along with --num-executors for parallelism). These give you leverage to optimise the memory/CPU footprint.
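The same knobs can also be set programmatically through the standard Spark config keys that back those flags; a minimal sketch, with placeholder values you would tune to your files and cluster:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Placeholder sizes -- tune to your data and cluster.
// These keys correspond to the spark-submit flags:
//   spark.executor.memory    <-> --executor-memory
//   spark.executor.cores     <-> --executor-cores
//   spark.executor.instances <-> --num-executors (on YARN)
val conf = new SparkConf()
  .setAppName("per-file-job")
  .set("spark.executor.memory", "4g")
  .set("spark.executor.cores", "2")
  .set("spark.executor.instances", "10")

val sc = new SparkContext(conf)
```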
Also, I'm curious that you say you get the OOM during spark-submit (i.e. in driver memory). The driver doesn't normally use much memory at all, unless you do things like collect or take on a large dataset, which bring data from the executors back to the driver.
Are you also firing the jobs in yarn-client mode? Another hunch: check whether the box from which you spawn the Spark jobs even has enough memory to spawn them in the first place.
It would also help if you could paste some logs here.