I'm using Azure Databricks to run R code parallelized across several workers with SparkR and `gapplyCollect()`.
Project overview
- I have 10000+ similar data sources generating a lot of transactional information that must be analyzed on a daily basis;
- I have an R function that analyzes all the information from one data source at a time, giving me valuable insights about that specific source;
- So, every day I need to execute my R function 10000+ times to analyze all my data.
Code logic
- Read all the data (from a relational DB) as a `SparkDataFrame`.
- `groupBy()` the `SparkDataFrame` on my data source column (data is evenly distributed across values of that column).
- Use `gapplyCollect()` on the resulting `GroupedData` to apply my R function to each data partition:
  - The result of each execution is a small R `data.frame` with a few rows (dim == (5,5)).
  - All of the results are combined by `gapplyCollect()`, generating a small R `data.frame` (<100k numerical rows) that consolidates all the results.
- Persist the result on my DBFS.
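For reference, here's roughly what that pipeline looks like in SparkR. This is only a minimal sketch: the JDBC URL, table name, column names, and the body of the analysis function are placeholders, not my real code.

```r
library(SparkR)
sparkR.session()

# Read the transactional data from the relational DB (placeholder connection).
df <- read.jdbc(url = "jdbc:sqlserver://<host>;databaseName=<db>",
                tableName = "transactions",
                user = "<user>", password = "<password>")

# Group by the data source column and apply the R function to each group.
results <- gapplyCollect(
  groupBy(df, "data_source_id"),
  function(key, x) {
    # Per-source analysis returning a small (~5x5) data.frame.
    data.frame(source = key[[1]],
               metric = paste0("m", 1:5),
               value  = rnorm(5))
  }
)

# Persist the consolidated results on DBFS.
write.csv(results, "/dbfs/mnt/results/daily_insights.csv", row.names = FALSE)
```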
My issue
In my dev environment I'm experimenting with parallelizing the analysis of 250 data sources, and I'm using 4 small workers for that (VM type: Standard_F4s).
`gapplyCollect()` is sending my R function to the workers, but is it possible to customize the maximum number of active tasks executed per worker? By default, I'm seeing that Databricks allows 5 active tasks per worker.
- For example, how can I allow 8 tasks to be executed in parallel on each worker? Is `spark-submit` suitable for this?
I've never used `spark-submit`, and I couldn't find good documentation for using it on Azure Databricks.
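If I understand Spark's scheduling correctly, the number of concurrent tasks per worker is roughly `spark.executor.cores / spark.task.cpus`. Here's how I'm checking those values from a SparkR notebook (a minimal sketch; the second argument is just the default `sparkR.conf()` returns when the key is unset):

```r
library(SparkR)
sparkR.session()

# Cores each executor may use for running tasks ("unset" if not configured).
sparkR.conf("spark.executor.cores", "unset")

# Cores claimed by each task (Spark's default is 1).
sparkR.conf("spark.task.cpus", "1")
```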
I'll bring this code to production using a daily scheduled job.

- In the job, can I use, for example, the `spark-submit` feature to change the `--executor-cores` option?
- If yes, how do I guarantee that I'm installing the `forecast` CRAN library on my job's driver + all workers, since Azure Databricks doesn't allow me to define libraries in the GUI when using `spark-submit`?
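In case it helps, the workaround I'm currently considering is installing the package lazily inside the function that `gapplyCollect()` ships to the workers. This is only a sketch under the assumption that the workers have network access to CRAN; `analyze_one_source` and the column names are illustrative, not my real code.

```r
# Workaround sketch: install 'forecast' on a worker the first time the
# function runs there. Assumes workers can reach CRAN.
analyze_one_source <- function(key, df) {
  if (!requireNamespace("forecast", quietly = TRUE)) {
    install.packages("forecast", repos = "https://cloud.r-project.org")
  }
  fit <- forecast::auto.arima(df$daily_total)  # illustrative per-source model
  fc  <- forecast::forecast(fit, h = 5)
  data.frame(source   = key[[1]],
             horizon  = 1:5,
             forecast = as.numeric(fc$mean))
}
```

I'd rather avoid paying the install cost on every cluster start, though, so a supported way to declare the library for the job would be preferable.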