
I'm using Azure Databricks to deploy some R code, parallelized across several workers using SparkR and gapplyCollect().


Project overview

  • I have 10000+ similar data sources generating a lot of transactional information to be analyzed on a daily basis;
  • I have an R function that analyzes all the information of 1 data source at a time, giving me some valuable insights about that specific data source;
  • So, every day I need to execute my R function 10000+ times for analyzing all my data.

Code Logic

  1. Read all the data (from a relational DB) as a SparkDataFrame
  2. groupBy() the SparkDataFrame on my data source column (the data is evenly distributed across data sources)
  3. Use gapplyCollect() on the GroupedData result of the previous step to apply my R function to each group (see the sketch after this list).
    • The result of each execution is a small R data.frame with a few rows (dim == (5,5)).
    • gapplyCollect() then binds all of these results together, producing a small R data.frame (<100k numeric rows) that consolidates everything.
  4. Persist the result on my DBFS.
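
Here's a minimal sketch of that logic in SparkR, assuming a hypothetical transactions table keyed by a source_id column; the JDBC URL, credentials, DBFS path and the analyze_source() function are placeholders for my real ones:

    library(SparkR)
    sparkR.session()

    # 1. Read all the data from the relational DB as a SparkDataFrame
    #    (URL, table name and credentials are placeholders)
    df <- read.jdbc(url = "jdbc:sqlserver://myserver:1433;database=mydb",
                    tableName = "transactions",
                    user = "my_user", password = "my_password")

    # 2. Group by the data source column
    grouped <- groupBy(df, "source_id")

    # 3. Apply the R function to each group and collect the combined result
    #    back to the driver as a single R data.frame
    result <- gapplyCollect(grouped, function(key, x) {
      # x holds all rows for one data source; analyze_source() is a
      # placeholder for the analysis that returns a 5 x 5 data.frame
      out <- analyze_source(x)
      out$source_id <- key[[1]]
      out
    })

    # 4. Persist the consolidated result (<100k rows) on DBFS
    write.csv(result, "/dbfs/mnt/results/daily_analysis.csv", row.names = FALSE)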

My issue

In my dev environment I'm experimenting with parallelizing the analysis of 250 data sources, using 4 small workers for that (VM type: Standard_F4s).

gapplyCollect() is sending my R function to the workers, but... is it possible to customize the maximum number of active tasks executed per worker? By default, I'm seeing that Databricks allows 5 active tasks per worker.

Azure Databricks maximum tasks per worker

  • For example: How can I allow 8 tasks to be executed in parallel on each worker? Is spark-submit suitable for this task?

I've never used spark-submit, and I haven't found good documentation for using it on Azure Databricks.
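
To make the question concrete, this is the kind of setting I'd like to apply to the cluster, sketched here as Spark config lines (whether these exact keys are the right lever on Databricks, and whether oversubscribing the 4 physical cores of a Standard_F4s actually helps, is part of what I'm asking; the values are illustrative):

    spark.executor.cores 8
    spark.task.cpus 1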

  • I'll bring this code to production as a daily scheduled job. In that job, can I use, for example, spark-submit to change the --executor-cores option?

  • If yes, how do I guarantee that the forecast CRAN library is installed on the job driver and all the workers, given that Azure Databricks doesn't let me define libraries in the GUI when using spark-submit?
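
For reference, what I have in mind for the scheduled job is roughly the following Jobs API payload: a notebook task on a new cluster, with the Spark config sketched above and the forecast package attached as a CRAN library. The field names follow my reading of the Databricks REST API, and the runtime version, notebook path and cron expression are placeholders; part of my question is whether this (rather than spark-submit) is the right way to get both the config and the library onto the driver and all workers:

    {
      "name": "daily-source-analysis",
      "new_cluster": {
        "spark_version": "5.5.x-scala2.11",
        "node_type_id": "Standard_F4s",
        "num_workers": 4,
        "spark_conf": {
          "spark.executor.cores": "8",
          "spark.task.cpus": "1"
        }
      },
      "libraries": [
        { "cran": { "package": "forecast" } }
      ],
      "notebook_task": {
        "notebook_path": "/Jobs/daily_source_analysis"
      },
      "schedule": {
        "quartz_cron_expression": "0 0 3 * * ?",
        "timezone_id": "UTC"
      }
    }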

Each core can only run one task at a time; have you tried increasing the number of cores of each executor? - Minh Thai
For such a workload you are going to need a bigger instance and to configure dynamic resource allocation databricks.com/session/… - sramalingam24
@MinhThai Yes, I've tried that; it increases the number of active tasks and speeds up the whole processing time. What I'm trying to optimize now is the tasks-per-core ratio: testing whether moving this ratio up or down can further improve execution time. - Renan V. Novas
@sramalingam24 As far as I understand, dynamic resource allocation wouldn't be a good option for this use case, since I want to create a cluster on demand, execute all my 10000+ computations and then terminate the cluster. The cluster will be created and terminated automatically by a scheduled pipeline. I don't have a variable-load job and I won't use a shared cluster. - Renan V. Novas

1 Answer


I accessed the Databricks Managed Resource Group, which contains all the internally created resources (VMs, disks and network interfaces).

There I checked the CPU consumption metrics for each of my workers. Here's the result for a cluster with 2 workers:

Databricks workers CPU usage metrics

Here's the same chart capturing the moment when the tasks finally ended:

Databricks workers CPU usage metrics 2

Based on these metrics we can see that:

  • Average CPU usage is 85~87%
  • Max CPU usage is 92~96%
  • Min CPU usage is 70~80%

These metrics are OK for my use case... But if anyone has any clues about how to use spark-submit with Databricks, please feel free to share a new answer here.