1
votes

I am new to Azure Databricks. I have two input files and a Python AI model. I clean the input files and apply the AI model to them to get final probabilities. Reading the files, loading the model, cleaning and preprocessing the data, and displaying the output with probabilities takes me only a few minutes.

But when I try to write the result to a table or a Parquet file, it takes more than 4-5 hours. I have tried various combinations of repartition/partitionBy/saveAsTable, but none of them is fast enough.
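Roughly, the writes I have been attempting look like the following sketch (the partition count, partition column, table name, and output path are placeholders, not my actual values):

    # Illustrative sketch only: partition count, column, table, and path are placeholders.
    # result_df is the three-column, ~120 million row output DataFrame.

    # Attempt 1: repartition, then write to Parquet
    result_df.repartition(200).write.mode("overwrite").parquet("/mnt/output/probabilities")

    # Attempt 2: write to Parquet partitioned by a column
    result_df.write.mode("overwrite").partitionBy("some_column").parquet("/mnt/output/probabilities")

    # Attempt 3: save as a managed table
    result_df.write.mode("overwrite").saveAsTable("output_db.probabilities")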

My output Spark DataFrame has three columns and 120,000,000 rows. My shared cluster is a 9-node cluster with 56 GB of memory per node.

My doubts are:

  1. Is slow write performance like this expected behavior in Azure Databricks?
  2. Is it true that we can't tune Spark configurations in Azure Databricks, and that Azure Databricks tunes itself based on the available memory?


2 Answers

0
votes

Performance depends on multiple factors. To investigate further, could you please share the details below (a quick way to gather some of them is sketched after the list):

  • What is the size of the data?

  • What is the worker type and its size?

  • Can you share the code you are running?
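For reference, the data size and partition count can be checked from a notebook like this, assuming df is the result DataFrame (the name is illustrative):

    # Basic size and partition diagnostics for the result DataFrame.
    print("rows:", df.count())
    print("columns:", len(df.columns))
    print("partitions:", df.rdd.getNumPartitions())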

I would suggest you go through the articles below, which help improve performance:

0
votes
  1. I have used Azure Databricks and written data to Azure Storage, and it has been fast.
  2. Databricks on Azure is hosted in the same way as on AWS, so all Spark configurations can be set (see the sketch below).
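For example, runtime-settable configurations can be changed from a notebook session; the value below is only illustrative, not a recommendation:

    # Adjust a Spark SQL setting for the current session (spark is the
    # SparkSession predefined in Databricks notebooks); the value is illustrative.
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    print(spark.conf.get("spark.sql.shuffle.partitions"))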

As Pradeep asked, what are the data size and the number of partitions? You can get the partition count using df.rdd.getNumPartitions(). Have you tried a repartition before the write? Thanks.
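As a minimal sketch, assuming df is the result DataFrame and treating the partition count and output path as placeholders:

    # Repartition before writing; the partition count and path are placeholders.
    df.repartition(200).write.mode("overwrite").parquet("/mnt/<storage-account>/output/")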