1
votes

I am new to Azure Databricks. I have two input files and a Python AI model. I clean the input files and apply the AI model to them to get final probabilities. Reading the files, loading the model, cleaning and preprocessing the data, and displaying the output with probabilities takes me only a few minutes.

But when I try to write the result to a table or a Parquet file, it takes more than 4-5 hours. I have tried various combinations of repartition/partitionBy/saveAsTable, but none of them is fast enough.
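Roughly, the writes I have been attempting look like the following sketch (the partition count, partition column, table name, and output path are placeholders, not my actual values):

    # Illustrative sketch only: partition count, column, table, and path are placeholders.
    # result_df is the three-column, ~120 million row output DataFrame.

    # Attempt 1: repartition, then write to Parquet
    result_df.repartition(200).write.mode("overwrite").parquet("/mnt/output/probabilities")

    # Attempt 2: write to Parquet partitioned by a column
    result_df.write.mode("overwrite").partitionBy("some_column").parquet("/mnt/output/probabilities")

    # Attempt 3: save as a managed table
    result_df.write.mode("overwrite").saveAsTable("output_db.probabilities")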

My output Spark DataFrame has three columns and 120,000,000 rows. My shared cluster is a 9-node cluster with 56 GB of memory per node.

My doubts are:

  1. Is slow write performance like this expected behavior in Azure Databricks?
  2. Is it true that we can't tune Spark configurations in Azure Databricks, and that Azure Databricks tunes itself based on the available memory?


2 Answers

0
votes

Performance depends on multiple factors. To investigate further, could you please share the details below (a quick way to gather some of them is sketched after the list):

  • What is the size of the data?

  • What is the worker type and its size?

  • Can you share the code you are running?
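For reference, the data size and partition count can be checked from a notebook like this, assuming df is the result DataFrame (the name is illustrative):

    # Basic size and partition diagnostics for the result DataFrame.
    print("rows:", df.count())
    print("columns:", len(df.columns))
    print("partitions:", df.rdd.getNumPartitions())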

I would suggest you go through the articles below, which help improve performance:

0
votes
  1. I have used Azure Databricks and written data to Azure Storage, and it has been fast.
  2. Databricks on Azure is hosted in the same way as on AWS, so all Spark configurations can be set (see the sketch below).
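For example, runtime-settable configurations can be changed from a notebook session; the value below is only illustrative, not a recommendation:

    # Adjust a Spark SQL setting for the current session (spark is the
    # SparkSession predefined in Databricks notebooks); the value is illustrative.
    spark.conf.set("spark.sql.shuffle.partitions", "400")
    print(spark.conf.get("spark.sql.shuffle.partitions"))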

As Pradeep asked, what are the data size and the number of partitions? You can get the partition count using df.rdd.getNumPartitions(). Have you tried a repartition before the write? Thanks.
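As a minimal sketch, assuming df is the result DataFrame and treating the partition count and output path as placeholders:

    # Repartition before writing; the partition count and path are placeholders.
    df.repartition(200).write.mode("overwrite").parquet("/mnt/<storage-account>/output/")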