I am new to Azure Databricks. I have two input files and a Python AI model; I clean the input files and apply the model to them to get final probabilities. Reading the files, loading the model, cleaning and preprocessing the data, and displaying the output with probabilities takes only a few minutes.
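For context, this is roughly the shape of my pipeline. The paths, column names, and the scoring logic inside the UDF are placeholders, not my real code (in a Databricks notebook `spark` is already defined):

```python
import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Stand-in for my real model: anything with a predict_proba-style API
# (e.g. a scikit-learn model loaded from disk) would slot in here.
@F.pandas_udf(DoubleType())
def predict_proba_udf(features: pd.Series) -> pd.Series:
    # placeholder scoring logic; my actual model replaces this line
    return features.astype(float)

# Read the two input files (paths are placeholders)
df1 = spark.read.parquet("/mnt/data/input1.parquet")
df2 = spark.read.parquet("/mnt/data/input2.parquet")

# Clean and join the inputs (illustrative only)
joined = df1.join(df2, on="id").dropna()

# Score the rows; up to this point everything finishes in a few minutes
scored = joined.withColumn("probability", predict_proba_udf(F.col("feature")))
scored.show(5)
```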
But when I try to write the result to a table or a Parquet file, it takes more than 4-5 hours. I have tried various combinations of repartition / partitionBy / saveAsTable, but none of them is fast enough.
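These are the kinds of write calls I have tried; the output path, table name, and partition column are placeholders:

```python
# Attempt 1: write straight to Parquet (path is a placeholder)
scored.write.mode("overwrite").parquet("/mnt/output/probabilities")

# Attempt 2: repartition first to control the number of output files
scored.repartition(200).write.mode("overwrite").parquet("/mnt/output/probabilities")

# Attempt 3: save as a managed table, partitioned by a column
(scored.repartition(200)
       .write.mode("overwrite")
       .partitionBy("some_column")   # placeholder partition column
       .saveAsTable("my_db.probabilities"))
```

Each of these variants still runs for 4-5 hours.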
My output Spark DataFrame has three columns and 120,000,000 rows. I am on a shared 9-node cluster with 56 GB of memory per node.
My doubts are:

1. Is this expected behavior in Azure Databricks, i.e. are writes inherently this slow?
2. Is it true that we can't tune Spark configurations in Azure Databricks, and that Databricks tunes itself based on the available memory?
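For question 2, this is what I mean by tuning: I would expect to be able to set values like these myself (the numbers are examples only, not recommendations):

```python
# What I would expect "tuning" to look like in a notebook cell
spark.conf.set("spark.sql.shuffle.partitions", "400")
spark.conf.set("spark.sql.adaptive.enabled", "true")
```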