Approach 1: My input data is a bunch of JSON files. After preprocessing, the output is a pandas dataframe, which is written to an Azure SQL Database table.
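For reference, a minimal sketch of Approach 1, assuming a SQLAlchemy engine over pyodbc; the connection string, table name, and dataframe contents are placeholders:

```python
import pandas as pd
from sqlalchemy import create_engine

# Placeholder Azure SQL connection string (assumes ODBC Driver 17 is installed)
engine = create_engine(
    "mssql+pyodbc://user:password@myserver.database.windows.net/mydb"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Stand-in for the output of the JSON preprocessing step
df = pd.DataFrame({"id": [1, 2], "value": ["a", "b"]})

# Bulk-append the preprocessed dataframe into the target table
df.to_sql("my_table", engine, if_exists="append", index=False,
          method="multi", chunksize=1000)
```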
Approach 2: I implemented a delta lake setup where the output pandas dataframe is converted to a Spark dataframe and then inserted into a partitioned Delta table. The process is simple, and converting the pandas dataframe to a Spark dataframe takes only milliseconds. However, the performance is worse than Approach 1: with Approach 1, I finish in less than half the time Approach 2 requires.
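A minimal sketch of Approach 2, assuming a Delta-enabled Spark session; the table name and the `event_date` partition column are hypothetical:

```python
import pandas as pd
from pyspark.sql import SparkSession

# Assumes a Delta-enabled session (Databricks, or delta-spark configured locally)
spark = SparkSession.builder.getOrCreate()

# Stand-in for the preprocessed pandas output
pdf = pd.DataFrame({"id": [1, 2],
                    "event_date": ["2023-01-01", "2023-01-02"],
                    "value": ["a", "b"]})

# pandas -> Spark conversion (fast for small frames)
sdf = spark.createDataFrame(pdf)

# Append into a partitioned Delta table
(sdf.write
    .format("delta")
    .mode("append")
    .partitionBy("event_date")
    .saveAsTable("my_delta_table"))
```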
I tried different optimization techniques such as ZORDER, compaction (bin-packing), and using insertInto rather than saveAsTable, but none of them really improved the performance.
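Roughly, those tuning attempts looked like the sketch below, continuing from the previous snippet (same `spark` session and `sdf`); the table and column names are placeholders, and `OPTIMIZE ... ZORDER BY` requires Databricks or Delta Lake 2.0+:

```python
# Bin-packing compaction plus Z-ordering on a frequently filtered column
spark.sql("OPTIMIZE my_delta_table ZORDER BY (id)")

# insertInto appends by column position into the already-registered table,
# avoiding the metastore work that saveAsTable repeats on every write
sdf.write.mode("append").insertInto("my_delta_table")
```

Note that insertInto matches columns by position rather than by name, so the dataframe's column order has to match the table definition.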
Please let me know if I have missed any performance-tuning methods. If there are none, I am curious why Delta Lake did not outperform the pandas + database approach. I am also happy to hear about other, better approaches; for example, I came across Dask.
Many thanks in advance for your answers.
Regards, Chaitanya